GitHub - cwida/FastLanes: Next-Gen Big Data File Format
cwida · GitHub · 2026.01.04 · by 이호민
#Big Data #File Format #Compression #SIMD #Data Processing

Key Points

  • FastLanes is an open-source, next-generation big data file format designed for modern data-parallel execution, claiming 40% better compression and 40x faster decoding than Parquet.
  • It achieves high performance through SIMD-friendly, lightweight encodings, multi-column compression, and flexible partial decompression, while having zero external dependencies.
  • The project is actively developed with support for Python, C++, and Rust, and its underlying innovations are detailed in multiple academic publications since 2023.

FastLanes is an open-source, next-generation big data file format designed for modern data-parallel execution environments, specifically targeting SIMD and GPU architectures. It aims to improve on existing columnar formats such as Parquet, claiming 40% better compression and 40 times faster decoding. The core design goal is high decompression speed from portable scalar code: the data layout is SIMD-friendly by construction, so compilers can auto-vectorize the decoding loops without any explicit SIMD instructions.
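The layout idea can be illustrated with a toy bit-packing routine. This is a sketch of the general technique in plain Python, not FastLanes code; the actual format lays values out in fixed-size interleaved vectors in C++ precisely so that this kind of uniform shift-and-mask loop auto-vectorizes.

```python
def bit_pack(values, width):
    """Pack each integer into `width` bits of a little-endian byte buffer."""
    buf = 0
    for i, v in enumerate(values):
        buf |= (v & ((1 << width) - 1)) << (i * width)
    n_bytes = (len(values) * width + 7) // 8
    return buf.to_bytes(n_bytes, "little")

def bit_unpack(data, width, count):
    """Branch-free decode: every value uses the same shift/mask pattern,
    which is exactly the shape compilers can auto-vectorize."""
    buf = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(buf >> (i * width)) & mask for i in range(count)]

vals = [3, 7, 0, 5, 2, 6, 1, 4]
packed = bit_pack(vals, 3)          # 8 values * 3 bits = 3 bytes
assert bit_unpack(packed, 3, 8) == vals
```

Because the per-value work is identical and data-independent, the same loop written in C++ compiles to SIMD code without intrinsics.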

FastLanes distinguishes itself by moving away from general-purpose, block-based compression (e.g., Snappy) in favor of lightweight, data-parallel encodings. To achieve superior compression ratios, it employs a flexible *expression encoding mechanism* that allows multiple encodings to be cascaded. This mechanism is also what enables *multi-column compression (MCC)*, which exploits correlations between columns to improve compression effectiveness, a known weak spot of traditional columnar storage. A key technical contribution is a two-phase algorithm that discovers good encoding expressions during compression.
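As a toy illustration of cascading (not the actual FastLanes expression mechanism), a column can be dictionary-encoded and its integer codes then bit-packed, so two lightweight encodings compose into one expression:

```python
import math

def dict_encode(column):
    # First stage: replace values with small integer codes.
    dictionary = sorted(set(column))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in column]

def cascade_compress(column):
    # Second stage: the codes need only ceil(log2(|dict|)) bits each,
    # so a bit-packing encoding can be applied on top.
    dictionary, codes = dict_encode(column)
    width = max(1, math.ceil(math.log2(len(dictionary))))
    return dictionary, width, codes  # codes would be bit-packed at `width`

country = ["NL", "DE", "NL", "FR", "NL", "DE"]
dictionary, width, codes = cascade_compress(country)
assert dictionary == ["DE", "FR", "NL"]
assert width == 2        # six strings shrink to 2-bit codes + a 3-entry dict
```

A compressor searching over such cascades has to pick, per column, which chain of encodings pays off; that search is what the two-phase algorithm addresses.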

Beyond its compression scheme, FastLanes innovates in its application programming interface (API) by providing robust support for *partial decompression*. This feature allows data processing engines to execute queries directly on compressed data, minimizing the need for full materialization. The format is designed for fine-grained access, operating at the level of small batches rather than large row groups, which helps limit the decompression memory footprint to fit within CPU and GPU caches, thereby improving performance by reducing memory contention and leveraging cache locality.
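A generic sketch of why querying compressed data helps (assuming a dictionary-encoded column; this is not the FastLanes API): a filter predicate can be evaluated once per distinct value, after which the per-row scan is a cheap lookup over small integer codes, never materializing the full column.

```python
def filter_compressed(dictionary, codes, predicate):
    """Return matching row indices without decompressing the column.

    The (possibly expensive) predicate runs once per distinct value;
    the per-row work is a table lookup on the compact codes.
    """
    matches = [predicate(v) for v in dictionary]
    return [row for row, code in enumerate(codes) if matches[code]]

dictionary = ["Berlin", "Paris", "Rotterdam"]
codes = [2, 0, 2, 1, 2]             # five rows, dictionary-encoded
rows = filter_compressed(dictionary, codes, lambda c: c.startswith("R"))
assert rows == [0, 2, 4]
```

Working on small batches of codes that fit in cache, rather than whole row groups, is what keeps the decompression footprint inside CPU and GPU caches.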

The underlying implementation is in portable C++, which is optimized for auto-vectorization. FastLanes also provides high-level bindings for Python and Rust. Performance evaluations on real-world datasets indicate that FastLanes not only achieves better compression ratios than state-of-the-art formats like Parquet but also significantly accelerates decompression, positioning it as a "win-win" solution.

Further technical aspects and related research include:

  • Scalar Decoding Speed: Demonstrated capability to decode over 100 billion integers per second using scalar code.
  • GPU Acceleration: Specific designs and optimizations for accelerating data processing on GPUs, including lightweight encodings tailored for GPU architectures.
  • Floating-Point Compression: Development of Adaptive Lossless Floating-Point Compression (ALP), a technique for compressing floating-point data losslessly.
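The core idea behind ALP can be sketched as decimal scaling: a double is encoded as an integer when multiplying by a power of ten round-trips exactly, and kept as an exception value otherwise. This is a simplified illustration; the published ALP scheme adaptively selects a scaling exponent/factor pair per vector and further compresses the resulting integers.

```python
import math

def alp_encode(value, exponent):
    # Try to represent the double as the integer round(value * 10^exponent).
    scaled = round(value * 10**exponent)
    if scaled / 10**exponent == value:   # exact round-trip => lossless
        return scaled
    return None                          # exception: keep the raw double

def alp_decode(scaled, exponent):
    return scaled / 10**exponent

assert alp_encode(1.25, 2) == 125        # 1.25 -> integer 125
assert alp_decode(125, 2) == 1.25        # decoded bit-exactly
assert alp_encode(math.pi, 2) is None    # no short decimal form: exception
```

Real-world floats (prices, sensor readings) are overwhelmingly short decimals, so most values take the integer path and benefit from the same lightweight integer encodings used elsewhere in the format.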