GitHub - deepseek-ai/smallpond: A lightweight data processing framework built on DuckDB and 3FS.

deepseek-ai
2025.03.08
· GitHub · by Anonymous
#Data Processing #DuckDB #3FS #Framework #Python

Key Points

  • smallpond is a lightweight, high-performance data processing framework built on DuckDB and 3FS, designed for PB-scale datasets.
  • It is easy to operate, with no long-running services, and lets users process data, including repartitioning and SQL queries, through a simple API.
  • On the GraySort benchmark, the framework sorted 110.5 TiB of data in just over 30 minutes, an average throughput of 3.66 TiB/min.

smallpond is presented as a lightweight data processing framework designed for high-performance and scalability, capable of handling petabyte-scale datasets. It distinguishes itself by requiring no long-running services, simplifying operations. The framework is fundamentally built upon two core technologies: DuckDB for analytical processing and 3FS for distributed, scalable storage.

smallpond orchestrates data processing workflows by running DuckDB, an in-process columnar OLAP engine, over data residing in 3FS. Data is accessed and manipulated through a programmatic interface: after initializing a session, users read data, such as Parquet files, into a smallpond DataFrame object. For parallel and distributed processing, smallpond supports data repartitioning. For instance, the repartition(n_partitions, hash_by="column_name") method distributes rows across a specified number of partitions by hashing a designated column. This is a common strategy for preparing data for efficient group-by or join operations in a distributed setting, because all rows that share a key value land in the same partition.
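
The idea behind hash-based repartitioning can be sketched in a few lines of plain Python. This is an illustration of the general technique only, not smallpond's implementation: the function name hash_partition, the sample rows, and the choice of CRC32 as the hash function are all assumptions made for the example.

```python
import zlib


def hash_partition(rows, n_partitions, hash_by):
    """Assign each row to a partition by hashing the value of one column.

    Hypothetical helper for illustration; smallpond's internals may differ.
    """
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        key = str(row[hash_by]).encode("utf-8")
        # crc32 is deterministic across runs, unlike Python's built-in
        # hash() for strings, which is randomized per process.
        index = zlib.crc32(key) % n_partitions
        partitions[index].append(row)
    return partitions


# Made-up sample data.
rows = [
    {"ticker": "AAA", "price": 10.0},
    {"ticker": "BBB", "price": 7.0},
    {"ticker": "AAA", "price": 12.5},
    {"ticker": "CCC", "price": 3.0},
    {"ticker": "BBB", "price": 6.5},
]

partitions = hash_partition(rows, 3, "ticker")

for i, part in enumerate(partitions):
    print(i, [r["ticker"] for r in part])
```

Whatever the hash values turn out to be, every row with the same ticker ends up in the same partition, so a later group-by on ticker can run on each partition without exchanging data with the others.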

The computational work is performed mainly by executing SQL queries via partial_sql(). This method applies a SQL statement to a smallpond DataFrame, which likely means that DuckDB executes the query on each data partition independently. The {0} placeholder in the SQL string stands for the input table seen by the DuckDB instance processing a given partition. This design suggests a "shared-nothing" or "shared-disk" architecture in which DuckDB instances process their respective data segments, with 3FS providing the shared, high-throughput storage layer. After processing, results can be written back to a file system, again using 3FS for scalable output, or converted to in-memory pandas DataFrames for immediate inspection.
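
The per-partition execution model described above can be imitated with the Python standard library. The sketch below is an assumption-laden stand-in, not smallpond code: it uses sqlite3 in place of DuckDB, a hypothetical helper named run_on_partitions, a hypothetical table name part, and made-up data. It shows only how a {0} placeholder can be filled with a table name before the same query runs separately on each partition.

```python
import sqlite3


def run_on_partitions(query_template, partitions):
    """Run the same SQL on every partition and concatenate the results.

    Hypothetical helper: each partition gets its own in-memory database,
    mimicking independent engine instances that share no state.
    """
    results = []
    for rows in partitions:
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE part (ticker TEXT, price REAL)")
        conn.executemany("INSERT INTO part VALUES (?, ?)", rows)
        # Substitute the partition's table name for the {0} placeholder.
        query = query_template.format("part")
        results.extend(conn.execute(query).fetchall())
        conn.close()
    return results


# Made-up data, already partitioned so that each ticker lives in one partition.
partitions = [
    [("AAA", 10.0), ("AAA", 12.5)],
    [("BBB", 7.0), ("BBB", 6.5), ("CCC", 3.0)],
]

query = (
    "SELECT ticker, min(price), max(price) "
    "FROM {0} GROUP BY ticker ORDER BY ticker"
)
summary = run_on_partitions(query, partitions)
print(summary)
```

Because the sample data is partitioned by ticker, the per-partition aggregates are already final and need no merge step. If the same key could appear in several partitions, the partial results would have to be combined in a second pass.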

Performance is a key highlight, demonstrated through the GraySort benchmark. smallpond was evaluated on a cluster composed of 50 compute nodes and 25 storage nodes running 3FS. In this benchmark, the framework successfully sorted 110.5 TiB of data in 30 minutes and 14 seconds, achieving an average throughput of 3.66 TiB/min. This benchmark underscores smallpond's efficiency in large-scale data rearrangement and processing, attributed to the combination of DuckDB's query optimization and 3FS's distributed I/O capabilities.
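
The throughput figure can be checked with simple arithmetic. The short calculation below uses only the numbers quoted above. The per-compute-node rate is a derived illustration that assumes the load was spread evenly across the 50 compute nodes; it is not a figure reported by the benchmark.

```python
DATA_TIB = 110.5            # data sorted, as reported
ELAPSED_MIN = 30 + 14 / 60  # 30 minutes 14 seconds, in minutes
COMPUTE_NODES = 50          # compute nodes in the benchmark cluster

throughput_tib_per_min = DATA_TIB / ELAPSED_MIN

# Assumption: even load across compute nodes (not stated in the source).
per_node_gib_per_min = throughput_tib_per_min * 1024 / COMPUTE_NODES

print(f"{throughput_tib_per_min:.2f} TiB/min overall")
print(f"{per_node_gib_per_min:.1f} GiB/min per compute node")
```

From the rounded inputs this gives roughly 3.65 TiB/min, which matches the reported 3.66 TiB/min to within the rounding of the published data size and elapsed time.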