How fast are Linux pipes anyway?
Key Points
- This post details optimizing Linux pipe throughput from an initial 3.5 GiB/s to over 30 GiB/s by iteratively refining a test program.
- Initial performance bottlenecks were identified as excessive data copying between user and kernel space when using the `write` and `read` syscalls.
- Significant improvements were achieved by leveraging the `vmsplice` and `splice` syscalls to eliminate data copying, followed by an investigation into page-table-walking overhead, which huge pages are expected to address further.
The post details an iterative optimization process to improve the throughput of Linux pipes, aiming to match and exceed the performance of a highly optimized FizzBuzz program, which achieves approximately 35 GiB/s. The methodology involves profiling a test program with Linux's `perf` tooling to identify bottlenecks and applying kernel-level optimizations.
Initially, a baseline program using standard `write` and `read` syscalls achieves a throughput of about 3.5 GiB/s for 256 KiB blocks. `perf` analysis reveals that nearly 50% of the time is spent within `pipe_write`, with significant portions dedicated to `copy_page_from_iter` and `__alloc_pages`.
The internal structure of a Linux pipe is described as a ring buffer of `struct pipe_buffer` entries, each referencing a `struct page` (a 4 KiB physical memory page on x86-64), along with an `offset` and `len` within that page. A pipe has a default capacity of 16 pages (64 KiB), and `pipe_write` and `pipe_read` operate by:
- Copying data from user-space memory to kernel pages (`copy_page_from_iter`).
- Allocating new kernel pages (`__alloc_pages`) if existing ones are full or unsuitable.
- Managing the ring buffer (e.g., advancing the `head` and `tail` pointers).
- Acquiring and releasing a pipe-specific lock for synchronization.
The first major optimization introduces zero-copy mechanisms using the `vmsplice` and `splice` syscalls. `vmsplice` moves data from user memory to a pipe, while `splice` moves data between two file descriptors (e.g., from a pipe to `/dev/null` for reading), both without data copying. The `vmsplice` syscall takes an array of `struct iovec` (virtual memory buffers) and directly "splices" the underlying physical pages into the pipe's ring buffer.
The implementation details for vmsplice involve a double-buffering scheme:
- The 256 KiB output buffer is split into two 128 KiB halves.
- The pipe size is set to 128 KiB (32 pages).
- The program alternates writing to one half-buffer and calling `vmsplice` to move it into the pipe, ensuring that by the time the next half-buffer is ready, the previous one has been fully consumed by the reader (due to the pipe's bounded size), preventing race conditions on shared memory pages.
Replacing `write` with `vmsplice` for the writer improves throughput to 12.7 GiB/s. Subsequently, replacing `read` with `splice` for the reader eliminates all copying, boosting throughput to 32.8 GiB/s.
Further `perf` analysis after splicing identifies `iov_iter_get_pages` and `__mutex_lock` (pipe locking) as the next major bottlenecks within `vmsplice`. The post then delves into the Linux paging mechanism to explain `iov_iter_get_pages`. Processes use virtual memory addresses, which the CPU translates to physical addresses using a hierarchical page table (on x86-64, a 4-level tree consisting of the PGD, PUD, PMD, and PTE levels). Each node is 4 KiB, with 8-byte entries pointing to the next level. The `CR3` register points to the root of the current process's page table. `struct page` is the kernel's representation of a physical memory page, holding its address and metadata (such as reference counts).
`iov_iter_get_pages` converts virtual memory ranges (from `struct iovec`) into `struct page` references. Its core is `get_user_pages_fast`, which performs a software walk of the page table to find the physical pages backing a virtual memory range. For a 128 KiB buffer, 32 `struct page` entries are retrieved. `get_user_pages_fast` has a "fast path" if page table entries (PTEs) already exist (e.g., after a `memset` on the buffer, which faults in the pages) and a slower path for creating new PTEs or handling non-present pages. This function also increments reference counts on the `struct page`s to prevent premature deallocation. The time spent in this function is primarily due to the page table walk for each 4 KiB page and managing `struct page` reference counts.