ACEAPEX breaks LZ77 decoding bottleneck with 3.1x speedup over zstd

arXiv cs.DC June 04, 2026

Parallel LZ77 decoder hits 44 GB/s on GPU and 10,869 MB/s on CPU

Deep Dive

ACEAPEX, a parallel LZ77 codec from researcher Yakiv Shavidze, targets the sequential bottleneck in LZ77 decoding. Traditional LZ77 decoders resolve back-references one after another, because each reference may copy from data that an earlier reference produced, and this serial dependency prevents multi-core scaling. ACEAPEX addresses it by having the encoder convert every back-reference to an absolute position in the decompressed output and by organizing the data into self-contained 1 MB blocks. Blocks can then be decoded independently of one another, which makes block-level decoding embarrassingly parallel across any number of cores.
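The idea of absolute positions plus self-contained blocks can be illustrated with a toy decoder. This is a minimal sketch and not ACEAPEX's actual format or code: the token layout (`("lit", bytes)` and `("copy", abs_pos, length)`), the function names `decode_block` and `decode_parallel`, and the tiny block sizes are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def decode_block(block_start, tokens):
    """Decode one self-contained block (hypothetical token layout).

    tokens is a list of ("lit", bytes) or ("copy", abs_pos, length), where
    abs_pos is an absolute offset into the full decompressed output.
    """
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out += tok[1]
        else:
            _, abs_pos, length = tok
            src = abs_pos - block_start  # absolute offset -> block-local index
            if not 0 <= src < len(out):
                raise ValueError("reference leaves the block or points forward")
            for i in range(length):  # byte by byte, so overlapping copies work
                out.append(out[src + i])
    return bytes(out)


def decode_parallel(blocks, workers=4):
    """Decode (block_start, tokens) pairs independently, then join in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda b: decode_block(*b), blocks))
    return b"".join(parts)


# Toy input: two tiny blocks instead of 1 MB ones.
blocks = [
    (0, [("lit", b"abc"), ("copy", 0, 6)]),  # overlapping copy repeats "abc"
    (9, [("lit", b"xy"), ("copy", 9, 4)]),   # source given as absolute offset 9
]
result = decode_parallel(blocks)
print(result)
```

Because no block reads another block's output, the order in which workers finish does not matter. The thread pool here only demonstrates that independence; in CPython the global interpreter lock means it would not show the speedups quoted in the article.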

The reported benchmarks are strong. Integrated into lzbench, ACEAPEX achieves 10,160 MB/s on an AMD EPYC 4344P (8 cores) and 10,869 MB/s on an EPYC 9575F for FASTQ genomic data, up to 3.1x faster than zstd -3 at comparable compression ratios. On the GPU side, a wavefront decoder on an NVIDIA H100 SXM measures 44.0 GB/s on enwik9 and 20.3 GB/s on FASTQ, with bit-perfect output verified. With a depth-limited encoder variant that gives up only 1.5% of compression ratio on enwik9, GPU throughput reaches 77.2 GB/s on a single H100 and 249.9 GB/s on two H100s connected via NVLink. This is reported as the first GPU LZ77 decoder to achieve near-standard compression ratios with byte-for-byte verification.
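To put the throughput figures in perspective, they can be converted into decode times for a fixed amount of output. The sketch below is plain arithmetic on the rates quoted above. It assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and a hypothetical 1 GB decompressed payload, and it ignores that the figures were measured on different datasets.

```python
def decode_time_ms(size_bytes, rate_bytes_per_s):
    """Milliseconds needed to decode size_bytes at a sustained rate."""
    return 1000.0 * size_bytes / rate_bytes_per_s


MB = 10**6  # assumption: decimal megabytes
GB = 10**9  # assumption: decimal gigabytes

# Rates as quoted in the article.
quoted_rates = {
    "CPU, EPYC 9575F": 10_869 * MB,
    "GPU, one H100": 44.0 * GB,
    "GPU, one H100, depth-limited": 77.2 * GB,
    "GPU, two H100s, depth-limited": 249.9 * GB,
}

payload = 10**9  # hypothetical 1 GB of decompressed output
times_ms = {name: decode_time_ms(payload, rate) for name, rate in quoted_rates.items()}
for name, ms in times_ms.items():
    print(f"{name}: {ms:.1f} ms")
```

At these rates, 1 GB of output takes roughly 92 ms on the CPU configuration and about 4 ms on the two-GPU configuration.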

Key Points

ACEAPEX stores LZ77 back-references as absolute positions, enabling parallel decoding of self-contained 1 MB blocks
CPU performance reaches 10,869 MB/s on an EPYC 9575F, up to 3.1x faster than zstd -3 at similar ratios
GPU wavefront decoder hits 77.2 GB/s on a single H100 and 249.9 GB/s on two H100s via NVLink (depth-limited encoder)

Why It Matters

Breaks LZ77's sequential decode wall, unlocking GPU-scale decompression for large datasets like genomics and logs.
