ACEAPEX breaks LZ77 decoding bottleneck with 3.1x speedup over zstd
Parallel LZ77 decoder hits 44 GB/s on GPU and 10,869 MB/s on CPU
ACEAPEX, a novel parallel LZ77 codec from researcher Yakiv Shavidze, tackles the fundamental sequential bottleneck in LZ77 decoding. Traditional LZ77 decoders must resolve back-references in order, because each reference points at data that has to be decompressed first, and that dependency chain prevents multi-core scaling. ACEAPEX moves the work to the encoder: at encode time it rewrites every back-reference as an absolute position in the decompressed output and organizes the data into self-contained 1 MB blocks. Because no block depends on another, block-level decoding becomes embarrassingly parallel and can be spread across any number of cores.
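The idea can be illustrated with a toy decoder. This is a minimal sketch and not ACEAPEX's actual format or code: the token layout (a literal byte run, or an `(absolute_position, length)` copy), the names `decode_block` and `decode_parallel`, and the 8-byte demo block size are all assumptions made for illustration. It also assumes that every reference resolves inside its own block and that every block except the last decodes to exactly `block_size` bytes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple, Union

# Hypothetical token format: a literal byte run, or a copy given as
# (absolute position in the decompressed output, length).
Token = Union[bytes, Tuple[int, int]]

BLOCK_SIZE = 8  # tiny demo value; the article describes 1 MB blocks


def decode_block(block_start: int, tokens: List[Token]) -> bytes:
    """Decode one block using only that block's own output."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, bytes):
            out += tok  # literal run
            continue
        abs_pos, length = tok
        src = abs_pos - block_start  # absolute position -> block-local offset
        if not 0 <= src < len(out):
            raise ValueError("reference points outside this block")
        for i in range(length):
            # Byte-by-byte copy so overlapping (run-length style) copies work.
            out.append(out[src + i])
    return bytes(out)


def decode_parallel(blocks: List[List[Token]], block_size: int = BLOCK_SIZE) -> bytes:
    """Decode every block independently, then concatenate in block order."""
    starts = [i * block_size for i in range(len(blocks))]
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(decode_block, starts, blocks))
    return b"".join(parts)


demo_blocks: List[List[Token]] = [
    [b"abc", (0, 5)],  # block 0 covers output positions 0..7
    [b"xy", (8, 6)],   # block 1 covers output positions 8..15
]
print(decode_parallel(demo_blocks))
```

Each call to `decode_block` reads nothing but its own tokens and its own output buffer, so the blocks can be decoded in any order or all at once. Python threads are used here only to show that independence; they will not reproduce the throughput figures reported for the native implementation.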
The reported benchmarks are strong. Integrated into lzbench, ACEAPEX achieves 10,160 MB/s on an AMD EPYC 4344P (8 cores) and 10,869 MB/s on an EPYC 9575F for FASTQ genomic data, up to 3.1x faster than zstd -3 at comparable compression ratios. On the GPU side, a wavefront decoder on an NVIDIA H100 SXM measures 44.0 GB/s on enwik9 and 20.3 GB/s on FASTQ (bit-perfect verified). With a depth-limited encoder variant that gives up only 1.5% of compression ratio on enwik9, GPU throughput reaches 77.2 GB/s on a single H100 and 249.9 GB/s on two H100s connected via NVLink. This is the first reported GPU LZ77 decode to achieve a near-standard compression ratio with byte-for-byte verification.
- ACEAPEX stores LZ77 back-references as absolute positions, enabling parallel block-level decoding in 1 MB blocks
- CPU performance reaches 10,869 MB/s on EPYC 9575F, up to 3.1x faster than zstd -3 at similar ratios
- GPU wavefront decoder hits 77.2 GB/s on a single H100, 249.9 GB/s on two H100s via NVLink (depth-limited encoder)
Why It Matters
Breaks LZ77's sequential decode wall, unlocking GPU-scale decompression for large datasets like genomics and logs.