Research on the efficiency of data loading and storage in Data Lakehouse architectures for the formation of analytical data systems
A 7 GB benchmark shows Delta Lake loads fastest while Iceberg saves the most disk space.
A recent academic paper by Ivan Borodii and Halyna Osukhivska provides a rigorous, head-to-head comparison of the three most popular Data Lakehouse systems—Apache Hudi, Apache Iceberg, and Delta Lake—using Apache Spark as the distributed processing engine. The study, published on arXiv, tested each system's ability to handle structured (CSV) and semi-structured (JSON) data across varying sizes, including files up to 7 GB. The researchers developed four sequential ETL processes (read, transform, load) and evaluated performance based on two key criteria: data loading time and the resulting storage footprint on disk.
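The paper's two evaluation criteria — loading time and on-disk footprint — can be approximated outside Spark with a small measurement harness. The sketch below is illustrative only and not the authors' code: the function name, the pass-through "transform", and the plain-CSV output are all assumptions standing in for the paper's Spark-based ETL pipelines.

```python
import csv
import time
from pathlib import Path


def measure_load(src: Path, dest_dir: Path) -> tuple[float, int]:
    """Time a read-transform-write pass over a CSV file and report
    the resulting storage footprint of the destination directory.

    Hypothetical stand-in for the paper's two metrics; a real run
    would write through Spark to a Hudi/Iceberg/Delta table instead.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    out_file = dest_dir / "table.csv"

    start = time.perf_counter()
    with src.open(newline="") as fin, out_file.open("w", newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            writer.writerow(row)  # identity "transform"; real pipelines reshape here
    elapsed = time.perf_counter() - start

    # Footprint = total size of every file under the destination,
    # mirroring how table formats spread data across many files.
    footprint = sum(p.stat().st_size for p in dest_dir.rglob("*") if p.is_file())
    return elapsed, footprint
```

Running the same harness over each target format's output directory is what makes the comparison head-to-head: identical input, identical transform, only the storage layer varies.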
The experimental results offer clear, actionable guidance for data engineers and architects. Delta Lake emerged as the optimal architecture when speed is the priority, consistently delivering the fastest data loading times regardless of data volume or format. Apache Iceberg, by contrast, proved the best choice for systems where stability and disk space savings are critical, using significantly less storage than its competitors. Apache Hudi underperformed in both loading speed and storage efficiency in this benchmark, suggesting its strengths lie in other use cases such as incremental updates and streaming processing. This is the first known comparison of these three systems specifically for selecting an architecture for analytical data systems.
- Delta Lake is optimal for speed: fastest data loading for any volume up to 7 GB.
- Apache Iceberg is best for storage efficiency: it consistently used the least disk space in the benchmark.
- Apache Hudi lagged in loading and storage but may suit incremental updates and streaming.
Why It Matters
This benchmark gives data engineers a data-driven framework to choose between speed (Delta Lake) and storage efficiency (Iceberg) for Lakehouse architectures.