MaxSketch slashes memory for distinct counting in noisy streams
Random projections replace hashing, cutting memory from O(√n) to O(log n).
Counting distinct elements in data streams is a classic problem, but traditional sketches like HyperLogLog assume identical duplicates. In modern settings, such as repeated images of the same person that vary at the pixel level, hash-based methods break down. Recent robust distinct counting in general metric spaces required Õ(√n) memory in the worst case. Now, a new paper by Tsikouras, Caramanis, and Tzamos (arXiv:2605.15571) introduces MaxSketch, which leverages the geometric structure common in learned representations (e.g., embeddings) to dramatically reduce memory. By using random Gaussian projections in a simple max-linear sketch, they prove that m = Õ(log n/ε²) projections suffice to recover the true distinct count within a (1+ε) factor. This bridges classical streaming algorithms and representation learning, showing that geometric structure can fundamentally lower complexity.
Experiments on image streams confirm that MaxSketch accurately estimates distinct counts and generalizes beyond training regimes. The algorithm's memory footprint is exponentially smaller than prior robust methods when data lies near a low-dimensional subspace, which is typical for modern neural network embeddings. This result has immediate implications for analytics on noisy real-time streams—such as surveillance, recommendation systems, and social media monitoring—where approximate duplicates are the norm. The team’s theoretical guarantees and empirical validation make MaxSketch a practical tool for large-scale distinct counting in the era of high-dimensional, noisy data.
- Reduces memory from Õ(√n) to Õ(log n/ε²) for robust distinct counting under geometric structure.
- Uses random Gaussian projections instead of hashing to handle noisy approximate duplicates.
- Validated on image streams, achieving accurate (1+ε)-approximation and generalizing beyond training data.
Why It Matters
Enables efficient distinct counting in noisy real-world streams, a key bottleneck for image and embedding-based analytics.