Leveraging Weighted Syntactic and Semantic Context Assessment Summary (wSSAS) Towards Text Categorization Using LLMs
New deterministic framework tackles LLM 'noise' to improve enterprise text analytics with a Signal-to-Noise Ratio filter.
A team of researchers including Shreeya Verma Kathuria has published a new paper introducing the Weighted Syntactic and Semantic Context Assessment Summary (wSSAS) framework. The system addresses a core weakness of Large Language Models (LLMs) in enterprise-grade text analytics: their stochastic, noise-sensitive behavior, which compromises precision and reproducibility. wSSAS enforces data integrity on large, chaotic datasets through a deterministic, two-phase validation process.
First, the framework organizes raw text into a hierarchical structure of Themes, Stories, and Clusters. It then applies a novel Signal-to-Noise Ratio (SNR) mechanism to prioritize high-value semantic features, ensuring the model focuses on the most representative data points. This scoring is integrated into a Summary-of-Summaries (SoS) architecture to isolate essential information and mitigate background noise during aggregation.
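The two phases described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the class names mirror the Theme/Story/Cluster hierarchy, but the per-review `scores`, the SNR formula (mean divided by population standard deviation), and the `min_snr` threshold are all hypothetical, since the summary does not say how wSSAS computes them. The cluster summaries are plain strings standing in for LLM output.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class Cluster:
    label: str
    # Hypothetical per-review relevance scores in [0, 1]; the source does not
    # specify how such scores are produced.
    scores: list[float]
    summary: str  # stand-in for an LLM-generated cluster summary


@dataclass
class Story:
    title: str
    clusters: list[Cluster] = field(default_factory=list)


@dataclass
class Theme:
    name: str
    stories: list[Story] = field(default_factory=list)


def snr(scores: list[float], eps: float = 1e-9) -> float:
    """Assumed SNR: mean score (signal) over its spread (noise)."""
    return mean(scores) / (pstdev(scores) + eps)


def summary_of_summaries(theme: Theme, min_snr: float) -> list[str]:
    """Keep cluster summaries whose SNR clears the threshold, highest first."""
    ranked = sorted(
        ((snr(c.scores), c.summary) for s in theme.stories for c in s.clusters),
        reverse=True,
    )
    return [summary for score, summary in ranked if score >= min_snr]


theme = Theme(
    "Service",
    [
        Story(
            "Wait times",
            [
                Cluster("slow", [0.9, 0.8, 0.85], "Customers report long waits."),
                Cluster("mixed", [0.9, 0.1, 0.5], "Assorted unrelated remarks."),
            ],
        )
    ],
)

# The tightly grouped cluster passes the filter; the scattered one is dropped.
print(summary_of_summaries(theme, min_snr=5.0))
```

Because the filter is a pure function of the scores, the same input always yields the same selection, which is the kind of determinism the framework aims for before any summaries are aggregated.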
The team tested wSSAS using Google's Gemini 2.0 Flash Lite model across diverse real-world datasets, including Google Business reviews, Amazon Product reviews, and Goodreads Book reviews. The authors report that the framework significantly improves clustering integrity and categorization accuracy. Specifically, wSSAS reduces categorization entropy, offering a reproducible, high-precision pathway for LLM-based summarization and classification tasks that were previously hindered by inconsistent outputs.
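To make "categorization entropy" concrete, here is a minimal sketch that assumes the term means the Shannon entropy of the labels a model assigns to the same text over repeated runs. The summary does not give the paper's exact definition, so both that reading and the sample labels below are assumptions for illustration.

```python
from collections import Counter
from math import log2


def label_entropy(labels: list[str]) -> float:
    """Shannon entropy (in bits) of the label distribution."""
    counts = Counter(labels)
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in counts.values())


# Hypothetical labels for one review across four repeated runs.
unstable_runs = ["pricing", "service", "pricing", "quality"]
stable_runs = ["pricing", "pricing", "pricing", "pricing"]

print(label_entropy(unstable_runs))
print(label_entropy(stable_runs))
```

Under this reading, a perfectly reproducible categorizer scores zero, and higher values indicate that repeated runs disagree more often.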
- Introduces a deterministic framework (wSSAS) to counter the stochastic noise of LLMs in text categorization.
- Uses a Signal-to-Noise Ratio (SNR) filter within a Summary-of-Summaries architecture to prioritize key data.
- Tested with Gemini 2.0 Flash Lite on Google Business, Amazon Product, and Goodreads Book review datasets, where it improved categorization accuracy.
Why It Matters
Enables reliable, reproducible LLM analytics for enterprise tasks like customer feedback analysis and content moderation.