Research & Papers

[R] JADS: Joint Aspect Discovery and Summarization — outperforms two-step pipelines by 8-9 ROUGE points with self-supervised training

New self-supervised model improves both clustering and summarization simultaneously with end-to-end training.

Deep Dive

Researchers have developed JADS (Joint Aspect Discovery and Summarization), a novel framework that unifies multi-document topic discovery and summarization into a single end-to-end model. Traditional approaches use separate clustering (like BERTopic) and summarization (like Longformer) stages, where clustering errors propagate downstream and the two tasks can't mutually improve each other. JADS instead relies on self-supervised data creation—mixing sentences from multiple articles and using the original summaries as supervision—to train a Longformer encoder-decoder that processes up to 16K tokens.

Results show dramatic improvements: JADS achieves a 37.33 ROUGE-1 score versus 26.98 for two-step pipelines, an 8-9 point gain. JADS also produces exactly K clusters and reaches 0.79 BERTScore F1, whereas traditional methods average only 2.43 clusters and 0.64 F1. The key innovation is end-to-end differentiability: summarization gradients flow back and improve clustering, creating genuine mutual reinforcement between the two tasks.
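The self-supervised data creation described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the function name, field layout, and `k=2` setting are assumptions, and the toy article/summary pairs are invented for demonstration.

```python
import random

def make_training_example(pairs, k=2, seed=None):
    """Mix sentences from k (article, summary) pairs into one shuffled
    input document; the k original summaries act as the multi-aspect
    supervision signal, so no manual annotation is needed."""
    rng = random.Random(seed)
    sampled = rng.sample(pairs, k)
    # Pool every sentence from the k source articles...
    sentences = [s for article, _ in sampled for s in article]
    # ...and interleave them at random, hiding the aspect boundaries.
    rng.shuffle(sentences)
    source = " ".join(sentences)
    # Supervision: one target summary per latent aspect (source article).
    targets = [summary for _, summary in sampled]
    return source, targets

# Toy (article, summary) pairs standing in for a real summarization corpus.
pairs = [
    (["Cats sleep a lot.", "They purr when happy."], "Cats rest and purr."),
    (["Rust compiles fast.", "It is memory safe."], "Rust is fast and safe."),
    (["Rain fell all day.", "Streets flooded."], "Heavy rain caused floods."),
]
source, targets = make_training_example(pairs, k=2, seed=0)
print(len(targets))  # one target summary per mixed-in article
```

Because the model must both separate the shuffled sentences back into their source articles and summarize each group, training on such examples pushes clustering and summarization to improve together.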

Key Points
  • JADS achieves 37.33 ROUGE-1 score, beating traditional two-step pipelines by 8-9 points
  • Model processes 16K tokens with Longformer encoder-decoder and requires no manual annotation
  • Improves clustering quality to 0.79 BERTScore F1 while producing exactly K clusters
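The ROUGE-1 gap cited above measures unigram-overlap F1 between generated and reference summaries. A minimal sketch of the metric (real evaluations typically use the `rouge-score` package, which also applies stemming):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the cat sat on the mat",
                      "the cat lay on the mat"), 2))  # → 0.83
```

On this scale, moving from 26.98 to 37.33 means the model's summaries share substantially more content words with the human references.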

Why It Matters

Eliminates error propagation in document analysis pipelines and enables more accurate automated content organization and summarization.