Research & Papers

New LLM framework extracts and aligns financial segment disclosures with RAG

A research team uses LLMs to fill data gaps in segment reporting from 10-K filings.

Deep Dive

Segment disclosures in 10-K filings are notoriously messy: they are scattered across tables and narrative text, and are often incomplete or inconsistent across years and firms. Existing structured databases miss nested segments and lack support for longitudinal or cross-company analysis. To address this, Yue Liu, Zhiyuan Cheng, and Longying Lai built a large language model (LLM) framework that automatically extracts both reportable and nested segment information directly from raw filings. They also integrated a retrieval-augmented generation (RAG) module that retrieves data from multiple filings, enabling queries that depend on how a firm's organization changed over time or that require aligning geographic segments between firms with different reporting structures.
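To make the retrieval step concrete, here is a minimal sketch of how extracted segment records from several filings might be ranked against a question and packed into an LLM prompt. It is an illustration only, not the authors' implementation: the `SegmentRecord` schema, the bag-of-words cosine scoring (a stand-in for whatever retriever the paper uses), the function names, and the sample firms and text are all assumptions made up for this example.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SegmentRecord:
    """One extracted segment disclosure (hypothetical schema, not from the paper)."""
    firm: str
    fiscal_year: int
    segment: str
    parent: Optional[str]  # None for a reportable segment, else the segment it is nested under
    text: str


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, records: List[SegmentRecord], top_k: int = 3) -> List[SegmentRecord]:
    """Return up to top_k records with non-zero similarity to the query, best first."""
    q = tokenize(query)
    scored = [(cosine(q, tokenize(f"{r.firm} {r.segment} {r.text}")), r) for r in records]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored[:top_k]]


def build_prompt(query: str, hits: List[SegmentRecord]) -> str:
    """Assemble retrieved records into a context block followed by the question."""
    lines = []
    for r in hits:
        nesting = f" (under {r.parent})" if r.parent else ""
        lines.append(f"[{r.firm} FY{r.fiscal_year}] {r.segment}{nesting}: {r.text}")
    return "Context:\n" + "\n".join(lines) + f"\n\nQuestion: {query}"


# Made-up records standing in for output of the extraction step.
records = [
    SegmentRecord("ExampleCo", 2021, "Cloud Services", None, "cloud hosting and platform revenue"),
    SegmentRecord("ExampleCo", 2022, "Cloud Platform", "Enterprise",
                  "cloud services moved under the enterprise segment"),
    SegmentRecord("OtherCo", 2022, "EMEA", None, "sales in europe middle east and africa"),
]

query = "How did ExampleCo report cloud over time?"
hits = retrieve(query, records)
prompt = build_prompt(query, hits)
```

The point of the sketch is that the context handed to the model spans more than one filing, so a question about a reorganization can draw on both the earlier and the later year. A production retriever would more likely use dense embeddings and metadata filters on firm and fiscal year.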

The authors validated the framework on two representative scenarios: tracking how a single firm's segment structure evolved over successive years, and aligning geographic segment definitions across different companies. They report high extraction accuracy and strong performance on questions requiring cross-period context. The paper, submitted to Accounting Horizons, runs 39 pages with 4 figures and outlines a concrete path for applying LLMs in financial reporting, making segment data more complete, comparable, and usable for analysts, regulators, and researchers alike.
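The two validation scenarios can be pictured with a small sketch. The code below is an assumption-laden illustration rather than the paper's method: `segment_changes` diffs a firm's segment names between consecutive fiscal years, and `align_region` maps firm-specific geographic labels onto a shared set of regions through a hand-written lookup table, where the paper relies on an LLM. All names, labels, and data here are hypothetical.

```python
import re
from typing import Dict, List, Optional, Set


def segment_changes(by_year: Dict[int, Set[str]]) -> List[dict]:
    """List the segments added and dropped between each pair of consecutive years."""
    years = sorted(by_year)
    changes = []
    for prev, curr in zip(years, years[1:]):
        added = by_year[curr] - by_year[prev]
        dropped = by_year[prev] - by_year[curr]
        if added or dropped:  # skip years where the structure did not change
            changes.append({
                "from": prev,
                "to": curr,
                "added": sorted(added),
                "dropped": sorted(dropped),
            })
    return changes


# Hypothetical lookup from normalized firm-specific labels to shared regions.
REGION_MAP = {
    "emea": "EMEA",
    "europe middle east and africa": "EMEA",
    "north america": "Americas",
    "united states": "Americas",
    "apac": "APAC",
    "asia pacific": "APAC",
}


def align_region(label: str) -> Optional[str]:
    """Map a reported geographic label to a shared region, or None if unknown."""
    key = re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()
    return REGION_MAP.get(key)


# Made-up segment history for a single firm.
history = {
    2020: {"Consumer", "Enterprise"},
    2021: {"Consumer", "Enterprise"},
    2022: {"Consumer", "Cloud", "Devices"},
}
changes = segment_changes(history)
```

A plain name diff like this cannot tell a renamed segment from a dropped one, and a fixed lookup table breaks on labels it has never seen. Those are the cases where retrieving the surrounding narrative text and asking an LLM becomes useful.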

Key Points
  • Extracts both reportable and nested segment information from 10-K tables and narrative text.
  • Uses a RAG system to link data across multiple years and firms for comparative analysis.
  • Validated on longitudinal tracking of segment changes and cross-firm alignment of geographic segments.

Why It Matters

LLMs can automate and standardize the extraction of critical financial segment data, reducing manual effort and enabling deeper cross-company analysis.