20M+ Indian legal documents with citation graphs and vector embeddings – potential uses for legal NLP? [D]

Massive structured corpus with citation graphs and embeddings unlocks legal AI for India's judiciary.

Deep Dive

A developer known as Vaquill has released a dataset of over 20 million Indian legal documents after two years of work. The corpus covers India's Supreme Court, all 25 High Courts, and 14 Tribunals, and each document carries structured metadata: court, bench, date, parties, judges, and the legal sections it references. It also includes what is described as the first machine-readable citation graph for Indian law, which classifies each relationship between cases as 'followed,' 'distinguished,' 'overruled,' or 'mentioned.' Every document comes with a 1024-dimensional dense embedding from Voyage AI and a BM25 sparse vector, so it can be retrieved by semantic similarity, by keyword match, or by a combination of the two.
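To make the citation graph concrete, here is a minimal Python sketch of typed citation edges and two queries over them. It is an illustration only: the post does not publish a schema, so the `Citation` and `CitationGraph` names, the field names, and the case IDs are all hypothetical. Only the four relationship labels come from the post.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four relationship labels named in the post.
RELATIONS = {"followed", "distinguished", "overruled", "mentioned"}


@dataclass(frozen=True)
class Citation:
    citing: str    # ID of the later judgment (hypothetical ID format)
    cited: str     # ID of the earlier judgment it refers to
    relation: str  # one of RELATIONS


class CitationGraph:
    def __init__(self):
        # cited case ID -> list of citations pointing at it
        self._inbound = defaultdict(list)

    def add(self, citation):
        if citation.relation not in RELATIONS:
            raise ValueError(f"unknown relation: {citation.relation}")
        self._inbound[citation.cited].append(citation)

    def treatment(self, case_id):
        """Count how later judgments treated case_id, by relation."""
        counts = {relation: 0 for relation in sorted(RELATIONS)}
        for citation in self._inbound.get(case_id, []):
            counts[citation.relation] += 1
        return counts

    def overruled_by(self, case_id):
        """Return IDs of judgments that overruled case_id."""
        return [c.citing for c in self._inbound.get(case_id, [])
                if c.relation == "overruled"]


# Invented example edges; the IDs are placeholders, not real cases.
graph = CitationGraph()
graph.add(Citation("case_C", "case_A", "followed"))
graph.add(Citation("case_D", "case_A", "distinguished"))
graph.add(Citation("case_E", "case_A", "overruled"))
graph.add(Citation("case_E", "case_B", "mentioned"))

print(graph.treatment("case_A"))
print(graph.overruled_by("case_A"))
```

Indexing edges by the cited case makes "how has this precedent been treated?" a single lookup, which is the kind of question a typed citation graph is meant to answer.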

The dataset addresses a gap in Indian-language AI, where most corpora consist of conversational or news text rather than the formal, precise register of legal writing. It cross-references 23,122 Acts and Statutes with the cases that interpret them. The metadata extraction pipeline combines regex, heuristics, and LLM-based extraction to identify entities such as judges and advocates, which yields training data for legal Named Entity Recognition (NER) models. Judgments average 3,000 words and some exceed 50,000, so the corpus also works as a long-document benchmark for Retrieval-Augmented Generation (RAG) systems in the legal domain: the precedents a judgment actually cites give a clear ground truth for judging what a retriever returns.
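One way to read "citation relationships as ground truth" is as a recall@k evaluation: treat each judgment as a query, treat the precedents it cites as the relevant set, and check how many of them a retriever returns in its top k results. The sketch below assumes that setup. The function names and the document IDs are made up for illustration, and the post does not prescribe this metric.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved, relevant) pairs."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs]
    return sum(scores) / len(scores)


# Invented example: each query is a judgment, and the "relevant" set is the
# precedents it cites according to the citation graph.
runs = [
    (["p1", "p9", "p2", "p7"], {"p1", "p2"}),  # both cited cases in the top 3
    (["p5", "p6", "p3", "p4"], {"p3", "p4"}),  # only p3 in the top 3
]
print(mean_recall_at_k(runs, k=3))  # 0.75
```

Because the citation graph itself is only estimated to be 90-95% precise, scores computed this way inherit some label noise and are better suited to comparing retrievers than to reporting absolute quality.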

The data is available via API and as bulk exports in JSON and Parquet formats. It is in the public domain under Indian law, which removes copyright barriers for research. The developer notes several limitations. Coverage is primarily English text from High Courts, with regional language data coming from a translation service. Metadata extraction accuracy varies by court. The citation graph was built heuristically with LLM assistance and has an estimated 90-95% precision for citation extraction. Within those limits, the dataset is a resource for legal AI, graph neural network research, and low-resource Indian language model development.
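The post does not say how the 90-95% precision figure was estimated. A common approach, sketched here as an assumption, is to hand-check a random sample of extracted citations and report the share that are correct, together with a confidence interval. The sample counts below are invented for illustration.

```python
import math


def precision(correct, total):
    """Share of extracted citations that a reviewer marked correct."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    p = precision(correct, total)
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Invented sample: 200 extracted citations reviewed, 186 judged correct.
p = precision(186, 200)
low, high = wilson_interval(186, 200)
print(f"precision={p:.3f}, 95% interval=({low:.3f}, {high:.3f})")
```

Note that precision only measures how many extracted citations are right. It says nothing about citations the pipeline missed, which would need a separate recall estimate.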

Key Points
  • Contains 20M+ legal documents with structured metadata and a first-of-its-kind citation graph classifying relationships between cases.
  • Every document is embedded with Voyage AI (1024d dense vectors) and BM25 sparse vectors for hybrid retrieval.
  • Cross-references 23,122 Acts/Statutes and provides training data for legal NER and RAG benchmarking.

Why It Matters

This dataset provides the data infrastructure needed to build AI tools for legal research, case prediction, and analysis across India's large judicial system.