ViLegalNLI: Natural Language Inference for Vietnamese Legal Texts

42,012 legal premise-hypothesis pairs to train AI on Vietnamese statutory logic.

Deep Dive

Researchers from Vietnam (Duong, Ho, Huynh, and Nguyen) present ViLegalNLI, the first large-scale Natural Language Inference dataset tailored for Vietnamese legal documents. It contains 42,012 premise-hypothesis pairs extracted from official statutory texts, each labeled as Entailment or Non-entailment. To ensure quality, the team used a semi-automatic generation framework in which large language models create hypotheses under controlled conditions and the results are systematically validated, with artifact mitigation and cross-model consistency checks. The dataset spans multiple legal domains and captures realistic reasoning patterns such as paraphrasing, logical implication, and invalid inferences.
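To make the data layout concrete, here is a minimal Python sketch of how one premise-hypothesis pair and a cross-model consistency check might be represented. The field names, the `passes_consistency_check` helper, and the sample text are assumptions for illustration only; the summary above does not describe the dataset's actual schema or the exact validation rule the authors used.

```python
from dataclasses import dataclass
from typing import List

# The two labels named in the summary.
LABELS = ("Entailment", "Non-entailment")


@dataclass(frozen=True)
class NLIPair:
    premise: str      # statutory text
    hypothesis: str   # generated statement to test against the premise
    label: str        # one of LABELS
    domain: str       # legal field (hypothetical field name)

    def __post_init__(self) -> None:
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


def passes_consistency_check(pair: NLIPair, model_labels: List[str]) -> bool:
    """Assumed rule: keep a pair only if every checking model reproduces its label."""
    return len(model_labels) > 0 and all(m == pair.label for m in model_labels)


# Made-up placeholder text, not taken from ViLegalNLI.
pair = NLIPair(
    premise="An employee is entitled to at least one day off per week.",
    hypothesis="Employees must get a weekly rest day.",
    label="Entailment",
    domain="labor",
)
```

A unanimity rule like this is the strictest option; a majority vote would keep more pairs at the cost of noisier labels.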

Extensive experiments on ViLegalNLI using multilingual models, Vietnamese-specific pre-trained models, and instruction-tuned LLMs reveal that few-shot LLM configurations consistently outperform other approaches. Performance is strongly influenced by hypothesis length, lexical overlap, and reasoning complexity. Cross-domain evaluations highlight the difficulty of generalizing legal inference across different legal fields. The dataset is publicly available for research, establishing a foundational benchmark for Vietnamese legal NLI and supporting future work in legal reasoning, statutory text understanding, and reliable AI systems for legal analysis.
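One way to probe the effect of lexical overlap on performance is to bucket examples by how many hypothesis tokens also appear in the premise, then compare accuracy per bucket. The sketch below assumes whitespace tokenization (a simplification for Vietnamese, where a word can span several syllables) and a hypothetical 0.5 threshold; it is not the paper's analysis procedure.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def lexical_overlap(premise: str, hypothesis: str) -> float:
    """Fraction of distinct hypothesis tokens that also occur in the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = set(hypothesis.lower().split())
    if not hypothesis_tokens:
        return 0.0
    return len(premise_tokens & hypothesis_tokens) / len(hypothesis_tokens)


def accuracy_by_overlap(
    examples: List[Tuple[str, str, str]],  # (premise, hypothesis, gold label)
    predictions: List[str],
    threshold: float = 0.5,                # hypothetical cut-off between buckets
) -> Dict[str, float]:
    """Accuracy computed separately for low- and high-overlap examples."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for (premise, hypothesis, gold), pred in zip(examples, predictions):
        bucket = "high" if lexical_overlap(premise, hypothesis) >= threshold else "low"
        total[bucket] += 1
        correct[bucket] += int(pred == gold)
    return {bucket: correct[bucket] / total[bucket] for bucket in total}
```

A large accuracy gap between the two buckets would suggest a model is leaning on surface word matching rather than on the legal reasoning itself.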

Key Points
  • 42,012 premise-hypothesis pairs sourced from official Vietnamese statutory documents with binary entailment labeling.
  • Semi-automatic generation using LLMs with artifact mitigation and cross-model validation for annotation reliability.
  • Few-shot LLMs achieve the best performance, but cross-domain generalization remains a key challenge.
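A cross-domain evaluation of the kind mentioned above can be sketched as a leave-one-domain-out split: train on every legal field except one, then test on the held-out field. The tuple layout and the domain names used in the usage note are hypothetical, and the paper's actual protocol may differ.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

# (premise, hypothesis, label, domain); the layout is an assumption.
Example = Tuple[str, str, str, str]


def leave_one_domain_out(
    examples: List[Example],
) -> Iterator[Tuple[str, List[Example], List[Example]]]:
    """Yield (held_out_domain, train_examples, test_examples) for each domain."""
    by_domain: Dict[str, List[Example]] = defaultdict(list)
    for example in examples:
        by_domain[example[3]].append(example)
    # Sort domain names so the split order is deterministic.
    for held_out in sorted(by_domain):
        train = [
            example
            for domain, items in by_domain.items()
            if domain != held_out
            for example in items
        ]
        yield held_out, train, by_domain[held_out]
```

With examples drawn from, say, "civil" and "labor" domains, this yields two splits, each testing on a field the model never saw during training.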

Why It Matters

ViLegalNLI gives researchers a public benchmark for building and evaluating AI systems for Vietnamese legal analysis and decision support, a crucial but underserved domain.