Research & Papers

Nepali Passport Question Answering: A Low-Resource Dataset for Public Service Applications

A 7-page paper introduces a Nepali passport FAQ dataset and uses it to fine-tune SBERT and E5 models that outperform BM25.

Deep Dive

A team of researchers has published a new dataset specifically designed to improve AI-powered information retrieval for public services in Nepal. The 'Nepali Passport Question Answering' dataset, created by Funghang Limbu Begha, Praveen Acharya, and Bal Krishna Bal, tackles the critical challenge of building effective systems for low-resource languages by providing a structured collection of Frequently Asked Questions (FAQs) related to passport services. This annotated data fills a significant gap, as Nepali lacks the computational linguistic resources available for languages like English.

In their study, presented at RegICON 2025, the team used this dataset to fine-tune several transformer-based embedding models for semantic search, including Sentence-BERT (SBERT) and the multilingual E5 model. They benchmarked these against BM25, a traditional keyword-matching ranking algorithm. The fine-tuned models, particularly the multilingual E5 embeddings, significantly outperformed BM25. The highest performance came from a hybrid retrieval system that combined the fine-tuned E5 model with BM25.
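
To make the hybrid idea concrete, here is a minimal sketch of one common way to combine dense and keyword scores: a weighted sum of min-max normalized scores. This is an assumption for illustration, not the fusion method from the paper, which this summary does not specify. The function names, the `alpha` weight, the English placeholder FAQ strings, and the hand-written `dense` similarities (standing in for cosine similarities from a fine-tuned E5 model) are all hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for the query."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def min_max(values):
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_scores(dense, sparse, alpha=0.5):
    """Weighted sum of normalized dense and sparse scores (assumed fusion rule)."""
    d, s = min_max(dense), min_max(sparse)
    return [alpha * x + (1 - alpha) * y for x, y in zip(d, s)]


# Placeholder FAQ questions (not taken from the dataset).
faqs = ["how to renew passport", "passport fee for minors", "lost passport report"]
query = "renew passport"

# Placeholder dense similarities; a real system would compute these
# from query and FAQ embeddings produced by the fine-tuned model.
dense = [0.82, 0.40, 0.55]
sparse = bm25_scores(query.split(), [f.split() for f in faqs])

combined = hybrid_scores(dense, sparse, alpha=0.5)
best = max(range(len(faqs)), key=combined.__getitem__)
print(faqs[best])
```

The `alpha` weight controls how much the ranking leans on the embedding model versus keyword overlap, and would normally be tuned on held-out queries. Whitespace tokenization is used here only for brevity; Nepali text would need a suitable tokenizer.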

This work offers a practical blueprint for deploying AI in government and public service contexts where local language support is essential. By showing that modern embedding models can be adapted effectively even with limited data, it opens the door to similar applications in other low-resource languages and bureaucratic domains, from visa applications to social benefit inquiries.

Key Points
  • Created a dedicated Nepali FAQ dataset for passport services, addressing a major data gap for low-resource languages.
  • Fine-tuned transformer models (SBERT, E5) for semantic search, with multilingual E5 performing best among the embedding models.
  • Implemented a hybrid E5+BM25 retrieval system that achieved the highest performance overall, ahead of the BM25 baseline, offering a deployable solution for public AI.

Why It Matters

Provides a template for building accurate, AI-driven public service chatbots and information systems in underserved languages worldwide.