FreeTxt-Vi: A Benchmarked Vietnamese-English Toolkit for Segmentation, Sentiment, and Summarisation

Open-source, no-code web toolkit for Vietnamese-English text analysis performs competitively with established baselines, using fine-tuned Qwen2.5 and TabularisAI models.

Deep Dive

Researchers from Lancaster University and other institutions have released FreeTxt-Vi, an open-source toolkit designed to lower the barrier to advanced text analysis for Vietnamese and English. The web-based platform combines corpus linguistics with modern NLP, letting users build, explore, and interpret bilingual text collections through an interface that requires no coding. At its core is a unified bilingual processing pipeline with three components: a hybrid segmentation strategy that combines VnCoreNLP with Byte Pair Encoding (BPE), a fine-tuned TabularisAI model for sentiment analysis, and a fine-tuned Qwen2.5 model for generating abstractive summaries.
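To make the BPE half of the segmentation strategy concrete, here is a minimal sketch of how Byte Pair Encoding learns and applies merge rules. It is a generic toy version of the algorithm, not FreeTxt-Vi's implementation: this summary does not say how the toolkit combines BPE with VnCoreNLP, and the function names `learn_bpe`, `merge_pair`, and `apply_bpe` are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy corpus)."""
    # Each word starts as a sequence of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair merged.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment an unseen word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(apply_bpe("lowly", merges))
```

The point of the sketch is that subword merges let a tokenizer break a word it has never seen into familiar pieces. How FreeTxt-Vi pairs this with VnCoreNLP's word segmentation is not described in this summary.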

Rather than evaluating the platform only as a whole, the team ran a three-part evaluation that tests each component separately: segmentation, sentiment analysis, and summarization. The reported results show competitive or superior performance compared with widely used baseline models in both languages. By packaging these benchmarked tools in a single accessible interface, FreeTxt-Vi addresses the underrepresentation of Vietnamese in NLP resources. It is positioned to support reproducible research and to scale qualitative analysis in domains such as digital humanities, cultural heritage, and the social sciences, where processing large volumes of text has traditionally been a technical challenge.

Key Points
  • Unified bilingual pipeline uses hybrid VnCoreNLP/BPE segmentation, a fine-tuned TabularisAI sentiment model, and a fine-tuned Qwen2.5 model for summarization.
  • Three-part evaluation shows the toolkit achieves competitive or superior performance to established baselines in both Vietnamese and English.
  • Web-based, no-code design merges corpus linguistics features (concordancing, keyword analysis) with transformer-based NLP to support research in underrepresented languages.
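For readers unfamiliar with concordancing, one of the corpus linguistics features mentioned above, the following is a minimal keyword-in-context (KWIC) sketch. It assumes the text has already been split into tokens and is not taken from FreeTxt-Vi; the function name `kwic` and the `window` parameter are hypothetical.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, match, right context) for each occurrence of `keyword`."""
    target = keyword.lower()
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            # Up to `window` tokens on each side of the match.
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, tok, right))
    return rows


tokens = "the cat sat on the mat near the door".split()
for left, match, right in kwic(tokens, "the", window=2):
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>10} [{match}] {right}")
```

Aligning every hit on the keyword is what lets an analyst scan usage patterns at a glance. For Vietnamese, the quality of such a view depends on the segmentation step that produces the tokens.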

Why It Matters

Brings benchmarked NLP tools to Vietnamese, a major but underrepresented language, enabling scalable text analysis in research and industry without programming expertise.