Natural Language Processing Models for Robust Document Categorization
Study finds BiLSTM offers the best balance for real-world automation, pairing 98.56% accuracy with moderate computational cost.
A research team including Radoslaw Roszczyk, Pawel Tecza, Maciej Stodolski, and Krzysztof Siwek has published a study evaluating machine learning models for robust document categorization, a critical task for automating workflows like technical support routing. The paper, "Natural Language Processing Models for Robust Document Categorization," directly addresses the trade-off between classification accuracy and computational efficiency, a key hurdle for integrating AI into production pipelines. To find the optimal balance, the researchers benchmarked three models of varying complexity: a simple Naive Bayes classifier, a bidirectional LSTM (BiLSTM) network, and a fine-tuned transformer-based BERT model.
The experiments revealed clear performance tiers. The fine-tuned BERT model delivered the highest accuracy, consistently exceeding 99%, but required significantly longer training times and greater computational resources. In contrast, the Naive Bayes classifier was the fastest to train, completing in milliseconds, but delivered the lowest accuracy at around 94.5%. The BiLSTM model emerged as the most balanced solution, achieving 98.56% accuracy at moderate training cost while its bidirectional architecture captured context on both sides of each word. The researchers also built a fully functional demonstration system to validate practical applicability, showing that automated routing of technical requests reached throughput levels that manual processing cannot match. The authors conclude that BiLSTM is the most viable option for the examined scenario, which involves class imbalance, and they note that more efficient transformer architectures remain an opportunity for future work.
- BERT achieved the highest accuracy (>99%) but had the highest computational cost and training time.
- BiLSTM provided a strong compromise with 98.56% accuracy and moderate resource requirements, deemed the most balanced solution.
- Naive Bayes was fastest to train (milliseconds) but had the lowest accuracy at ~94.5%.
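To illustrate why a Naive Bayes classifier can train in milliseconds, here is a minimal sketch of a multinomial Naive Bayes model written with only the Python standard library. It is not the authors' implementation. The toy tickets, the labels, and the function names `train_naive_bayes` and `predict` are all invented for illustration. Training amounts to a single counting pass over the data, which is where the speed comes from.

```python
import math
import time
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Fit a multinomial Naive Bayes model by counting words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-posterior for `text`."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy tickets, invented for illustration only.
docs = [
    "printer not printing paper jam",
    "cannot login password reset",
    "printer driver install error",
    "account locked password expired",
]
labels = ["hardware", "account", "hardware", "account"]

start = time.perf_counter()
model = train_naive_bayes(docs, labels)
elapsed_ms = (time.perf_counter() - start) * 1000

print(predict(model, "password reset for locked account"))
print(f"trained in {elapsed_ms:.3f} ms")
```

The same simplicity explains the accuracy gap: the model treats each word independently and ignores word order, which a BiLSTM or BERT model does not.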
Why It Matters
The study gives engineers a clear framework for choosing a model for real-world document automation based on the trade-off between accuracy and computational cost.
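One way to turn that trade-off into a concrete decision rule is sketched below. The accuracy figures come from the results reported above. The `cost_tier` ranks (1 to 3) are an assumption that only encodes the relative ordering described in the study, not measured training times, and the `Candidate` class and `choose_model` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float  # reported accuracy as a fraction
    cost_tier: int   # ASSUMPTION: 1 = cheapest to train, 3 = most expensive


CANDIDATES = [
    Candidate("Naive Bayes", 0.945, 1),
    Candidate("BiLSTM", 0.9856, 2),
    Candidate("BERT", 0.99, 3),  # reported as ">99%"; 0.99 is used as a lower bound
]


def choose_model(min_accuracy, max_cost_tier):
    """Return the cheapest candidate meeting both constraints, or None."""
    eligible = [
        c for c in CANDIDATES
        if c.accuracy >= min_accuracy and c.cost_tier <= max_cost_tier
    ]
    return min(eligible, key=lambda c: c.cost_tier, default=None)


choice = choose_model(min_accuracy=0.98, max_cost_tier=2)
print(choice.name if choice else "no model fits the constraints")
```

With a 98% accuracy floor and a mid-range cost budget, this rule selects BiLSTM, which matches the study's conclusion. Tightening the floor to 99% under the same budget returns no candidate, which signals that the budget would have to grow to accommodate BERT.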