Comparison of Outlier Detection Algorithms on String Data
A new bachelor's thesis introduces a regex-based method and a modified LOF algorithm for finding anomalies in text data.
A bachelor's thesis by Philip Maus, published on arXiv, tackles the underexplored problem of outlier detection in string data. Anomaly detection is a mature field for numerical data, but finding irregularities in text such as system logs, codes, or single-word entries remains difficult. The thesis compares two algorithmic approaches designed specifically for strings, moving beyond traditional numerical methods.
The first approach is a tailored variant of the established Local Outlier Factor (LOF) algorithm. It uses a custom, weighted Levenshtein distance that can account for hierarchical character classes (for example, grouping vowels or digits), so the distance can be tuned to a specific dataset. The second is a new syntactic method based on a hierarchical left regular expression (regex) learner, which infers a pattern for 'normal' data and flags strings that do not conform to it.
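To make the first approach concrete, here is a minimal Python sketch of a class-aware weighted Levenshtein distance fed into a plain LOF computation. It is not the thesis's implementation: the two-level character classes in `char_class`, the cost values, the neighbour count `k`, and the sample codes are all assumptions chosen for illustration.

```python
def char_class(c):
    """Coarse character class (an illustrative assumption, not the thesis's hierarchy)."""
    if c.isdigit():
        return "digit"
    if c.isalpha():
        return "letter"
    return "other"


def weighted_levenshtein(a, b, same_class=0.5, diff_class=1.0, indel=1.0):
    """Edit distance where substituting within a character class is cheaper."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [i * indel]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                sub = 0.0
            elif char_class(ca) == char_class(cb):
                sub = same_class
            else:
                sub = diff_class
            curr.append(min(prev[j] + indel,       # delete ca
                            curr[j - 1] + indel,   # insert cb
                            prev[j - 1] + sub))    # substitute or match
        prev = curr
    return prev[-1]


def lof_scores(items, dist, k=2):
    """Standard LOF on an arbitrary distance function; requires k < len(items)."""
    n = len(items)
    d = [[dist(a, b) for b in items] for a in items]
    k_dist, neighbours = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])
        kd = d[i][order[k - 1]]
        k_dist.append(kd)
        neighbours.append([j for j in order if d[i][j] <= kd])  # keep ties

    def lrd(i):
        # Local reachability density; the floor avoids division by zero
        # when a point has exact duplicates.
        reach = [max(k_dist[o], d[i][o]) for o in neighbours[i]]
        return len(reach) / max(sum(reach), 1e-12)

    dens = [lrd(i) for i in range(n)]
    return [sum(dens[o] for o in neighbours[i]) / (len(neighbours[i]) * dens[i])
            for i in range(n)]


# Hypothetical sample data: five similar codes and one unrelated string.
codes = ["AB-1001", "AB-1002", "AB-1003", "AB-1004", "AB-2001", "hello world"]
scores = lof_scores(codes, weighted_levenshtein, k=2)
for code, score in sorted(zip(codes, scores), key=lambda p: -p[1]):
    print(f"{score:8.2f}  {code}")
```

Scores near 1 indicate a string whose neighbourhood is about as dense as its neighbours'; clearly larger scores indicate candidates for outliers. Lowering `same_class` makes the distance more tolerant of digit-for-digit or letter-for-letter changes, which is the kind of dataset-specific tuning the weighted distance is meant to allow.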
Through experiments on various datasets, Maus shows that both algorithms can identify outliers in string data, but with different strengths. The regex-based learner excels when the 'normal' data has a clear, distinct structure that outliers violate (e.g., a malformed product code). The modified LOF algorithm, by contrast, performs best when the edit distance between outliers and the core data is clearly different from the edit distances within the core data, even if the structure is similar. The comparison can help data scientists and engineers choose the right tool for cleaning logs or detecting anomalies in textual datasets.
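The structural case can be illustrated with a deliberately simplified stand-in for a pattern learner. The Python sketch below is not the hierarchical left regex learner from the thesis: it merely generalizes each string into a regex over letter runs, digit runs, and literal separators, treats the most frequent pattern as 'normal', and flags everything that does not match. The function names and sample codes are hypothetical.

```python
import re
from collections import Counter

# Split a string into runs of ASCII letters, runs of ASCII digits,
# or single other characters.
TOKEN = re.compile(r"(?P<alpha>[A-Za-z]+)|(?P<digit>[0-9]+)|(?P<other>.)", re.DOTALL)


def generalize(s):
    """Turn a string into a regex describing its coarse structure."""
    parts = []
    for m in TOKEN.finditer(s):
        if m.lastgroup == "alpha":
            parts.append("[A-Za-z]+")
        elif m.lastgroup == "digit":
            parts.append("[0-9]+")
        else:
            parts.append(re.escape(m.group()))  # keep separators literal
    return "".join(parts)


def flag_outliers(strings):
    """Return (learned pattern, strings that do not match it).

    Assumes a non-empty input in which the majority structure is the normal one.
    """
    pattern, _ = Counter(generalize(s) for s in strings).most_common(1)[0]
    rx = re.compile(pattern)
    return pattern, [s for s in strings if not rx.fullmatch(s)]


# Hypothetical product codes; the last two break the dominant structure.
codes = ["AB-1001", "AB-1002", "XY-77", "AB1003", "AB-10O4"]
pattern, outliers = flag_outliers(codes)
print(pattern)
print(outliers)
```

Note that "AB-10O4" is only one substitution away from a valid code, so a purely distance-based method may score it as ordinary, while a structural check catches the letter inside the digit run.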
- Introduces a modified Local Outlier Factor (LOF) algorithm using a tunable, weighted Levenshtein distance for string density calculation.
- Presents a novel syntactical outlier detection method based on a hierarchical left regular expression (regex) learner.
- Experimental results show the regex method excels with structured data, while the LOF variant is better for edit-distance-based anomalies.
Why It Matters
Provides practical algorithms for automating data cleaning and anomaly detection in system logs, product codes, and other critical text-based datasets.