Using automated error profiling of texts for improved selection of correction candidates for garbled tokens

Authors:
Stoyan Mihov;Petar Mitankin;Annette Gotscharek;Ulrich Reffle;Klaus U. Schulz;Christoph Ringlstetter
Affiliations:
IPP, Bulgarian Academy of Sciences;IPP, Bulgarian Academy of Sciences;CIS, University of Munich;CIS, University of Munich;CIS, University of Munich;AICML, University of Alberta
Venue:
AI'07 Proceedings of the 20th Australian joint conference on Advances in artificial intelligence
Year:
2007

Abstract

Lexical text correction systems are typically built around a central step: when a malformed token is found in the input text, a set of correction candidates for the token is retrieved from a given background dictionary. In previous work we introduced a method for selecting correction candidates that is fast and leads to small candidate sets with high recall. As a prerequisite, ground truth data were used to find a set of important substitutions, merges and splits that represent characteristic errors found in the text. This prior knowledge was then used to fine-tune the selection of correction candidates. Here we show that an appropriate set of possible substitutions, merges and splits for the input text can be retrieved without any ground truth data. In the new approach, we compute an error profile of the erroneous input text in a fully automated way, using so-called error dictionaries. From this profile, suitable sets of substitutions, merges and splits are derived. Error profiling with error dictionaries is simple and very fast. As an overall result we obtain an adaptive form of candidate selection that is very efficient, needs no ground truth data and leads to small candidate sets with high recall.
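
The abstract does not spell out how an error dictionary is built or how the profile is computed, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes that an error dictionary maps each garbled form to the dictionary word and error pattern that produce it, that each garbled form carries exactly one pattern application, and that a pattern is kept when it is observed at least `min_count` times. The function names, the toy lexicon, the pattern list and the threshold are all hypothetical.

```python
from collections import Counter

def build_error_dictionary(lexicon, patterns):
    """Map each garbled form to the (word, pattern) pairs that produce it.

    Assumption: a pattern is a (correct, wrong) pair of strings, which covers
    substitutions ("e" -> "c"), splits ("m" -> "rn") and merges ("rn" -> "m").
    """
    error_dict = {}
    for word in lexicon:
        for correct, wrong in patterns:
            start = word.find(correct)
            while start != -1:
                # Apply the pattern at exactly one position of the word.
                garbled = word[:start] + wrong + word[start + len(correct):]
                # Forms that are themselves dictionary words are not errors.
                if garbled not in lexicon:
                    error_dict.setdefault(garbled, set()).add((word, (correct, wrong)))
                start = word.find(correct, start + 1)
    return error_dict

def profile_errors(tokens, lexicon, error_dict):
    """Count how often each pattern explains an unknown token of the text."""
    counts = Counter()
    for token in tokens:
        if token in lexicon:
            continue
        explaining = {pattern for _, pattern in error_dict.get(token, ())}
        # Assumption: only tokens explained by a single pattern are counted.
        if len(explaining) == 1:
            counts[next(iter(explaining))] += 1
    return counts

def select_patterns(profile, min_count=2):
    """Keep the patterns seen at least min_count times (hypothetical cutoff)."""
    return [pattern for pattern, count in profile.most_common() if count >= min_count]

# Toy data, invented for illustration only.
lexicon = {"modern", "corner", "miner", "hello", "summer"}
patterns = [("m", "rn"), ("rn", "m"), ("e", "c"), ("l", "1")]
tokens = ["rnodern", "surnmer", "hcllo", "hello", "rniner", "comer", "xyz"]

error_dict = build_error_dictionary(lexicon, patterns)
profile = profile_errors(tokens, lexicon, error_dict)
selected = select_patterns(profile, min_count=2)
```

On this toy input the split "m" -> "rn" explains three tokens and is the only pattern that passes the cutoff, so candidate selection for this text would be restricted to that pattern. No ground truth is consulted at any point: the profile is derived from the garbled text and the dictionary alone.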