Using automated error profiling of texts for improved selection of correction candidates for garbled tokens

  • Authors:
  • Stoyan Mihov;Petar Mitankin;Annette Gotscharek;Ulrich Reffle;Klaus U. Schulz;Christoph Ringlstetter

  • Affiliations:
  • IPP, Bulgarian Academy of Sciences;IPP, Bulgarian Academy of Sciences;CIS, University of Munich;CIS, University of Munich;CIS, University of Munich;AICML, University of Alberta

  • Venue:
  • AI'07 Proceedings of the 20th Australian joint conference on Advances in artificial intelligence
  • Year:
  • 2007

Abstract

Lexical text correction systems are typically based on a central step: when a malformed token is found in the input text, a set of correction candidates for the token is retrieved from the given background dictionary. In previous work we introduced a method for selecting correction candidates that is fast and leads to small candidate sets with high recall. As a prerequisite, ground truth data were used to find a set of important substitutions, merges and splits that represent characteristic errors found in the text. This prior knowledge was then used to fine-tune the selection of correction candidates. Here we show that an appropriate set of possible substitutions, merges and splits for the input text can be retrieved without any ground truth data. In the new approach, we compute an error profile of the erroneous input text in a fully automated way, using so-called error dictionaries. From this profile, suitable sets of substitutions, merges and splits are derived. Error profiling with error dictionaries is simple and very fast. Overall, we obtain an adaptive form of candidate selection that is very efficient, does not need ground truth data, and leads to small candidate sets with high recall.
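
The profiling idea can be illustrated with a small Python sketch. The abstract does not specify how the error dictionaries are built or how the profile is thresholded, so everything below is an assumption made for illustration: the function names (`build_error_dictionary`, `profile_errors`, `select_patterns`), the representation of an error pattern as a `(correct, garbled)` string pair, the toy lexicon, and the `min_count` cutoff are all hypothetical. An error dictionary is taken here to map a garbled form to the lexicon words and patterns that could have produced it; the profile counts how often each pattern explains an out-of-lexicon token.

```python
from collections import Counter


def build_error_dictionary(lexicon, patterns):
    """Map each garbled form to the (word, pattern) pairs that yield it.

    Assumption: a pattern is a (correct, garbled) pair of strings, e.g.
    ("e", "c") for a substitution, ("rn", "m") for a merge and
    ("m", "rn") for a split. Each pattern is applied once per position.
    """
    error_dict = {}
    for word in lexicon:
        for correct, garbled in patterns:
            start = word.find(correct)
            while start != -1:
                form = word[:start] + garbled + word[start + len(correct):]
                # Forms that are themselves lexicon words are not errors.
                if form not in lexicon:
                    entry = (word, (correct, garbled))
                    error_dict.setdefault(form, set()).add(entry)
                start = word.find(correct, start + 1)
    return error_dict


def profile_errors(tokens, lexicon, error_dict):
    """Count, per pattern, the out-of-lexicon tokens it can explain."""
    counts = Counter()
    for token in tokens:
        if token in lexicon:
            continue
        # Count each pattern at most once per token occurrence.
        for pattern in {p for _word, p in error_dict.get(token, ())}:
            counts[pattern] += 1
    return counts


def select_patterns(profile, min_count=2):
    """Keep the patterns seen at least min_count times (assumed cutoff)."""
    return {p for p, c in profile.items() if c >= min_count}


# Toy data, invented for this example.
lexicon = {"modern", "corner", "the", "me"}
patterns = [("rn", "m"), ("e", "c"), ("m", "rn")]
error_dict = build_error_dictionary(lexicon, patterns)

tokens = ["thc", "modem", "comer", "mc", "the"]
profile = profile_errors(tokens, lexicon, error_dict)
active = select_patterns(profile)
```

In this toy input the merge `rn` to `m` and the substitution `e` to `c` each explain two tokens, while the split `m` to `rn` explains none, so only the first two would be passed on to restrict candidate selection. No ground truth is consulted at any point; only the lexicon and the input tokens are used.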