We present a new approach, based on anagram hashing, to handling typographical variation globally in large and possibly noisy text collections. Typographical variation is typically handled locally: given one particular text string, some system for retrieving near-neighbours is applied, where near-neighbours are other text strings that differ from the given string by a given number of characters. We call the difference in characters between the original string and one of its retrieved near-neighbours a particular character confusion. We present a global way of performing this action: given a possible character confusion, we identify in parallel (i.e. in one single operation on anagram-hash-derived bit vectors) all the pairs of text strings in the text collection to which that confusion applies. The proposed algorithm is evaluated on about 23,000 attested English typos from the Reuters RCV1 text collection. We further explore its usefulness for unsupervised linking of a historical Dutch word list to its contemporary counterpart.
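The global lookup described above can be illustrated with a minimal sketch. It rests on several assumptions not stated in the abstract: the anagram hash is taken to be the sum of each character's code point raised to a fixed power (the exponent 5 is chosen here for illustration), and a plain dictionary scan stands in for the bit-vector operation. The names `anagram_key`, `build_index`, and `pairs_for_confusion` are hypothetical.

```python
from collections import defaultdict

POWER = 5  # assumed exponent; the abstract does not specify the hash function


def anagram_key(s):
    """Order-independent hash: all anagrams of s share the same value."""
    return sum(ord(c) ** POWER for c in s)


def build_index(words):
    """Map each anagram key to the set of words that produce it."""
    index = defaultdict(set)
    for w in words:
        index[anagram_key(w)].add(w)
    return index


def pairs_for_confusion(index, removed, added):
    """Find all word pairs related by one confusion (removed -> added).

    A confusion is a single numeric offset between anagram keys, so one
    pass over the index finds every pair it applies to. This dictionary
    scan is a stand-in for the bit-vector operation, not a reproduction.
    """
    delta = anagram_key(added) - anagram_key(removed)
    pairs = []
    for key, sources in index.items():
        targets = index.get(key + delta)  # .get() never inserts new keys
        if not targets:
            continue
        for s in sources:
            for t in targets:
                if s != t:
                    pairs.append((s, t))
    return sorted(pairs)


words = ["receive", "recieve", "separate", "seperate", "definite", "definate"]
index = build_index(words)
a_to_e = pairs_for_confusion(index, "a", "e")
a_to_i = pairs_for_confusion(index, "a", "i")
```

In this sketch, transpositions such as "receive" and "recieve" fall under the same key, while a substitution such as "a" for "e" shows up as a fixed offset between keys. Because unrelated strings can share an offset, a practical system would presumably verify candidate pairs with a string distance check afterwards.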