Parallel identification of the spelling variants in corpora

Authors: Martin Reynaert
Affiliations: Tilburg University, The Netherlands
Venue: Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data
Year: 2009

Abstract

We present a new approach based on anagram hashing to globally handle the typographical variation in large and possibly noisy text collections. Typographical variation is typically handled in a local fashion: given one particular text string, some system of retrieving near-neighbours is applied, where near-neighbours are other text strings that differ from the particular string by a given number of characters. We call the difference in characters between the original string and one of its retrieved near-neighbours a particular character confusion. We present a global way of performing this action: given a possible particular character confusion, we identify in parallel (i.e. in one single operation on bit vectors derived from the anagram hashes) all the pairs of text strings in the text collection to which the particular confusion applies. The proposed algorithm is evaluated on about 23,000 attested English typos from the Reuters RCV1 text collection. We further explore its usefulness for unsupervised linking of a historical Dutch word list to its contemporary counterpart.
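
The global lookup described in the abstract can be illustrated with a minimal sketch. The abstract does not give the hash function or the bit-vector layout, so everything below is an assumption made for illustration: the hash is taken to be the sum of each character's code point raised to the fifth power (so that anagrams share one value), and a plain set intersection stands in for the single operation on bit vectors. The function names `anagram_value`, `build_index`, `confusion_value` and `pairs_for_confusion` are hypothetical.

```python
from collections import defaultdict


def anagram_value(text, power=5):
    # Assumed hash (not stated in the abstract): sum of code points raised
    # to a fixed power. Character order is ignored, so anagrams collide.
    return sum(ord(ch) ** power for ch in text)


def build_index(lexicon):
    # Map each anagram value to the set of strings that share it.
    index = defaultdict(set)
    for word in lexicon:
        index[anagram_value(word)].add(word)
    return index


def confusion_value(removed, added):
    # A character confusion expressed as a numeric offset between hashes:
    # value of the characters added minus value of the characters removed.
    return anagram_value(added) - anagram_value(removed)


def pairs_for_confusion(index, delta):
    # Global step: shift every key by delta and intersect with the key set.
    # This set intersection is a stand-in for the bit-vector operation.
    keys = set(index)
    hits = keys & {k + delta for k in keys}
    pairs = set()
    for target in hits:
        for source_word in index[target - delta]:
            for target_word in index[target]:
                if source_word != target_word:
                    pairs.add((source_word, target_word))
    return pairs


lexicon = ["receive", "recieve", "believe", "beleive",
           "government", "goverment", "separate", "seperate"]
index = build_index(lexicon)

# All pairs in the lexicon related by deleting one "n".
deleted_n = pairs_for_confusion(index, confusion_value("n", ""))

# All pairs related by substituting "a" with "e".
a_to_e = pairs_for_confusion(index, confusion_value("a", "e"))
```

In this sketch a single call handles one confusion for the whole lexicon at once, rather than searching near-neighbours word by word. Because strings with equal hashes are grouped together, the returned pairs are candidates only; a real system would still need to verify them, for example with an edit-distance check.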