An experiment in detection and correction of malapropisms through the web

Authors:
Igor A. Bolshakov
Affiliations:
Center for Computing Research (CIC), National Polytechnic Institute (IPN), Mexico City, Mexico
Venue:
CICLing'05 Proceedings of the 6th international conference on Computational Linguistics and Intelligent Text Processing
Year:
2005

Abstract

A malapropism is a type of semantic error: it replaces one content word with another content word that is similar in sound but semantically incompatible with the context, thus destroying text cohesion. We propose to signal a malapropism when a pair of syntactically linked content words in a text exhibits a value of a specially defined Semantic Compatibility Index (SCI) lower than a predetermined threshold. SCI is computed from web statistics of how often the words occur together and apart. Once a malapropism is detected, all possible correction candidates for both words are taken from precompiled dictionaries of paronyms, i.e. words that differ from one another by a letter or by a few prefixes or suffixes. Heuristic rules are proposed to retain only a few highly SCI-ranked candidates for the user's decision. The experiment on malapropism detection and correction is done for a hundred Russian text fragments, mainly from the web newswire, in both correct and falsified form, as well as for several hundred correction candidates. The raw occurrence statistics are taken from the Yandex web search engine. Within certain limitations, the experiment gave very promising results.
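
The detect-then-correct loop described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the SCI formula, so `compatibility_index` below is only an assumed stand-in: the logarithm of the joint hit count divided by the geometric mean of the individual hit counts. The function names, the English example words, the hit counts, the paronym lists and the threshold are all hypothetical; no real search engine is queried.

```python
import math

# All numbers below are made-up hit counts, not real web statistics.
WORD_HITS = {"travel": 1_000_000, "word": 2_000_000, "world": 4_000_000,
             "ward": 500_000, "gravel": 300_000}
PAIR_HITS = {("travel", "word"): 10, ("travel", "world"): 200_000,
             ("travel", "ward"): 5, ("gravel", "word"): 1}
# Hypothetical paronym dictionary: words one letter apart.
PARONYMS = {"word": ["world", "ward"], "travel": ["gravel"]}
THRESHOLD = -5.0  # hypothetical detection threshold


def compatibility_index(n_pair, n_first, n_second):
    """Assumed stand-in for SCI: log of joint hits over the geometric
    mean of the individual hits. Not the formula from the paper."""
    if min(n_pair, n_first, n_second) <= 0:
        return float("-inf")
    return math.log(n_pair / math.sqrt(n_first * n_second))


def score(pair, pair_hits, word_hits):
    """Look up the counts for a word pair and compute its index."""
    first, second = pair
    return compatibility_index(pair_hits.get(pair, 0),
                               word_hits.get(first, 0),
                               word_hits.get(second, 0))


def is_suspect(pair, pair_hits, word_hits, threshold):
    """Signal a possible malapropism when the index is below the threshold."""
    return score(pair, pair_hits, word_hits) < threshold


def suggest_corrections(pair, paronyms, pair_hits, word_hits, threshold, keep=3):
    """Substitute paronyms for either word and keep the best-ranked pairs
    whose index reaches the threshold."""
    first, second = pair
    candidates = [(p, second) for p in paronyms.get(first, [])]
    candidates += [(first, p) for p in paronyms.get(second, [])]
    scored = [(score(c, pair_hits, word_hits), c) for c in candidates]
    scored = [item for item in scored if item[0] >= threshold]
    scored.sort(reverse=True)
    return [c for _, c in scored[:keep]]


if __name__ == "__main__":
    pair = ("travel", "word")
    if is_suspect(pair, PAIR_HITS, WORD_HITS, THRESHOLD):
        print(suggest_corrections(pair, PARONYMS, PAIR_HITS, WORD_HITS, THRESHOLD))
```

With these invented counts, the pair ("travel", "word") falls below the threshold and only ("travel", "world") survives as a candidate for the user's decision. In a real setting the counts would come from search-engine queries, and the threshold and the number of retained candidates would need tuning.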