Non-word identification or spell checking without a dictionary

Authors:
Donald C. Comeau;W. John Wilbur
Affiliations:
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, Room N611S, 8600 Rockville Pike, Bethesda, MD;National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, Room N611S, 8600 Rockville Pike, Bethesda, MD
Venue:
Journal of the American Society for Information Science and Technology
Year:
2004

Abstract

MEDLINE® is a collection of more than 12 million references and abstracts covering recent life science literature. Because of its continued growth and cutting-edge terminology, spell checking with a traditional lexicon-based approach requires significant additional manual follow-up. In this work, an internal corpus-based context quality rating α, frequency, and simple misspelling transformations are used to rank words from most likely to be misspellings to least likely. Eleven-point average precisions of 0.891 have been achieved within a class of 42,340 all-alphabetic words having an α score less than 10. Our models predict that 16,274, or 38%, of these words are misspellings. Based on test data, this result has a recall of 79% and a precision of 86%. In other words, spell checking can be done by statistics instead of with a dictionary. As an application, we examine the time history of low-α words in MEDLINE® titles and abstracts.
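
The abstract reports eleven-point average precision for a ranked list of candidate misspellings. As a rough illustration of how that metric is commonly computed, the sketch below uses the standard interpolated definition from information retrieval. The abstract does not state which variant the authors used, so this choice is an assumption, and the function name and toy input are hypothetical.

```python
def eleven_point_average_precision(ranked_labels):
    """Interpolated eleven-point average precision for a ranked list.

    ranked_labels: booleans ordered from most to least likely misspelling;
    True means the word at that rank really is a misspelling.
    Assumption: standard interpolation, i.e. precision at recall level r
    is the best precision seen at any recall >= r.
    """
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        return 0.0

    # (recall, precision) after each rank position.
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_labels, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / total_relevant, hits / rank))

    # Average the interpolated precision over recall = 0.0, 0.1, ..., 1.0.
    total = 0.0
    for i in range(11):
        level = i / 10
        candidates = [p for r, p in points if r >= level - 1e-12]
        total += max(candidates) if candidates else 0.0
    return total / 11


# Toy ranking (hypothetical data): two true misspellings among three words.
score = eleven_point_average_precision([True, False, True])
print(round(score, 3))
```

A ranking that places every true misspelling ahead of every correct word scores 1.0 under this definition. Each misspelling ranked below a correct word lowers the interpolated precision at the higher recall levels.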