This paper proposes a non-interactive system for reducing the level of OCR-induced typographical variation in large text collections, both contemporary and historical. Text-Induced Corpus Clean-up, or ticcl (pronounced 'tickle'), focuses on high-frequency words derived from the corpus to be cleaned and gathers all typographical variants of each focus word that lie within a predefined Levenshtein distance (henceforth: ld). Simple text-induced filtering techniques help to retain as many of the true positives and to discard as many of the false positives as possible. ticcl has been evaluated on a contemporary OCR-ed Dutch text corpus and on a corpus of historical newspaper articles whose OCR quality is far lower and which is written in an older Dutch spelling. Representative samples of typographical variants from both corpora have allowed us not only to evaluate our system properly, but also to draw conclusions about how to adapt the adopted correction mechanism to OCR-error resolution. The performance scores obtained up to ld 2 indicate that the bulk of the undesirable OCR-induced typographical variation present can be removed fully automatically.
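The core idea of gathering variants around high-frequency focus words can be illustrated with a minimal Python sketch. This is not ticcl's actual implementation: it compares words by brute force, and the names `gather_variants`, `min_focus_freq` and `max_ld`, as well as the rule that a variant must be rarer than its focus word, are assumptions made purely for illustration.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def gather_variants(tokens, min_focus_freq=3, max_ld=2):
    """Map each frequent focus word to rarer corpus words within max_ld.

    Hypothetical filter (an assumption, not taken from the paper):
    a candidate counts as a variant only if it is less frequent than
    the focus word it is compared against.
    """
    freq = Counter(tokens)
    focus_words = [w for w, c in freq.items() if c >= min_focus_freq]
    variants = {}
    for focus in focus_words:
        variants[focus] = sorted(
            w for w in freq
            if w != focus
            and freq[w] < freq[focus]
            and levenshtein(focus, w) <= max_ld
        )
    return variants


if __name__ == "__main__":
    # Toy corpus: one frequent word, two OCR-like misreadings, one unrelated word.
    tokens = ["regering"] * 5 + ["regcring", "rcgcring", "minister"]
    print(gather_variants(tokens))
```

On the toy corpus, the two misreadings at ld 1 and ld 2 are grouped under `regering`, while `minister` is left out. The pairwise comparison is quadratic in the vocabulary size, so this sketch shows the ld threshold and frequency-based focus selection only and would not scale to a large corpus.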