Non-interactive OCR post-correction for giga-scale digitization projects

Authors:
Martin Reynaert
Affiliations:
Induction of Linguistic Knowledge, Tilburg University, The Netherlands
Venue:
CICLing'08 Proceedings of the 9th international conference on Computational linguistics and intelligent text processing
Year:
2008

Abstract

This paper proposes a non-interactive system for reducing the level of OCR-induced typographical variation in large text collections, both contemporary and historical. Text-Induced Corpus Clean-up, or TICCL (pronounced 'tickle'), focuses on high-frequency words derived from the corpus to be cleaned and gathers all typographical variants of each focus word that lie within a predefined Levenshtein distance (henceforth: LD). Simple text-induced filtering techniques help to retain as many of the true positives as possible and to discard as many of the false positives as possible. TICCL has been evaluated on a contemporary OCR-ed Dutch text corpus and on a corpus of historical newspaper articles, whose OCR quality is far lower and which is written in an older Dutch spelling. Representative samples of typographical variants from both corpora allowed us not only to evaluate the system properly, but also to draw conclusions on how to adapt the adopted correction mechanism to OCR-error resolution. The performance scores obtained up to LD 2 mean that the bulk of the undesirable OCR-induced typographical variation present can be removed fully automatically.
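
The core idea in the abstract, collecting for each high-frequency focus word the corpus words that lie within a given Levenshtein distance, can be illustrated with a small brute-force sketch. This is not the system's actual implementation: the function names, the thresholds (`min_focus_freq`, `max_ld`, `min_length`) and the toy token list are all assumptions made for illustration, and the filter shown (a variant must be rarer than its focus word) only stands in for the text-induced filtering the abstract mentions.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def gather_variants(tokens, min_focus_freq=3, max_ld=2, min_length=6):
    """Map each frequent focus word to rarer corpus words within max_ld.

    All parameter values are illustrative assumptions, not values
    taken from the paper.
    """
    freq = Counter(tokens)
    focus_words = [w for w, n in freq.items()
                   if n >= min_focus_freq and len(w) >= min_length]
    variants = {}
    for focus in focus_words:
        found = sorted(
            w for w, n in freq.items()
            if w != focus
            and n < freq[focus]                      # variant is rarer than focus
            and abs(len(w) - len(focus)) <= max_ld   # cheap length pre-filter
            and levenshtein(w, focus) <= max_ld
        )
        if found:
            variants[focus] = found
    return variants


# Toy corpus (invented): one frequent Dutch word plus some rare neighbours.
tokens = ("regering " * 5 + "regcring rcgering regeling vergadering").split()
result = gather_variants(tokens)
print(result)
```

On this toy input the focus word "regering" collects "rcgering" and "regcring", which look like OCR confusions, but also "regeling", a valid Dutch word in its own right. That last candidate is the kind of false positive that additional filtering would have to discard. Comparing every focus word against every word type is quadratic, so a sketch like this would not scale to a large corpus without a faster candidate lookup.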