A new generation of textual corpora: mining corpora from very large collections

Authors:
Gordon Stewart; Gregory Crane; Alison Babeu
Affiliations:
Harvard University, Cambridge, MA; Tufts University, Medford, MA; Tufts University, Medford, MA
Venue:
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Year:
2007

Abstract

While digital libraries based on page images and automatically generated text have made possible massive projects such as the Million Book Library, Open Content Alliance, Google, and others, humanists still depend upon textual corpora expensively produced with labor-intensive methods such as double-keyboarding and manual correction. This paper reports the results from an analysis of OCR-generated text for classical Greek source texts. Classicists have depended upon specialized manual keyboarding, which costs two or more times as much as keyboarding of English, both for its accuracy and because classical Greek OCR produced no usable results. We found that we could produce texts by OCR that, in some cases, approached the 99.95% accuracy rate of professional data entry. In most cases, OCR-generated text, by including the variant readings that digital corpora traditionally have left out, provides better recall and, we argue, can better serve many scholarly needs than the expensive corpora upon which classicists have relied for a generation. As digital collections expand, we will be able to collate multiple editions against each other, identify quotations of primary sources, and provide a new generation of services.
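
To make the 99.95% figure concrete, the following is a minimal sketch of how a character-level accuracy rate could be computed by comparing OCR output against a trusted transcription. The abstract does not say how accuracy was measured, so the per-character definition, the use of Levenshtein edit distance, the Unicode normalization step, and the function names `edit_distance` and `character_accuracy` are all assumptions made for illustration, not the authors' method.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Assumed definition: 1 - (edit distance / reference length).

    Both strings are normalized to NFC first, so that precomposed and
    combining forms of accented Greek letters compare as equal.
    """
    ref = unicodedata.normalize("NFC", reference)
    ocr = unicodedata.normalize("NFC", ocr_output)
    if not ref:
        raise ValueError("reference must be non-empty")
    return max(0.0, 1.0 - edit_distance(ref, ocr) / len(ref))
```

Under this assumed definition, a 99.95% rate corresponds to roughly one character error per 2,000 characters of reference text.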