Extracting two thousand years of Latin from a million book library

Authors: David Bamman; David Smith
Affiliations: Tufts University, MA; University of Massachusetts-Amherst, MA
Venue: Journal on Computing and Cultural Heritage (JOCCH)
Year: 2012

Abstract

With the rise of large open digitization projects such as the Internet Archive and Google Books, we are witnessing explosive growth in the number of source texts available to researchers in historical languages. The Internet Archive alone contains at least 27,014 texts catalogued as Latin, including classical prose and poetry written under the Roman Empire, ecclesiastical treatises from the Middle Ages, and dissertations from 19th-century Germany written, in Latin, on the philosophy of Hegel. At one billion words, this collection eclipses the extant corpus of Classical Latin by several orders of magnitude. In addition, the much larger collection of already scanned books in English, German, French, and other languages contains an unknown number of translations of many Latin books, or of parts of them. The sheer scale of this collection opens a broad vista of new research questions, and we focus here on both the opportunities and the challenges of computing over such a large space of heterogeneous texts. This massive collection is not a finely curated (much less balanced) corpus of Latin; it is, instead, simply all the Latin that can be extracted, and in its reach of twenty-one centuries (from approximately 200 BCE to 1922 CE) it arguably spans the greatest historical distance of any major textual collection today. While we might hope that the size and historical reach of this collection can eventually offer insight into grand questions such as the evolution of a language over both time and space, we must also contend with the noise inherent in a corpus that has been assembled with minimal human intervention.