Increasing recall for text re-use in historical documents to support research in the humanities

Authors:
Marco Büchler;Gregory Crane;Maria Moritz;Alison Babeu
Affiliations:
Institute for Computer Science, Leipzig University, Germany;Department of Classics, Tufts University, Boston;Institute for Computer Science, Leipzig University, Germany;Department of Classics, Tufts University, Boston
Venue:
TPDL'12 Proceedings of the Second international conference on Theory and Practice of Digital Libraries
Year:
2012

Citing 5
Cited 0

Methods for identifying versioned and plagiarized documents
Journal of the American Society for Information Science and Technology

On the Resemblance and Containment of Documents
SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997

Syntactic Query Models for Restatement Retrieval
SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval

Exploiting Sentence-Level Features for Near-Duplicate Document Detection
AIRS '09 Proceedings of the 5th Asia Information Retrieval Symposium on Information Retrieval Technology

Understanding Plagiarism Linguistic Patterns, Textual Features, and Detection Methods
IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews

Abstract

High-precision text re-use detection allows humanists to discover where and how particular authors are quoted (e.g., the different sections of Plato's work that come in and out of vogue). This paper reports on ongoing work to provide the high-recall text re-use detection that humanists often demand. Using an edition of one Greek work that marked quotations and paraphrases from the Homeric epics as our testbed, we were able to achieve a recall of at least 94% while maintaining a precision of 73%. This particular study is part of a larger effort to detect text re-use across 15 million words of Greek and 10 million words of Latin available or under development as openly licensed TEI XML.
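
To make the reported figures concrete, the following minimal Python sketch shows how precision and recall are conventionally computed when detected re-use candidates are compared against a gold standard such as an edition with marked quotations. It is an illustration only, not the authors' implementation: the function names, the pair representation, and the toy data are all hypothetical, and the abstract does not say how matches were counted.

```python
def precision_recall(detected, gold):
    """Return (precision, recall) of detected re-use pairs against a gold standard.

    Both arguments are collections of hashable items, here (passage, source)
    pairs. This representation is an assumption made for illustration.
    """
    detected, gold = set(detected), set(gold)
    true_positives = len(detected & gold)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: four quotations marked by an editor ...
gold = {("p1", "Il. 1.1"), ("p2", "Il. 2.5"), ("p3", "Od. 1.1"), ("p4", "Od. 9.3")}
# ... and five candidates proposed by a detector, three of them correct.
detected = {("p1", "Il. 1.1"), ("p2", "Il. 2.5"), ("p3", "Od. 1.1"),
            ("p5", "Il. 3.2"), ("p6", "Od. 4.4")}

precision, recall = precision_recall(detected, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

If recall is taken as exactly 0.94 (the abstract gives it as a lower bound) and precision as 0.73, `f1_score(0.73, 0.94)` comes out at roughly 0.82. That figure is derived here for orientation and is not reported in the abstract.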