When printed hypertexts go digital: information extraction from the parsing of indices

Authors:
Matteo Romanello;Monica Berti;Alison Babeu;Gregory Crane
Affiliations:
The Perseus Project - Tufts University, Medford, MA, USA;The Perseus Project - Tufts University, Medford, MA, USA;The Perseus Project - Tufts University, Medford, MA, USA;The Perseus Project - Tufts University, Medford, MA, USA
Venue:
Proceedings of the 20th ACM conference on Hypertext and hypermedia
Year:
2009

Abstract

Modern critical editions of ancient works generally include manually created indices of the other sources quoted in the text. Because such indices can be regarded as a form of domain-specific language, this paper presents a parsing-based approach to extracting information from them, in order to support the creation of a collection of fragmentary texts. It first considers the characteristics and structure of quotation indices and their importance when dealing with fragmentary texts. It then presents the results of applying a fuzzy parser to the OCR transcription of an index of quotations, extracting information from potentially noisy input.
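
To make the idea of fuzzy parsing an index more concrete, the following is a minimal sketch in Python. It is not the parser described in the paper. The entry layout ("author abbreviation, work abbreviation, passage, then page numbers"), the names `IndexEntry`, `ENTRY`, `OCR_DIGITS` and `parse_index`, and the particular OCR confusions handled (I, l and O read in place of 1 and 0) are all assumptions chosen for illustration. The sketch recognizes only the lines that fit the expected pattern, tolerates a few typical OCR slips inside them, and sets aside everything else instead of failing.

```python
import re
from typing import List, NamedTuple, Tuple


class IndexEntry(NamedTuple):
    author: str
    work: str
    passage: str
    pages: List[int]


# Hypothetical entry layout: "<Author abbr.> <Work abbr.> <passage>: <page>, <page>".
# The pattern is deliberately tolerant: it accepts ';' for ':' and the letters
# I, l, O where a digit is expected, since OCR often confuses these characters.
ENTRY = re.compile(
    r"""^\s*
        (?P<author>[A-Z][A-Za-z]*\.?)\s+
        (?P<work>[A-Z][A-Za-z]*\.?)\s+
        (?P<passage>[0-9IlO]+(?:[.,][0-9IlO]+)*)\s*
        [:;]\s*
        (?P<pages>[0-9IlO ,.]+?)\s*$
    """,
    re.VERBOSE,
)

# Assumed OCR confusions, applied only inside numeric fields.
OCR_DIGITS = str.maketrans({"I": "1", "l": "1", "O": "0"})


def parse_index(lines: List[str]) -> Tuple[List[IndexEntry], List[str]]:
    """Extract entries from index lines; return (entries, skipped lines)."""
    entries: List[IndexEntry] = []
    skipped: List[str] = []
    for line in lines:
        match = ENTRY.match(line)
        if match is None:
            # Fuzzy behavior: unrecognized input is set aside, not an error.
            skipped.append(line)
            continue
        passage = match["passage"].translate(OCR_DIGITS).replace(",", ".")
        page_text = match["pages"].translate(OCR_DIGITS)
        pages = [int(p) for p in re.findall(r"\d+", page_text)]
        entries.append(IndexEntry(match["author"], match["work"], passage, pages))
    return entries, skipped


if __name__ == "__main__":
    sample = ["INDEX LOCORUM", "Hom. Il. 1.1: 23, 45", "Hom. Il. l.I; 2O, 45"]
    found, ignored = parse_index(sample)
    for entry in found:
        print(entry)
    print("skipped:", ignored)
```

Keeping the skipped lines rather than discarding them is a deliberate choice in this sketch: with noisy OCR, the unparsed remainder is what a human editor would need to review.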