ODIN: A Model for Adapting and Enriching Legacy Infrastructure

Authors:
William D. Lewis
Affiliations:
University of Washington/CSU Fresno, USA
Venue:
E-SCIENCE '06 Proceedings of the Second IEEE International Conference on e-Science and Grid Computing
Year:
2006

Citing 0
Cited 6

An ontology for accessing transcription systems (OATS)
AfLaT '09 Proceedings of the First Workshop on Language Technologies for African Languages

Parsing, projecting & prototypes: repurposing linguistic data on the web
EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics: Demonstrations Session

Language ID in the context of harvesting language data off the web
EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics

Applying NLP technologies to the collection and enrichment of language data on the Web to aid linguistic research
LaTeCH-SHELT&R '09 Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education

Comparing language similarity across genetic and typologically-based groupings
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics

An ontology for accessing transcription systems
Language Resources and Evaluation

Abstract

The Online Database of Interlinear Text (ODIN) is a database of interlinear text "snippets", harvested mostly from scholarly documents posted to the Web. Although large amounts of language data are posted to the Web as part of scholarly discourse, making the existing "e-Linguistic infrastructure" surprisingly rich, most linguistic data available on the Web exists in legacy formats, is highly display-centric, and is often difficult to locate or interoperate over. ODIN seeks to leverage this existing infrastructure into a rich, searchable, and interoperable resource by converting readily available semi-structured data to content-centric, searchable formats. To do this, ODIN mines scholarly papers and webpages for instances of linguistic data, focusing mostly on interlinear texts, extracts them, identifies their source languages, and makes the instances available for search. Through ODIN's standard search feature, users can locate data by language name or Ethnologue code, and display lists of data by document for languages of interest. The newer Advanced Search feature allows users to locate instances by the grammatical markup used (e.g., NOM, ACC, ERG, PST, 3SG) and by linguistic construction (e.g., passives, conditionals, possessives, raising constructions). The latter is made possible through additional enrichment of the discovered data using automated statistical taggers and parsers.
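The two kinds of lookup described in the abstract (by language name or code, and by grammatical markup) can be illustrated with a minimal Python sketch. It is not ODIN's implementation. The record layout `IgtInstance`, the functions `search_by_language` and `search_by_tags`, and the rule that a gloss tag is any upper-case piece of the gloss line are all assumptions made here for illustration, and the two sample records are invented examples rather than ODIN data.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class IgtInstance:
    """One interlinear text snippet (hypothetical record layout)."""
    language: str      # language name
    code: str          # three-letter language code
    source: str        # source-language line
    gloss: str         # morpheme-by-morpheme gloss line
    translation: str   # free translation

    def gloss_tags(self) -> set:
        # Split the gloss on whitespace and the separators "-", ".", "=".
        # Assumption: upper-case pieces of length >= 2 (NOM, 3SG) are tags.
        pieces = re.split(r"[\s\-.=]+", self.gloss)
        return {p for p in pieces if len(p) >= 2 and p.isupper()}


def search_by_language(instances, query):
    """Return instances whose language name or code equals the query."""
    q = query.strip().lower()
    return [i for i in instances if q in (i.language.lower(), i.code.lower())]


def search_by_tags(instances, *tags):
    """Return instances whose gloss line contains every requested tag."""
    wanted = {t.upper() for t in tags}
    return [i for i in instances if wanted <= i.gloss_tags()]


# Invented sample records, for illustration only.
SAMPLE = [
    IgtInstance("German", "deu",
                "Der Mann sieht den Hund",
                "the.NOM man sees the.ACC dog",
                "The man sees the dog."),
    IgtInstance("Japanese", "jpn",
                "Taroo-ga hon-o yon-da",
                "Taro-NOM book-ACC read-PST",
                "Taro read a book."),
]

if __name__ == "__main__":
    for hit in search_by_tags(SAMPLE, "PST"):
        print(hit.language, "|", hit.source, "|", hit.translation)
```

Searching by construction (passives, conditionals, and so on) would need the tagger and parser enrichment the abstract mentions, which this sketch does not attempt.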