Towards a data model for the Universal Corpus

Authors: Steven Abney; Steven Bird
Affiliations: University of Michigan; University of Melbourne and University of Pennsylvania
Venue: BUCC '11 Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web
Year: 2011

Abstract

We describe the design of a comparable corpus that spans all of the world's languages and facilitates large-scale cross-linguistic processing. This Universal Corpus consists of text collections aligned at the document and sentence level, multilingual wordlists, and a small set of morphological, lexical, and syntactic annotations. The design encompasses submission, storage, and access. Submission preserves the integrity of the work, allows asynchronous updates, and facilitates scholarly citation. Storage employs a cloud-hosted filestore containing normalized source data together with a database of texts and annotations. Access is permitted to the filestore, the database, and an application programming interface. All aspects of the Universal Corpus are open, and we invite community participation in its design and implementation, and in supplying and using its data.
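
To make "aligned at the document and sentence level" concrete, here is a minimal Python sketch of one way such alignments could be represented in memory. It is an illustration only, not the schema the authors propose: the names Sentence, AlignedCorpus, doc_id, and the three-letter language codes are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sentence:
    """One sentence of one document (hypothetical record layout)."""
    doc_id: str   # assumed document identifier
    index: int    # position of the sentence within its document
    lang: str     # assumed language code, e.g. "eng"
    text: str


@dataclass
class AlignedCorpus:
    """Toy store for documents aligned at document and sentence level."""
    sentences: dict = field(default_factory=dict)  # (doc_id, index) -> Sentence
    doc_links: set = field(default_factory=set)    # unordered pairs of doc ids
    sent_links: set = field(default_factory=set)   # unordered pairs of sentence keys

    def add_sentence(self, sentence):
        self.sentences[(sentence.doc_id, sentence.index)] = sentence

    def align_documents(self, doc_a, doc_b):
        if doc_a == doc_b:
            raise ValueError("a document cannot be aligned with itself")
        self.doc_links.add(frozenset((doc_a, doc_b)))

    def align_sentences(self, key_a, key_b):
        # Sentence links are only accepted between documents already aligned.
        if frozenset((key_a[0], key_b[0])) not in self.doc_links:
            raise ValueError("documents are not aligned")
        self.sent_links.add(frozenset((key_a, key_b)))

    def translations(self, key):
        """Return the sentences linked to the sentence stored under key."""
        found = []
        for link in self.sent_links:
            if key in link:
                (other,) = link - {key}
                found.append(self.sentences[other])
        return found
```

In this sketch a caller would add sentences, call align_documents on a pair of document ids, then call align_sentences on pairs of (doc_id, index) keys. Storing links as unordered pairs keeps the alignment symmetric, so a lookup from either language returns the other side.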