The human language project: building a Universal Corpus of the world's languages

Authors:
Steven Abney; Steven Bird
Affiliations:
University of Michigan; University of Melbourne and University of Pennsylvania
Venue:
ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
Year:
2010

Citing 7

GATE: an architecture for development of robust HLT applications
ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics

The IMDI metadata framework, its current application and future direction
International Journal of Metadata, Semantics and Ontologies

Semisupervised Learning for Computational Linguistics

Frontiers in linguistic annotation for lower-density languages
LAC '06 Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006

Natural Language Processing with Python

Statistical Machine Translation

A scalable method for preserving oral literature from small languages
ICADL '10 Proceedings of the role of digital libraries in a time of global change, and 12th international conference on Asia-Pacific digital libraries

Cited 5

Subjective natural language problems: motivations, applications, characterizations, and implications
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2

Towards a data model for the Universal Corpus
BUCC '11 Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

Unsupervised multilingual learning

BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network
Artificial Intelligence

A smartphone-based ASR data collection tool for under-resourced languages
Speech Communication

Abstract

We present a grand challenge to build a corpus that will include all of the world's languages, in a consistent structure that permits large-scale cross-linguistic processing, enabling the study of universal linguistics. The focal data types, bilingual texts and lexicons, relate each language to one of a set of reference languages. We propose that the ability to train systems to translate into and out of a given language be the yardstick for determining when we have successfully captured a language. We call on the computational linguistics community to begin work on this Universal Corpus, pursuing the many strands of activity described here, as their contribution to the global effort to document the world's linguistic heritage before more languages fall silent.