wikiBABEL: community creation of multilingual data

Authors:
A. Kumaran;K. Saravanan;Sandor Maurice
Affiliations:
Microsoft Research India, Bangalore, India;Microsoft Research India, Bangalore, India;Microsoft Research, Redmond, WA
Venue:
WikiSym '08 Proceedings of the 4th International Symposium on Wikis
Year:
2008

Abstract

In this paper, we present wikiBABEL, a collaborative framework for the efficient and effective creation of multilingual content by a community of users. The wikiBABEL framework leverages two resources: fairly stable content in a source language (typically English), and a reasonable, though not necessarily perfect, machine translation system between the source language and a given target language. Together these produce rough initial content in the target language, which is published on a collaborative platform. The platform provides an intuitive user interface and a set of linguistic tools for collaborative correction of the rough content by a community of users, aiding the creation of clean content in the target language. We describe the architectural components that implement the wikiBABEL framework, namely the systems for source and target language content management, the mechanisms for coordination and collaboration, and an intuitive user interface for multilingual editing and review. Importantly, we discuss the integrated linguistic resources and tools, such as bilingual dictionaries and machine translation and transliteration systems, that help users during the content correction and creation process. We also analyze and present the prime factors, whether user-interface features or linguistic tools and resources, that significantly influence the user experience in multilingual content creation.

Beyond the creation of multilingual content, another significant motivation for the wikiBABEL framework is the creation of parallel corpora as a by-product. Parallel linguistic corpora are very valuable resources for both Statistical Machine Translation (SMT) and Crosslingual Information Retrieval (CLIR) research, and may be mined effectively from multilingual data with significant content overlap, such as the data created in the wikiBABEL framework. Creation of parallel corpora by professional translators is very expensive, and hence SMT and CLIR research has been largely confined to a handful of languages. Our attempt to engage the large and diverse Internet user population may aid the economical creation of such linguistic resources, and may make computational linguistics research possible and practical in many languages of the world.
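
The workflow described in the abstract (machine-translated draft, community correction, parallel data as a by-product) can be illustrated with a minimal sketch. This is not the wikiBABEL implementation: the names `Segment`, `seed_target`, `correct`, and `harvest_parallel` are hypothetical, and the word-for-word `toy_translate` function with its tiny English-Spanish glossary merely stands in for a real, imperfect machine translation system.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Segment:
    """One source sentence, its machine draft, and any community correction."""
    source: str
    draft: str
    corrected: Optional[str] = None
    revisions: List[str] = field(default_factory=list)  # earlier target texts

    def current(self) -> str:
        """Target text as readers would see it now."""
        return self.corrected if self.corrected is not None else self.draft


# Hypothetical glossary; unknown words pass through untranslated,
# which mimics the "rough initial content" a weak MT system produces.
GLOSSARY = {"the": "la", "house": "casa", "is": "es", "red": "roja", "cat": "gato"}


def toy_translate(sentence: str) -> str:
    """Word-for-word lookup standing in for a real MT system (assumption)."""
    return " ".join(GLOSSARY.get(word, word) for word in sentence.split())


def seed_target(sentences: List[str], translate: Callable[[str], str]) -> List[Segment]:
    """Create rough target-language content from stable source content."""
    return [Segment(source=s, draft=translate(s)) for s in sentences]


def correct(segment: Segment, new_text: str) -> None:
    """Record a community correction, keeping the previous text as history."""
    segment.revisions.append(segment.current())
    segment.corrected = new_text


def harvest_parallel(segments: List[Segment]) -> List[Tuple[str, str]]:
    """Collect (source, target) pairs only for human-corrected segments."""
    return [(s.source, s.corrected) for s in segments if s.corrected is not None]


if __name__ == "__main__":
    segments = seed_target(["the house is red", "the cat sleeps"], toy_translate)
    correct(segments[1], "el gato duerme")  # a user fixes the rough draft
    for pair in harvest_parallel(segments):
        print(pair)
```

Only corrected segments are harvested in this sketch, on the assumption that an unreviewed machine draft should not count as clean parallel data; a real system could apply a different acceptance rule, such as requiring review by a second user.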