Language independent identification of parallel sentences using Wikipedia

Authors:
Rohit G. Bharadwaj; Vasudeva Varma
Affiliations:
International Institute of Information Technology, Hyderabad, Hyderabad, India; International Institute of Information Technology, Hyderabad, Hyderabad, India
Venue:
Proceedings of the 20th International Conference Companion on World Wide Web
Year:
2011

Citing 4

A beam-search extraction algorithm for comparable data
ACLShort '09 Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

WikiBABEL: a wiki-style platform for creation of parallel data
ACLDemos '09 Proceedings of the ACL-IJCNLP 2009 Software Demonstrations

The Cross-Lingual Wiki Engine: enabling collaboration across language barriers
WikiSym '08 Proceedings of the 4th International Symposium on Wikis

Extracting parallel sentences from comparable corpora using document level alignment
HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Cited 2

Language-independent context aware query translation using Wikipedia
BUCC '11 Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

Analysis and refinement of cross-lingual entity linking
CLEF'12 Proceedings of the Third international conference on Information Access Evaluation: multilinguality, multimodality, and visual analytics

Abstract

This paper presents a novel classification-based approach to identifying parallel sentences between two languages in a language-independent way. We replace the language-specific resources that are usually required with Wikipedia, a richly structured multilingual resource. Our approach is particularly useful for extracting parallel sentences for under-resourced languages, such as most Indian and African languages, for which resources of the necessary accuracy are not readily available. We extract various statistics from the cross-lingual links present in Wikipedia and use them to generate a feature vector for each sentence pair. Each sentence pair is then classified as parallel or non-parallel using these feature vectors. We achieve a precision of up to 78%, which is encouraging when compared with other state-of-the-art approaches. These results support our hypothesis that Wikipedia can be used to evaluate the parallel coefficient between sentences that can then be used to build bilingual dictionaries.
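
The pipeline described in the abstract (link-based statistics, a feature vector per sentence pair, then a binary decision) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy `CROSS_LINKS` mapping stands in for Wikipedia cross-lingual links, the three features are hypothetical examples of link-based statistics, and the fixed weights and threshold in `is_parallel` stand in for a trained classifier.

```python
from typing import Dict, List, Sequence

# Hypothetical toy mapping standing in for Wikipedia cross-lingual links
# (source-language term -> linked target-language term).
CROSS_LINKS: Dict[str, str] = {
    "india": "bharat",
    "river": "nadi",
    "city": "shahar",
}


def features(src: str, tgt: str, links: Dict[str, str]) -> List[float]:
    """Build an illustrative feature vector for one sentence pair."""
    src_tokens = src.lower().split()
    tgt_tokens = tgt.lower().split()
    tgt_set = set(tgt_tokens)

    # Source tokens that have a cross-lingual link at all.
    linked = [t for t in src_tokens if t in links]
    # Linked source tokens whose linked term occurs in the target sentence.
    matched = [t for t in linked if links[t] in tgt_set]

    coverage = len(matched) / len(src_tokens) if src_tokens else 0.0
    hit_rate = len(matched) / len(linked) if linked else 0.0
    longest = max(len(src_tokens), len(tgt_tokens))
    length_ratio = min(len(src_tokens), len(tgt_tokens)) / longest if longest else 0.0
    return [coverage, hit_rate, length_ratio]


def is_parallel(
    vec: Sequence[float],
    weights: Sequence[float] = (0.5, 0.3, 0.2),  # assumed, not learned
    threshold: float = 0.5,                      # assumed decision boundary
) -> bool:
    """Stand-in for a trained binary classifier: weighted sum plus threshold."""
    score = sum(w * x for w, x in zip(weights, vec))
    return score >= threshold


if __name__ == "__main__":
    pairs = [
        ("india river city", "bharat nadi shahar"),
        ("india river city", "kuch aur shabd hain yahan"),
    ]
    for src, tgt in pairs:
        vec = features(src, tgt, CROSS_LINKS)
        print(src, "|", tgt, "->", vec, is_parallel(vec))
```

In a realistic setting the weights would be learned from labelled sentence pairs, and the link mapping would be built from Wikipedia's interlanguage links rather than written by hand.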