Language independent identification of parallel sentences using Wikipedia

  • Authors:
  • Rohit G. Bharadwaj;Vasudeva Varma

  • Affiliations:
  • International Institute of Information Technology, Hyderabad, Hyderabad, India;International Institute of Information Technology, Hyderabad, Hyderabad, India

  • Venue:
  • Proceedings of the 20th international conference companion on World wide web
  • Year:
  • 2011

Abstract

This paper details a novel classification-based approach to identifying parallel sentences between two languages in a language-independent way. We replace the language-specific resources usually required with Wikipedia's richly structured multilingual content. Our approach is particularly useful for extracting parallel sentences for under-resourced languages, such as most Indian and African languages, for which resources of the necessary accuracy are not readily available. We extract various statistics based on the cross-lingual links present in Wikipedia and use them to generate a feature vector for each sentence pair. Each pair of sentences is then classified as parallel or non-parallel using these feature vectors. We achieved a precision of up to 78%, which is encouraging when compared to other state-of-the-art approaches. These results support our hypothesis that Wikipedia can be used to evaluate the parallel coefficient between sentences, which can be used to build bilingual dictionaries.
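
As a rough illustration of the idea, the sketch below turns a sentence pair into a feature vector using a table of cross-lingual links. The abstract does not list the actual features or name the classifier, so everything here is an assumption made for illustration: the tiny `CROSS_LINKS` table, the three features (link coverage, link density, length ratio), the whitespace tokenizer, and the threshold rule that stands in for a trained binary classifier.

```python
from typing import Dict, List

# Hypothetical cross-lingual link table (English title -> Hindi title).
# In practice this would be harvested from Wikipedia's inter-language links.
CROSS_LINKS: Dict[str, str] = {
    "india": "भारत",
    "river": "नदी",
    "ganges": "गंगा",
}


def tokenize(sentence: str) -> List[str]:
    """Lowercase and split on whitespace; no language-specific tooling."""
    return sentence.lower().split()


def feature_vector(src: str, tgt: str, links: Dict[str, str]) -> List[float]:
    """Return [link coverage, link density, length ratio] for a sentence pair.

    These three features are illustrative assumptions, not the paper's set.
    """
    src_tokens = tokenize(src)
    tgt_tokens = tokenize(tgt)
    tgt_vocab = set(tgt_tokens)

    # Source tokens that have a cross-lingual link at all.
    linked = [t for t in src_tokens if t in links]
    # Linked source tokens whose linked counterpart occurs in the target.
    matched = [t for t in linked if links[t] in tgt_vocab]

    coverage = len(matched) / len(linked) if linked else 0.0
    density = len(matched) / len(src_tokens) if src_tokens else 0.0
    longest = max(len(src_tokens), len(tgt_tokens))
    length_ratio = min(len(src_tokens), len(tgt_tokens)) / longest if longest else 0.0
    return [coverage, density, length_ratio]


def is_parallel(features: List[float], threshold: float = 0.5) -> bool:
    """Stand-in decision rule on link coverage; a trained classifier would go here."""
    return features[0] >= threshold


if __name__ == "__main__":
    pair = ("India river Ganges flows", "भारत नदी गंगा बहती है")
    feats = feature_vector(pair[0], pair[1], CROSS_LINKS)
    print(feats, is_parallel(feats))
```

In a fuller setup, vectors like these would be computed for labeled parallel and non-parallel pairs and passed to a binary classifier, which would replace the fixed threshold used here.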