This paper details a novel classification-based approach for identifying parallel sentences between two languages in a language-independent way. We substitute the language-specific resources that are usually required with the richly structured multilingual content of Wikipedia. Our approach is particularly useful for extracting parallel sentences in under-resourced languages, such as most Indian and African languages, for which resources of the necessary accuracy are not readily available. We extract various statistics based on the cross-lingual links present in Wikipedia and use them to generate a feature vector for each sentence pair. Each pair of sentences is then classified as parallel or non-parallel using these feature vectors. We achieved a precision of up to 78%, which is encouraging compared with other state-of-the-art approaches. These results support our hypothesis that Wikipedia can be used to estimate the parallel coefficient between sentences, which in turn can be used to build bilingual dictionaries.
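The pipeline the abstract describes — a feature vector per sentence pair derived from Wikipedia cross-lingual links, followed by binary classification into parallel vs. non-parallel — can be sketched as below. This is a minimal illustration under assumptions: the two features shown (cross-lingual link overlap and length ratio), the `link_dict` resource, and the hand-set classifier weights are hypothetical stand-ins, not the paper's actual feature set or trained model.

```python
import math

def features(src_tokens, tgt_tokens, link_dict):
    """Build a feature vector for one sentence pair.

    link_dict is a hypothetical resource mapping source-language
    tokens to sets of target-language equivalents, as might be
    harvested from Wikipedia cross-lingual (interlanguage) links.
    """
    # Link-overlap feature: fraction of source tokens whose linked
    # equivalent appears in the target sentence.
    linked = sum(1 for t in src_tokens
                 if link_dict.get(t, set()) & set(tgt_tokens))
    overlap = linked / len(src_tokens) if src_tokens else 0.0
    # Length-ratio feature: parallel sentences tend to have
    # comparable lengths.
    ratio = (min(len(src_tokens), len(tgt_tokens))
             / max(len(src_tokens), len(tgt_tokens), 1))
    return [overlap, ratio]

def classify(fv, weights=(4.0, 2.0), bias=-3.0):
    """Logistic-regression-style binary decision: the pair is
    labelled parallel if the sigmoid score exceeds 0.5.
    The weights here are illustrative, not trained."""
    score = bias + sum(w * x for w, x in zip(weights, fv))
    prob = 1.0 / (1.0 + math.exp(-score))
    return prob >= 0.5

# Toy example with an invented two-entry link dictionary.
link_dict = {"house": {"casa"}, "white": {"blanca"}}
parallel = classify(features(["the", "white", "house"],
                             ["la", "casa", "blanca"], link_dict))
nonparallel = classify(features(["the", "white", "house"],
                                ["un", "perro"], link_dict))
print(parallel, nonparallel)  # → True False
```

In practice the weights would be learned from labelled sentence pairs, and the feature vector would include many more link-based statistics; the sketch only shows the shape of the decision step.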