Multilingual sentence alignment from Wikipedia as multilingual comparable corpora

Authors:
Min-Hsiang Li; Vitaly Klyuev; Shih-Hung Wu
Affiliations:
University of Aizu, Aizu-Wakamatsu, Fukushima, Japan and Chaoyang University of Technology, Taiwan; University of Aizu, Aizu-Wakamatsu, Fukushima, Japan; Chaoyang University of Technology, Taiwan
Venue:
Proceedings of the 13th International Conference on Humans and Computers
Year:
2010

Abstract

Bilingual and multilingual dictionaries are necessary resources for machine translation and cross-language information retrieval. With the help of such dictionaries, an information retrieval system can find documents with similar content in different languages. Maintaining these dictionaries is an interesting research topic: researchers can collect multilingual parallel corpora from the Internet and use them to find translations of new words. Parallel corpora can therefore help both machine translation and cross-language information retrieval, and sentence alignment of parallel corpora is one way to mine the necessary knowledge. In the real world, however, many documents are available as comparable corpora rather than parallel ones. We therefore introduce a technique for extracting parallel sentences from Wikipedia, treated as a multilingual comparable corpus.
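
To make the idea of extracting parallel sentences from comparable documents concrete, the following is a minimal sketch and not the method of the paper, which the abstract does not describe in detail. It assumes a toy bilingual lexicon, whitespace tokenization, and a Jaccard-overlap threshold; the names translate_tokens, overlap_score, extract_parallel, and the sample data are all hypothetical. Languages such as Chinese or Japanese would need a word segmenter before this step.

```python
from typing import Dict, List, Set, Tuple


def translate_tokens(tokens: List[str], lexicon: Dict[str, Set[str]]) -> Set[str]:
    """Collect every known target-language translation of the source tokens."""
    translated: Set[str] = set()
    for tok in tokens:
        translated |= lexicon.get(tok, set())
    return translated


def overlap_score(src: str, tgt: str, lexicon: Dict[str, Set[str]]) -> float:
    """Jaccard overlap between translated source tokens and target tokens."""
    src_translated = translate_tokens(src.lower().split(), lexicon)
    tgt_tokens = set(tgt.lower().split())
    if not src_translated or not tgt_tokens:
        return 0.0
    return len(src_translated & tgt_tokens) / len(src_translated | tgt_tokens)


def extract_parallel(
    src_sents: List[str],
    tgt_sents: List[str],
    lexicon: Dict[str, Set[str]],
    threshold: float = 0.5,  # assumed cut-off, chosen only for illustration
) -> List[Tuple[str, str, float]]:
    """For each source sentence, keep its best-scoring target sentence
    if the score reaches the threshold; otherwise drop the sentence."""
    pairs: List[Tuple[str, str, float]] = []
    for s in src_sents:
        best, best_score = None, 0.0
        for t in tgt_sents:
            score = overlap_score(s, t, lexicon)
            if score > best_score:
                best, best_score = t, score
        if best is not None and best_score >= threshold:
            pairs.append((s, best, best_score))
    return pairs


# Toy Spanish-English data (hypothetical, for illustration only).
LEXICON = {
    "el": {"the"}, "gato": {"cat"}, "duerme": {"sleeps"},
    "perro": {"dog"}, "come": {"eats"},
}
SRC = ["el gato duerme", "el perro come", "hola mundo"]
TGT = ["the cat sleeps", "wikipedia is a free encyclopedia", "the dog eats"]

for src, tgt, score in extract_parallel(SRC, TGT, LEXICON):
    print(f"{score:.2f}\t{src}\t{tgt}")
```

In a comparable corpus most sentences have no counterpart, so the threshold matters more than in a truly parallel corpus: sentences whose best candidate scores too low are discarded rather than forced into an alignment.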