Multilingual sentence alignment from Wikipedia as multilingual comparable corpora

  • Authors:
  • Min-Hsiang Li; Vitaly Klyuev; Shih-Hung Wu

  • Affiliations:
  • University of Aizu, Aizu-Wakamatsu, Fukushima, Japan and Chaoyang University of Technology, Taiwan; University of Aizu, Aizu-Wakamatsu, Fukushima, Japan; Chaoyang University of Technology, Taiwan

  • Venue:
  • Proceedings of the 13th International Conference on Humans and Computers
  • Year:
  • 2010

Abstract

Bilingual and multilingual dictionaries are necessary resources for machine translation and cross-language information retrieval. With the help of these dictionaries, an information retrieval system can find documents with similar content in different languages. Maintaining such dictionaries is an interesting research topic: researchers can collect multilingual parallel corpora from the Internet and use them to find translations of new words. Parallel corpora can therefore support both machine translation and cross-language information retrieval, and sentence alignment of parallel corpora is one way to mine the necessary knowledge. In the real world, however, many documents are available as comparable rather than parallel corpora. We therefore introduce a technique for extracting parallel sentences from Wikipedia, treated as a multilingual comparable corpus.