Discovering parallel text from the World Wide Web

  • Authors:
  • Jisong Chen;Rowena Chau;Chung-Hsing Yeh

  • Affiliations:
  • Monash University, Clayton, Victoria, Australia;Monash University, Clayton, Victoria, Australia;Monash University, Clayton, Victoria, Australia

  • Venue:
  • ACSW Frontiers '04 Proceedings of the second workshop on Australasian information security, Data Mining and Web Intelligence, and Software Internationalisation - Volume 32
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

Parallel corpus is a rich linguistic resource for various multilingual text management tasks, including cross-lingual text retrieval, multilingual computational linguistics and multilingual text mining. Constructing a parallel corpus requires effective alignment of parallel documents. In this paper, we develop a parallel page identification system for identifying and aligning parallel documents from the World Wide Web. The system crawls the Web to fetch potentially parallel multilingual Web documents using a Web spider. To determine the parallelism between potential document pairs, two modules are developed. First, a filename comparison module is used to check filename resemblance. Second, a content analysis module is used to measure the semantic similarity. The experiment conducted to a multilingual Web site shows the effectiveness of the system.