Discovering parallel text from the World Wide Web

Authors:
Jisong Chen; Rowena Chau; Chung-Hsing Yeh
Affiliations:
Monash University, Clayton, Victoria, Australia; Monash University, Clayton, Victoria, Australia; Monash University, Clayton, Victoria, Australia
Venue:
ACSW Frontiers '04 Proceedings of the second workshop on Australasian information security, Data Mining and Web Intelligence, and Software Internationalisation - Volume 32
Year:
2004

Abstract

A parallel corpus is a rich linguistic resource for multilingual text management tasks such as cross-lingual text retrieval, multilingual computational linguistics and multilingual text mining. Constructing a parallel corpus requires effective alignment of parallel documents. In this paper, we develop a parallel page identification system that identifies and aligns parallel documents from the World Wide Web. The system uses a Web spider to crawl the Web and fetch potentially parallel multilingual Web documents. Two modules determine whether a candidate document pair is parallel. First, a filename comparison module checks filename resemblance. Second, a content analysis module measures semantic similarity. An experiment conducted on a multilingual Web site shows the effectiveness of the system.
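
The abstract does not say how filename resemblance is computed, so the following is only a minimal sketch of one plausible approach, not the authors' method. It assumes that parallel pages share a file stem apart from a language marker (for example `about_en.html` and `about_cn.html`). The marker list, the function names and the use of a character-level similarity ratio are all hypothetical choices made for illustration. The content analysis module is not covered.

```python
import re
from difflib import SequenceMatcher
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical language markers; the abstract does not list the ones the system uses.
LANG_MARKERS = {"en", "eng", "english", "zh", "cn", "chi", "chinese", "gb", "big5"}


def normalized_filename(url: str) -> str:
    """Return the lowercased file stem of a URL with language markers removed."""
    stem = PurePosixPath(urlparse(url).path).stem.lower()
    # Split on common separators and drop tokens that only identify the language.
    tokens = [t for t in re.split(r"[-_.]", stem) if t and t not in LANG_MARKERS]
    return "_".join(tokens)


def filename_resemblance(url_a: str, url_b: str) -> float:
    """Score in [0, 1]; 1.0 means the stems are identical once markers are stripped."""
    name_a = normalized_filename(url_a)
    name_b = normalized_filename(url_b)
    return SequenceMatcher(None, name_a, name_b).ratio()


if __name__ == "__main__":
    pair = ("http://example.org/news/about_en.html",
            "http://example.org/news/about_cn.html")
    print(filename_resemblance(*pair))
```

In a pipeline like the one described, a score of this kind would act as a cheap first filter, with only the candidate pairs that pass it going on to the more expensive content comparison.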