WI-IATW '07 Proceedings of the 2007 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Workshops
A parallel corpus is a rich linguistic resource for multilingual text management tasks, including cross-lingual text retrieval, multilingual computational linguistics, and multilingual text mining. Constructing a parallel corpus requires effective alignment of parallel documents. In this paper, we develop a parallel page identification system that identifies and aligns parallel documents from the World Wide Web. The system uses a Web spider to crawl the Web and fetch potentially parallel multilingual Web documents. Two modules then determine whether a candidate document pair is parallel. First, a filename comparison module checks how closely the filenames resemble each other. Second, a content analysis module measures the semantic similarity of the two documents. An experiment conducted on a multilingual Web site shows the effectiveness of the system.
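The two-module check described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the abstract does not specify the filename heuristics, the similarity measure, or any thresholds. The language-marker list, the use of `difflib.SequenceMatcher` for filename resemblance, the dictionary-based Dice overlap standing in for semantic similarity, and both threshold values are assumptions made only for illustration.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Hypothetical language markers; the source does not list the ones it uses.
LANG_MARKERS = ("en", "eng", "english", "zh", "chi", "chinese", "fr", "french")
_MARKER_RE = re.compile(
    r"(?<![a-z0-9])(?:" + "|".join(LANG_MARKERS) + r")(?![a-z0-9])"
)


def normalize_path(url: str) -> str:
    """Lower-case the URL path and remove stand-alone language markers."""
    path = urlparse(url).path.lower()
    return _MARKER_RE.sub("", path)


def filename_resemblance(url_a: str, url_b: str) -> float:
    """Character-level resemblance of the two normalized paths, in [0, 1]."""
    return SequenceMatcher(
        None, normalize_path(url_a), normalize_path(url_b)
    ).ratio()


def content_similarity(tokens_a, tokens_b, lexicon) -> float:
    """Dice overlap after mapping tokens_a through a bilingual lexicon.

    A crude stand-in for the semantic similarity the abstract mentions.
    """
    translated = {lexicon[t] for t in tokens_a if t in lexicon}
    target = set(tokens_b)
    if not translated or not target:
        return 0.0
    return 2 * len(translated & target) / (len(translated) + len(target))


def is_parallel(url_a, url_b, tokens_a, tokens_b, lexicon,
                name_threshold=0.9, content_threshold=0.5) -> bool:
    """Accept a pair only if both checks pass (thresholds are assumed)."""
    return (
        filename_resemblance(url_a, url_b) >= name_threshold
        and content_similarity(tokens_a, tokens_b, lexicon) >= content_threshold
    )
```

In this sketch, a pair such as `http://example.org/news/index_en.html` and `http://example.org/news/index_fr.html` passes the filename check because the two paths differ only in the language marker. The content check then guards against pages that share a naming pattern but carry unrelated text.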