Feature-based method for document alignment in comparable news corpora

Authors:
Thuy Vu;Ai Ti Aw;Min Zhang
Affiliations:
Institute for Infocomm Research, South Tower, Singapore;Institute for Infocomm Research, South Tower, Singapore;Institute for Infocomm Research, South Tower, Singapore
Venue:
EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
Year:
2009

Citing 9
Cited 7

A study of smoothing methods for language models applied to Ad Hoc information retrieval

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Efficient Similarity Search In Sequence Databases

FODO '93 Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms
Cross-Lingual Document Similarity Calculation Using the Multilingual Thesaurus EUROVOC

CICLing '02 Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing
Mining comparable bilingual text corpora for cross-language information integration

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Weakly supervised named entity transliteration and discovery from multilingual comparable corpora

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Multilingual document clustering: an heuristic approach based on cognate named entities

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Multilingual and cross-lingual news topic tracking

COLING '04 Proceedings of the 20th international conference on Computational Linguistics
SlideSeer: a digital library of aligned document and presentation pairs

Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Automatic identification of parallel documents with light or without linguistic resources

AI'05 Proceedings of the 18th Canadian Society conference on Advances in Artificial Intelligence

MARS: multilingual access and retrieval system with enhanced query translation and document retrieval

ACLDemos '09 Proceedings of the ACL-IJCNLP 2009 Software Demonstrations
EM-based hybrid model for bilingual terminology extraction from comparable corpora

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
From bilingual dictionaries to interlingual document representations

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
Improving bilingual projections via sparse covariance matrices

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Detecting highly confident word translations from comparable corpora without any prior knowledge

EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
Cross-language information retrieval models based on latent topic models trained with document-aligned comparable corpora

Information Retrieval
Mining a Persian-English comparable corpus for cross-language information retrieval

Information Processing and Management: an International Journal

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this paper, we present a feature-based method to align documents with similar content across two sets of bilingual comparable corpora from daily news texts. We evaluate the contribution of each individual feature and investigate the incorporation of these diverse statistical and heuristic features for the task of bilingual document alignment. Experimental results on the English-Chinese and English-Malay comparable news corpora show that our proposed Discrete Fourier Transform-based term frequency distribution feature is very effective. It contributes 4.1% and 8% to performance improvement over Pearson's correlation method on the two comparable corpora. In addition, when more heuristic and statistical features as well as a bilingual dictionary are utilized, our method shows an absolute performance improvement of 23.2% and 15.3% on the two sets of bilingual corpora when comparing with a prior information retrieval-based method.