In this paper, we present a feature-based method for aligning documents with similar content across two sets of bilingual comparable corpora drawn from daily news texts. We evaluate the contribution of each individual feature and investigate how these diverse statistical and heuristic features can be combined for the task of bilingual document alignment. Experimental results on English-Chinese and English-Malay comparable news corpora show that our proposed Discrete Fourier Transform-based term frequency distribution feature is very effective: it improves performance by 4.1% and 8%, respectively, over Pearson's correlation method on the two corpora. In addition, when further heuristic and statistical features and a bilingual dictionary are used, our method achieves absolute performance improvements of 23.2% and 15.3% on the two sets of bilingual corpora compared with a prior information retrieval-based method.
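The abstract does not spell out how the Discrete Fourier Transform-based feature is computed, so the following is only a minimal sketch of one plausible reading, not the authors' formulation. It assumes each term is represented by its daily frequency counts over a shared date window, that counts are normalized into a distribution, and that two terms are compared by the Euclidean distance between their first few DFT coefficients. The function names, the parameter `k`, and the sample counts are all hypothetical. A Pearson correlation is included only as the baseline the abstract names.

```python
import cmath
import math


def dft(series):
    """Naive O(n^2) discrete Fourier transform, scaled by 1/sqrt(n)."""
    n = len(series)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(series))
        / math.sqrt(n)
        for k in range(n)
    ]


def normalize(series):
    """Scale raw daily counts so they sum to 1 (all zeros if the term never occurs)."""
    total = sum(series)
    return [x / total for x in series] if total else [0.0] * len(series)


def dft_distance(freq_a, freq_b, k=3):
    """Euclidean distance between the first k DFT coefficients (assumed feature).

    Coefficient 0 only reflects total mass, which is equal after normalization,
    so the distance is driven by the low-frequency shape of the two series.
    """
    coeffs_a = dft(normalize(freq_a))[:k]
    coeffs_b = dft(normalize(freq_b))[:k]
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(coeffs_a, coeffs_b)))


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (baseline); 0.0 if undefined."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily counts over an 8-day window (illustrative data only).
english_term = [0, 1, 5, 9, 4, 1, 0, 0]
candidate_translation = [0, 2, 6, 8, 3, 1, 0, 0]
unrelated_term = [3, 0, 0, 1, 0, 0, 4, 6]

near = dft_distance(english_term, candidate_translation)
far = dft_distance(english_term, unrelated_term)
print("closer match is the candidate translation:", near < far)
```

Under these assumptions, a smaller `dft_distance` suggests that two terms peak on similar dates, which can then serve as one feature among several when scoring candidate document pairs. Keeping only the low-frequency coefficients is a design choice borrowed from sequence similarity search; the number of coefficients actually used in the paper is not stated in the abstract.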