Adaptive Parallel Sentences Mining from Web Bilingual News Collection

Authors:
Bing Zhao; Stephan Vogel
Venue:
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
Year:
2002

Abstract

In this paper, a robust, adaptive approach for mining parallel sentences from a bilingual comparable news collection is described. Sentence length models and lexicon-based models are combined under a maximum likelihood criterion. Specific models are proposed to handle insertions and deletions that are frequent in bilingual data collected from the web. The proposed approach is adaptive, updating the translation lexicon iteratively using the mined parallel data to get better vocabulary coverage and translation probability parameter estimation. Experiments are carried out on 10 years of the Xinhua bilingual news collection. Using the mined data, we get a significant improvement in word-to-word alignment accuracy in machine translation modeling.
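
To make the idea of combining a sentence length model with a lexicon-based model more concrete, the following is a minimal sketch and not the authors' formulation. It assumes a Gaussian over the normalized length difference and an IBM Model 1 style lexical score, added together in the log domain. The function names, the `ratio`, `sigma`, `floor`, and `weight` parameters, and the toy lexicon are all hypothetical and chosen only for illustration.

```python
import math


def length_log_prob(src_len, tgt_len, ratio=1.0, sigma=0.5):
    """Gaussian log-density over the normalized length difference.

    Assumption: target length is roughly ratio * source length, with a
    deviation that grows with the square root of the source length.
    """
    src_len = max(src_len, 1)  # guard against empty source sentences
    delta = (tgt_len - ratio * src_len) / math.sqrt(src_len)
    return -0.5 * (delta / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def lexicon_log_prob(src_tokens, tgt_tokens, lexicon, floor=1e-6):
    """IBM Model 1 style log-likelihood of the target, averaged per token.

    `lexicon` maps (source_word, target_word) to a translation probability.
    Unseen pairs contribute 0.0, and each token probability is floored so
    the logarithm stays finite.
    """
    if not src_tokens or not tgt_tokens:
        return math.log(floor)
    total = 0.0
    for t in tgt_tokens:
        p = sum(lexicon.get((s, t), 0.0) for s in src_tokens) / len(src_tokens)
        total += math.log(max(p, floor))
    return total / len(tgt_tokens)


def pair_score(src_tokens, tgt_tokens, lexicon, weight=1.0):
    """Combined log-score: length model plus weighted lexical model."""
    return (length_log_prob(len(src_tokens), len(tgt_tokens))
            + weight * lexicon_log_prob(src_tokens, tgt_tokens, lexicon))


# Toy lexicon with romanized placeholder source words (hypothetical data).
toy_lexicon = {("mao", "cat"): 0.9, ("gou", "dog"): 0.9}
good = pair_score(["mao"], ["cat"], toy_lexicon)
bad = pair_score(["mao"], ["dog"], toy_lexicon)
print(good > bad)
```

In an adaptive setup of the kind the abstract describes, sentence pairs scoring above some threshold would be added to the training data and the lexicon re-estimated from them before the next mining pass. The threshold and the re-estimation procedure are not specified here and are left out of the sketch.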