Extracting parallel paragraphs and sentences from English-Persian translated documents

Authors:
Mohammad Sadegh Rasooli;Omid Kashefi;Behrouz Minaei-Bidgoli
Affiliations:
Department of Computer Engineering, Iran University of Science and Technology, Iran;Department of Computer Engineering, Iran University of Science and Technology, Iran;Department of Computer Engineering, Iran University of Science and Technology, Iran
Venue:
AIRS'11 Proceedings of the 7th Asia conference on Information Retrieval Technology
Year:
2011

Abstract

Sentence and paragraph alignment is essential for preparing the parallel texts needed in applications such as machine translation. For under-resourced languages like Persian, the lack of sufficient linguistic data makes this task challenging. In this paper, we propose a hybrid sentence and paragraph alignment model for Persian-English parallel documents, based on simple linguistic features and on length similarity between the sentences and paragraphs of the source and target languages. As alignment metrics, we use a small bilingual dictionary of Persian-English nouns, punctuation marks, and length similarity. We combine these features in a linear model and use a genetic algorithm to learn the weights of the linear equation. Evaluation results show that the extracted features improve on the baseline model, which is purely length-based.
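
The idea of scoring candidate alignments with a weighted linear combination of features, and learning the weights with a genetic algorithm, can be illustrated with a minimal sketch. This is not the authors' implementation: the feature values, the fitness function (fraction of source sentences whose top-scoring candidate is the true translation), and all names and parameters (`EXAMPLES`, `score`, `fitness`, `evolve`, population size, mutation scale) are assumptions made for illustration only.

```python
import random

# Each candidate is a feature vector:
#   (length_similarity, dictionary_overlap, punctuation_match)
# All values are assumed to lie in [0, 1] and are made up for this sketch.
EXAMPLES = [
    # (candidate feature vectors for one source sentence, index of the true translation)
    ([(0.90, 0.1, 0.5), (0.85, 0.8, 1.0), (0.40, 0.0, 0.0)], 1),
    ([(0.95, 0.7, 1.0), (0.60, 0.2, 0.5), (0.92, 0.1, 0.0)], 0),
    ([(0.70, 0.0, 0.5), (0.99, 0.1, 0.5), (0.80, 0.9, 1.0)], 2),
    ([(0.50, 0.6, 1.0), (0.97, 0.0, 0.0), (0.30, 0.1, 0.5)], 0),
]

LENGTH_ONLY = [1.0, 0.0, 0.0]  # weights of a purely length-based baseline


def score(weights, feats):
    """Linear model: weighted sum of the alignment features."""
    return sum(w * f for w, f in zip(weights, feats))


def fitness(weights, examples=EXAMPLES):
    """Fraction of source sentences whose best-scoring candidate is the gold one."""
    correct = 0
    for candidates, gold in examples:
        best = max(range(len(candidates)),
                   key=lambda i: score(weights, candidates[i]))
        correct += (best == gold)
    return correct / len(examples)


def evolve(pop_size=20, generations=30, mutation_sd=0.1, seed=0):
    """Toy genetic algorithm over weight vectors (hypothetical settings)."""
    rng = random.Random(seed)
    # Start from the length-only baseline plus random weight vectors.
    population = [list(LENGTH_ONLY)] + [
        [rng.random() for _ in range(3)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        # Truncation selection: the fitter half survives unchanged (elitism).
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover: take each weight from one parent at random.
            child = [rng.choice(pair) for pair in zip(a, b)]
            # Gaussian mutation, clipped so weights stay non-negative.
            child = [max(0.0, w + rng.gauss(0, mutation_sd)) for w in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


if __name__ == "__main__":
    learned = evolve()
    print("length-only fitness:", fitness(LENGTH_ONLY))
    print("learned weights:", learned)
    print("learned fitness:", fitness(learned))
```

Because the fitter half of each generation is carried over unchanged and the length-only weights are part of the initial population, the returned weights can never score below the length-only baseline on the toy data. On real data the fitness would have to be computed against a manually aligned development set, and the feature definitions would follow the paper rather than the placeholder values used here.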