A hybrid approach to align sentences and words in English-Hindi parallel corpora

Authors:
Niraj Aswani; Robert Gaizauskas
Affiliations:
University of Sheffield, Sheffield, UK; University of Sheffield, Sheffield, UK
Venue:
ParaText '05 Proceedings of the ACL Workshop on Building and Using Parallel Texts
Year:
2005

Abstract

In this paper we describe a system that aligns English-Hindi parallel corpora at the sentence and word levels. Sentence alignment uses a simple sentence-length approach: we use regression techniques to learn parameters that characterise the relationship between the lengths of two sentences in parallel text. Word alignment uses a hybrid, multi-feature approach, with dictionary lookup as the primary technique and other methods, such as local word grouping, transliteration similarity (edit distance) and a nearest-aligned-neighbours approach, to deal with many-to-many word alignment. Our experiments are based on the EMILLE (Enabling Minority Language Engineering) corpus. We obtained 99.09% accuracy for many-to-many sentence alignment, and 77% precision and 67.79% recall for many-to-many word alignment.
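
As a rough illustration of two ingredients the abstract names, the sketch below fits a least-squares line relating source and target sentence lengths and computes an edit-distance similarity between two strings. It is not the authors' implementation. The abstract does not say which length unit, regression model, or transliteration scheme the system uses, so the simple linear model, the normalisation by the longer string, and all function names here are assumptions. The similarity function also assumes that both words have already been mapped into a common script.

```python
from typing import List, Sequence, Tuple


def fit_length_regression(pairs: Sequence[Tuple[int, int]]) -> Tuple[float, float]:
    """Least-squares fit of target_len = slope * source_len + intercept.

    `pairs` holds (source_length, target_length) for sentence pairs that are
    already known to be aligned (hypothetical training data).
    """
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two aligned sentence pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    if var_x == 0:
        raise ValueError("source lengths must not all be equal")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev: List[int] = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def transliteration_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; dividing by the longer length is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

In a pipeline along these lines, the fitted slope and intercept could be used to predict a target sentence length and score candidate sentence pairs by how far they deviate from it, while the similarity score could serve as one feature for word pairs that the dictionary does not cover, such as proper names.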