In this paper we describe a system that aligns English-Hindi parallel corpora at both the sentence level and the word level. For sentence alignment we use a simple sentence-length approach, applying regression techniques to learn the parameters that characterise the relationship between the lengths of two sentences in parallel text. For word alignment we use a hybrid, multi-feature approach: dictionary lookup is the primary technique, supplemented by local word grouping, transliteration similarity (edit distance), and a nearest-aligned-neighbours approach to handle many-to-many word alignment. Our experiments are based on the EMILLE (Enabling Minority Language Engineering) corpus. We obtained 99.09% accuracy for many-to-many sentence alignment, and 77% precision and 67.79% recall for many-to-many word alignment.
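
As a rough illustration of two ingredients named in the abstract, the sketch below fits a length relationship by regression and scores transliteration similarity with edit distance. It is a minimal sketch and not the system's actual implementation: the abstract does not say which regression model, length unit, or normalisation was used, so ordinary least squares, the function names (`fit_length_model`, `length_residual`, `transliteration_similarity`), and the normalisation by the longer string are all assumptions made here. It also assumes both words have already been romanised into the same script.

```python
from statistics import mean


def fit_length_model(pairs):
    """Least-squares fit of target length on source length (assumed model).

    pairs: list of (source_len, target_len) from known aligned sentences.
    Returns (slope, intercept). Needs at least two distinct source lengths.
    """
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    slope = sxy / sxx
    return slope, my - slope * mx


def length_residual(src_len, tgt_len, slope, intercept):
    """Distance between the observed target length and the predicted one.

    Smaller values mean the candidate pair fits the learned relationship better.
    """
    return abs(tgt_len - (slope * src_len + intercept))


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def transliteration_similarity(a, b):
    """Similarity in [0, 1]: 1 minus edit distance over the longer length."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))
```

In use, the fitted slope and intercept would rank candidate sentence pairs by `length_residual`, while `transliteration_similarity` would flag likely name or loan-word matches (for example a romanised Hindi word against its English counterpart) when dictionary lookup finds nothing. Any acceptance threshold for either score would have to be tuned on held-out data.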