Development of Hindi-Punjabi parallel corpus using existing Hindi-Punjabi machine translation system

Authors:
Pardeep Kumar;Vishal Goyal
Affiliations:
Punjabi University, Patiala;Punjabi University, Patiala
Venue:
Proceedings of the First International Conference on Intelligent Interactive Technologies and Multimedia
Year:
2010

Citing 4
Cited 0

Aligning sentences in parallel corpora
ACL '91 Proceedings of the 29th annual meeting on Association for Computational Linguistics

A program for aligning sentences in bilingual corpora
ACL '91 Proceedings of the 29th annual meeting on Association for Computational Linguistics

Aligning a parallel English-Chinese corpus statistically with lexical criteria
ACL '94 Proceedings of the 32nd annual meeting on Association for Computational Linguistics

The Duluth word alignment system
HLT-NAACL-PARALLEL '03 Proceedings of the HLT-NAACL 2003 Workshop on Building and using parallel texts: data driven machine translation and beyond - Volume 3

Abstract

This paper describes the development of a Hindi-Punjabi sentence-aligned parallel corpus of 50K sentences, built using an existing Hindi-Punjabi Machine Translation (MT) system (available at http://h2p.learnpunjabi.org). Such a parallel corpus is a very important resource for natural language applications and research in this field, and one was needed to support work on newer and better techniques. The corpus is sentence aligned and available in both .doc and .xml formats. It will shortly be made freely available on the internet for use by researchers working in NLP. During development of the corpus, errors of several categories were found in the output of the Hindi-Punjabi MT system, including transliteration, out-of-vocabulary, and grammatical agreement errors. A complete analysis of these errors is also presented. The errors were removed manually to produce a clean and accurate parallel corpus. A list of new words was generated from the out-of-vocabulary words and added to the lexicon of the existing MT system, which increased its accuracy from 94% to 94.5%.
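
As a rough illustration of what a sentence-aligned corpus in .xml form might look like, the sketch below stores each Hindi sentence next to its Punjabi counterpart and reads the pairs back. The abstract does not give the actual schema, so the element and attribute names (`corpus`, `pair`, `source`, `target`, `id`), the function names, and the sample sentences are all assumptions made for this example, not the authors' format.

```python
import xml.etree.ElementTree as ET


def build_aligned_corpus(pairs):
    """Build an XML tree with one <pair> element per aligned sentence pair.

    The element and attribute names used here are hypothetical; the
    paper's actual .xml schema is not described in the abstract.
    """
    root = ET.Element("corpus", {"source-lang": "hi", "target-lang": "pa"})
    for index, (hindi, punjabi) in enumerate(pairs, start=1):
        pair = ET.SubElement(root, "pair", {"id": str(index)})
        ET.SubElement(pair, "source").text = hindi
        ET.SubElement(pair, "target").text = punjabi
    return root


def read_aligned_corpus(root):
    """Recover the list of (source, target) sentence pairs from the tree."""
    return [(p.findtext("source"), p.findtext("target")) for p in root.iter("pair")]


# Illustrative sentence pairs written for this sketch, not taken from the corpus.
sample_pairs = [
    ("यह एक किताब है।", "ਇਹ ਇੱਕ ਕਿਤਾਬ ਹੈ।"),
    ("मेरा नाम राम है।", "ਮੇਰਾ ਨਾਮ ਰਾਮ ਹੈ।"),
]

corpus = build_aligned_corpus(sample_pairs)
xml_text = ET.tostring(corpus, encoding="unicode")

# Parsing the serialized text again should give back the same aligned pairs.
round_trip = read_aligned_corpus(ET.fromstring(xml_text))
```

Keeping both sides of a pair inside a single element means the alignment is carried by the structure itself, so a manual correction to one sentence cannot shift it out of step with its translation.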