Development of Hindi-Punjabi parallel corpus using existing Hindi-Punjabi machine translation system

  • Authors:
  • Pardeep Kumar;Vishal Goyal

  • Affiliations:
  • Punjabi University, Patiala;Punjabi University, Patiala

  • Venue:
  • Proceedings of the First International Conference on Intelligent Interactive Technologies and Multimedia
  • Year:
  • 2010

Abstract

This paper describes the development of a Hindi-Punjabi sentence-aligned parallel corpus of 50K sentences using an existing Hindi-Punjabi Machine Translation (MT) system (available at http://h2p.learnpunjabi.org). A parallel corpus is a crucial resource for natural language applications and research in this field, and one was needed in order to work on newer and better techniques. The corpus has been sentence aligned and is available in both .doc and .xml formats. It will shortly be made freely available on the internet for researchers working in NLP. During development of the corpus, errors of several categories were found in the output of the Hindi-Punjabi MT system, including transliteration, out-of-vocabulary, and grammar agreement errors, among others. A complete analysis of these errors is also presented. The errors were removed manually to produce a clean and accurate parallel corpus. A list of new words was generated from the out-of-vocabulary words and added to the lexicon of the existing MT system, which increased its accuracy from 94% to 94.5%.
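
As a rough illustration of the two artifacts the abstract mentions (a sentence-aligned .xml corpus and a list of out-of-vocabulary words), the following minimal Python sketch may help. It is not the authors' implementation: the abstract does not give the XML schema or say how out-of-vocabulary words were detected, so the element names (`corpus`, `pair`, `hi`, `pa`), the function names, and the whitespace tokenization are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def build_corpus_xml(pairs):
    """Serialize (hindi, punjabi) sentence pairs as a sentence-aligned XML string.

    The element and attribute names are hypothetical; the paper's actual
    .xml layout is not described in the abstract.
    """
    root = ET.Element("corpus", {"source": "hi", "target": "pa"})
    for idx, (hindi, punjabi) in enumerate(pairs, start=1):
        # One <pair> per aligned sentence, numbered from 1.
        pair = ET.SubElement(root, "pair", {"id": str(idx)})
        ET.SubElement(pair, "hi").text = hindi
        ET.SubElement(pair, "pa").text = punjabi
    return ET.tostring(root, encoding="unicode")


def find_oov_words(sentences, lexicon):
    """Return the sorted unique tokens that are absent from the lexicon.

    Tokenization is a plain whitespace split with punctuation stripped,
    which is an assumption and much simpler than a real MT system's analysis.
    """
    oov = set()
    for sentence in sentences:
        for token in sentence.split():
            token = token.strip("।,.?!\"'()")
            if token and token not in lexicon:
                oov.add(token)
    return sorted(oov)
```

For example, `find_oov_words(["यह किताब है।"], {"यह", "है"})` would flag `किताब` as a candidate for manual review before it is added to a lexicon.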