Morphological richness offsets resource demand- experiences in constructing a POS tagger for Hindi

  • Authors:
  • Smriti Singh; Kuhoo Gupta; Manish Shrivastava; Pushpak Bhattacharyya

  • Affiliations:
  • Indian Institute of Technology, Mumbai, Maharashtra, India (all authors)

  • Venue:
  • COLING-ACL '06 Proceedings of the COLING/ACL on Main conference poster sessions
  • Year:
  • 2006

Abstract

In this paper we report our work on building a POS tagger for a morphologically rich language, Hindi. The theme of the research is to vindicate the stand that if morphology is strong and harnessable, then the lack of training corpora is not debilitating. We establish a POS tagging methodology that resource-disadvantaged languages (those lacking annotated corpora) can use. The methodology relies on a locally annotated, modestly sized corpus (15,562 words), exhaustive morphological analysis backed by a high-coverage lexicon, and a decision-tree-based learning algorithm (CN2). The system was evaluated with 4-fold cross-validation on a corpus from the news domain (www.bbc.co.uk/hindi). The current POS tagging accuracy is 93.45% and can be further improved.
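The 4-fold cross-validation protocol mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' system: the tagger here is a simple most-frequent-tag baseline standing in for the morphology-driven CN2 learner, and the function names (`k_fold_splits`, `train_baseline`, `accuracy`, `cross_validate`) and the toy sentences are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def k_fold_splits(items, k=4):
    """Yield (train, test) pairs; fold i tests on items whose index % k == i."""
    for i in range(k):
        test = [x for j, x in enumerate(items) if j % k == i]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test


def train_baseline(sentences):
    """Most-frequent-tag-per-word baseline (a stand-in, NOT the paper's CN2 learner)."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sent in sentences:
        for word, tag in sent:
            counts[word][tag] += 1
            all_tags[tag] += 1
    default = all_tags.most_common(1)[0][0]  # fallback tag for unseen words
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lexicon, default


def accuracy(model, sentences):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    lexicon, default = model
    total = correct = 0
    for sent in sentences:
        for word, tag in sent:
            total += 1
            correct += lexicon.get(word, default) == tag
    return correct / total if total else 0.0


def cross_validate(sentences, k=4):
    """Average token accuracy over k train/test splits."""
    scores = [accuracy(train_baseline(tr), te)
              for tr, te in k_fold_splits(sentences, k)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical toy data: transliterated words with made-up coarse tags.
    toy = [[("ladka", "NN"), ("khata", "VB"), ("hai", "AUX")],
           [("ladki", "NN"), ("jati", "VB"), ("hai", "AUX")]] * 4
    print(f"mean 4-fold accuracy: {cross_validate(toy, k=4):.2%}")
```

Averaging over the four held-out folds gives a single accuracy figure of the kind reported in the abstract; on a small corpus this uses every annotated sentence for testing exactly once.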