Automatic Sanskrit segmentizer using finite state transducers

  • Authors: Vipul Mittal

  • Affiliations: Language Technologies Research Center, Gachibowli, Hyderabad, India

  • Venue: ACLstudent '10 Proceedings of the ACL 2010 Student Research Workshop
  • Year: 2010

Abstract

In this paper, we propose a novel method for automatically segmenting a Sanskrit string into its constituent words. The input to our segmentizer is a Sanskrit string, encoded either in Unicode or in Roman transliteration, and the output is a set of possible splits, each with an associated weight. We follow two different approaches to segmenting Sanskrit text, both using sandhi rules extracted from a parallel corpus of manually sandhi-split text. The first approach augments the finite state transducer used to analyze Sanskrit morphology and traverses it to segment a word; the second generates all possible segmentations and validates each constituent using a morphological analyzer.
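The second approach (generate candidate splits, then validate each constituent) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the two sandhi rules, the four-word lexicon, the transliteration in which "A" stands for long a, and the names `SANDHI_RULES`, `LEXICON`, `is_valid`, and `segment` are hypothetical. A set lookup stands in for the morphological analyzer, and the weights mentioned in the abstract are omitted.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy rules: surface sound -> possible (end of left word, start of right word).
# In the paper the rules are extracted from a sandhi-split corpus; these two are hand-picked.
SANDHI_RULES: Dict[str, List[Tuple[str, str]]] = {
    "A": [("a", "a"), ("a", "A"), ("A", "a")],  # e.g. a + A -> A
    "e": [("a", "i")],                          # e.g. a + i -> e
}

# Hypothetical stand-in for a morphological analyzer: a tiny word list.
LEXICON: Set[str] = {"rAma", "Alaya", "ca", "iha"}


def is_valid(word: str) -> bool:
    """Accept a constituent only if the (toy) analyzer recognizes it."""
    return word in LEXICON


def segment(s: str) -> List[List[str]]:
    """Return every split of s whose constituents are all valid words."""
    results: List[List[str]] = [[s]] if is_valid(s) else []
    for k in range(1, len(s)):
        # Candidate 1: a plain split with no sound change at position k.
        candidates = [(s[:k], s[k:])]
        # Candidate 2: undo a sandhi rule whose surface form starts at position k.
        for surface, pairs in SANDHI_RULES.items():
            if s.startswith(surface, k):
                for left_end, right_start in pairs:
                    rest = right_start + s[k + len(surface):]
                    candidates.append((s[:k] + left_end, rest))
        for left, rest in candidates:
            # Requiring a strictly shorter remainder guarantees termination.
            if is_valid(left) and len(rest) < len(s):
                for tail in segment(rest):
                    results.append([left] + tail)
    return results


print(segment("rAmAlaya"))
print(segment("ceha"))
```

In this sketch, validation prunes a candidate as soon as its left constituent is rejected, so the search never expands splits that begin with an unrecognized word. A realistic system would replace `is_valid` with a real analyzer and rank the surviving splits.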