Automatic Sanskrit segmentizer using finite state transducers

Author: Vipul Mittal
Affiliation: Language Technologies Research Center, Gachibowli, Hyderabad, India
Venue: ACLstudent '10 Proceedings of the ACL 2010 Student Research Workshop
Year: 2010

Abstract

In this paper, we propose a novel method for automatically segmenting a Sanskrit string into its constituent words. The input to our segmentizer is a Sanskrit string encoded either in Unicode or in Roman transliteration, and the output is a set of possible splits, each with an associated weight. We followed two approaches to segmenting Sanskrit text, both using sandhi rules extracted from a parallel corpus of manually sandhi-split text. The first approach augments the finite state transducer used to analyze Sanskrit morphology and traverses it to segment a word; the second generates all possible segmentations and validates each constituent using a morphological analyzer.
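
The second approach (generate candidate splits, then validate each constituent) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the three reverse sandhi rules, their weights, the toy lexicon that stands in for a morphological analyzer, and the choice to score a split as the product of its rule weights. None of it is taken from the paper's rule set or scoring scheme. The input uses a WX-style Roman transliteration in which "A" denotes long a.

```python
from typing import Callable, Dict, List, Tuple

Rule = Tuple[str, str, float]      # (left-word ending, right-word beginning, weight)
Split = Tuple[List[str], float]    # (words, weight of the split)

# Hypothetical reverse sandhi rules: a surface string maps to the pairs
# of word-final / word-initial sounds that could have produced it.
# The weights are invented placeholders, not values from the paper.
REVERSE_RULES: Dict[str, List[Rule]] = {
    "A": [("a", "a", 0.4), ("a", "A", 0.3), ("A", "a", 0.2), ("A", "A", 0.1)],
    "e": [("a", "i", 1.0)],
    "o": [("a", "u", 1.0)],
}

# Toy lexicon standing in for a real morphological analyzer.
TOY_LEXICON = {"rAma", "AlayaH", "na", "iti"}


def is_valid_word(word: str) -> bool:
    """Placeholder validity check; a real system would call a morph analyzer."""
    return word in TOY_LEXICON


def segment(
    text: str,
    is_valid: Callable[[str], bool] = is_valid_word,
    rules: Dict[str, List[Rule]] = REVERSE_RULES,
) -> List[Split]:
    """Return every split whose constituents all pass is_valid, best first."""
    results: List[Split] = []

    def recurse(rest: str, words: List[str], weight: float) -> None:
        # The remaining text may itself be a complete final word.
        if is_valid(rest):
            results.append((words + [rest], weight))
        # Try undoing a sandhi rule at every position after the first character.
        for i in range(1, len(rest)):
            for surface, candidates in rules.items():
                if not rest.startswith(surface, i):
                    continue
                tail = rest[i + len(surface):]
                for left_end, right_start, w in candidates:
                    left = rest[:i] + left_end
                    # Only continue when the left constituent is a valid word.
                    if is_valid(left):
                        recurse(right_start + tail, words + [left], weight * w)

    recurse(text, [], 1.0)
    return sorted(results, key=lambda split: split[1], reverse=True)
```

With these toy rules, `segment("rAmAlayaH")` yields the single split `rAma` + `AlayaH` with weight 0.3, and `segment("neti")` yields `na` + `iti` with weight 1.0. The sketch only considers boundaries where a rule changed the surface form; word boundaries that leave the string unchanged would need an additional identity rule.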