Shallow syntax analysis in Sanskrit guided by semantic nets constraints

  • Author: Gérard Huet
  • Affiliation: INRIA, Rocquencourt, France
  • Venue: Proceedings of the 2006 International Workshop on Research Issues in Digital Libraries
  • Year: 2006

Abstract

We present the state of the art of a computational platform for the analysis of classical Sanskrit. The platform comprises modules for phonology, morphology, segmentation and shallow syntax analysis, organized around a structured lexical database. It relies on the Zen toolkit for finite-state automata and transducers, which provides data structures and algorithms for the modular construction and execution of finite-state machines in a functional framework. Some of the layers proceed in bottom-up synthesis mode: for instance, the noun and verb morphological modules generate all inflected forms from the stems and roots listed in the lexicon. Morphemes are assembled through internal sandhi, and the inflected forms are stored with morphological tags in dictionaries usable for lemmatization. These dictionaries are then compiled into transducers implementing the analysis of external sandhi, the phonological process which merges words together by euphony. This provides a tagging segmenter, which analyses a sentence presented as a stream of phonemes and produces a stream of tagged lexical entries, hyperlinked to the lexicon. The next layer is a syntax analyser, guided by semantic net constraints expressing dependencies between the word forms. Finite verb forms demand semantic roles, according to valency patterns depending on the voice (active, passive) of the form and the governance (transitive, etc.) of the root. Conversely, noun/adjective forms provide actors which may fill those roles, provided agreement constraints are satisfied. Tool words are mapped to transducers operating on tagged streams, allowing linguistic phenomena such as coordination to be modeled by abstract interpretation of actor streams. The parser ranks the various interpretations (matchings of actors with roles) with penalties and returns the minimum-penalty analyses to the user for final validation of ambiguities. The whole platform is organized as a Web service, allowing the piecewise tagging of a Sanskrit text.
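
To make the morphology layer concrete, here is a minimal OCaml sketch of morpheme assembly through internal sandhi. The rule set, the encoding (one ASCII character per phoneme, long ā written "aa"), and the function names are invented for illustration; this is not the platform's actual code.

    (* Toy internal sandhi: contract the junction of a stem and a suffix.
       Rules are (final, initial, contraction); the list is illustrative. *)
    let sandhi_rules = [ ('a', 'i', "e"); ('a', 'u', "o"); ('a', 'a', "aa") ]

    (* Glue a suffix onto a stem, applying the first matching junction rule. *)
    let glue stem suffix =
      let n = String.length stem in
      if n = 0 || suffix = "" then stem ^ suffix
      else
        match
          List.find_opt
            (fun (l, f, _) -> l = stem.[n - 1] && f = suffix.[0])
            sandhi_rules
        with
        | Some (_, _, r) ->
            String.sub stem 0 (n - 1) ^ r
            ^ String.sub suffix 1 (String.length suffix - 1)
        | None -> stem ^ suffix

    let () = print_endline (glue "deva" "indra")   (* prints "devendra" *)

Running such glue rules over all stems and affixes listed in the lexicon, in synthesis mode, is what populates the tagged inflected-form dictionaries described above.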
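
The tagging segmenter then inverts external sandhi: from a continuous stream of phonemes it must recover every sequence of inflected forms whose sandhied concatenation yields that stream. The platform does this with transducers compiled from the dictionaries; the following backtracking miniature, with an invented three-word lexicon and a single rule (ā + e → ai), only illustrates the inversion problem.

    (* Miniature inverse of external sandhi by backtracking search.
       Lexicon and rules are invented; the real system compiles its
       dictionaries into transducers that perform this search. *)
    let lexicon = [ "siitaa"; "eva"; "gacchati" ]

    (* (u, v, z): a word-final u before a word-initial v surfaces as z. *)
    let rules = [ ("aa", "e", "ai") ]

    let is_prefix p s =
      String.length p <= String.length s
      && String.sub s 0 (String.length p) = p
    let is_suffix u s =
      let lu = String.length u and ls = String.length s in
      lu <= ls && String.sub s (ls - lu) lu = u
    let drop n s = String.sub s n (String.length s - n)

    (* All ways to read surface string s as a sandhied word sequence.
       With this rule set every step consumes input, so it terminates. *)
    let rec segment s =
      if s = "" then [ [] ]
      else
        List.concat_map
          (fun w ->
            let literal =                  (* w appears unchanged *)
              if is_prefix w s then
                List.map (fun ws -> w :: ws)
                  (segment (drop (String.length w) s))
              else []
            and sandhied =                 (* w's final u contracted to z *)
              List.concat_map
                (fun (u, v, z) ->
                  if is_suffix u w then
                    let surf =
                      String.sub w 0 (String.length w - String.length u) ^ z
                    in
                    if is_prefix surf s then
                      (* restore the swallowed initial v of the next word *)
                      List.map (fun ws -> w :: ws)
                        (segment (v ^ drop (String.length surf) s))
                    else []
                  else [])
                rules
            in
            literal @ sandhied)
          lexicon

Here segment "siitaivagacchati" returns the single segmentation ["siitaa"; "eva"; "gacchati"]. Over a realistic lexicon many segmentations compete, and the morphological tags attached to each recognized form feed the syntax layer described next.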
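
The agreement constraints governing which actors may fill which roles are checked over those morphological tags. A toy version, with invented tag types, in which one nominal form may qualify another only when gender, number and case coincide:

    (* Invented morphological tags; the platform's actual tags are richer. *)
    type gender = Masc | Fem | Neu
    type number = Sing | Dual | Plur
    type case = Nom | Acc | Ins | Dat | Abl | Gen | Loc | Voc
    type tag = { g : gender; n : number; c : case }

    (* An adjective actor may qualify a noun actor only under agreement. *)
    let agrees adj noun = adj.g = noun.g && adj.n = noun.n && adj.c = noun.c

    let () =
      let raama = { g = Masc; n = Sing; c = Nom }       (* raamaH *)
      and dhiira = { g = Masc; n = Sing; c = Nom } in   (* dhiiraH *)
      assert (agrees dhiira raama)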
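
Finally, the abstract describes the parser ranking the candidate matchings of actors with roles by penalties and returning only the minimum-penalty analyses. A hypothetical skeleton of that ranking step, with invented types and penalty values (e.g. an agreement violation or an unfilled mandatory role would each contribute some cost):

    (* Invented types: an analysis pairs a role assignment with the total
       penalty accumulated while matching actors to roles. *)
    type role = Agent | Goal
    type analysis = { assignment : (role * string) list; penalty : int }

    (* Keep only the minimum-penalty interpretations for user validation. *)
    let best = function
      | [] -> []
      | analyses ->
          let m =
            List.fold_left (fun acc a -> min acc a.penalty) max_int analyses
          in
          List.filter (fun a -> a.penalty = m) analyses

    let () =
      let candidates =
        [ { assignment = [ (Agent, "raamaH"); (Goal, "vanam") ]; penalty = 0 }
        ; { assignment = [ (Agent, "vanam") ]; penalty = 5 } ]
      in
      List.iter
        (fun a ->
          Printf.printf "penalty %d with %d roles filled\n" a.penalty
            (List.length a.assignment))
        (best candidates)

In this toy run only the zero-penalty analysis survives; the platform instead returns all minimum-penalty analyses to the user, who validates the remaining ambiguities.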