A functional toolkit for morphological and phonological processing, application to a Sanskrit tagger

Authors: Gérard Huet
Affiliations: INRIA Rocquencourt, BP 105, F-78153 Le Chesnay Cedex (e-mail: Gerard.Huet@inria.fr)
Venue: Journal of Functional Programming
Year: 2005


Abstract

We present the Zen toolkit for morphological and phonological processing of natural languages. The toolkit is presented in literate programming style, in the Pidgin ML subset of the Objective Caml functional programming language. It is based on a systematic representation of finite-state automata and transducers as decorated lexical trees. All operations on the state-space data structures use the zipper technique, and a uniform sharing functor permits systematic maximal sharing of these structures as directed acyclic graphs (dags). A particular case of lexical maps is especially convenient for building invertible morphological operations, such as dictionaries of inflected forms, using a notion of differential word. As a particular application, we describe a general method for tagging a natural language text given as a phoneme stream, by analysing possible euphonic liaisons between words belonging to a lexicon of inflected forms. The method follows the toolkit methodology: it constructs a non-deterministic transducer implementing rational rewrite rules by mechanical decoration of a trie representation of the lexicon index. The algorithm is linear in the size of the lexicon. A coroutine interpreter is given, and its correctness and completeness are formally proved. An application to the segmentation of Sanskrit by sandhi analysis is demonstrated.
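
The notion of differential word mentioned in the abstract can be illustrated with a minimal sketch in standard OCaml syntax. It assumes one plausible reading: a differential word records how many trailing letters to drop from one word and which letters to append to obtain another. The names `word`, `delta`, `diff`, `patch`, `explode`, `implode` and `take` are chosen here for illustration and are not claimed to match the toolkit's own interface; plain characters stand in for phonemes.

```ocaml
(* Illustrative sketch only: names and types are assumptions,
   not the Zen toolkit's actual API. *)

(* A word is a list of letters; characters stand in for phonemes. *)
type word = char list

(* A differential word (n, s) reads: drop the last n letters, then append s. *)
type delta = int * word

(* Conversions between strings and words, for readability of the examples. *)
let explode (s : string) : word = List.init (String.length s) (String.get s)
let implode (w : word) : string = String.of_seq (List.to_seq w)

(* diff w w' skips the common prefix of w and w'. What is left of w
   must be dropped, and what is left of w' must be appended. *)
let rec diff (w : word) (w' : word) : delta =
  match w, w' with
  | c :: r, c' :: r' when c = c' -> diff r r'
  | _ -> (List.length w, w')

(* take k w returns the first k letters of w (all of w if it is shorter). *)
let rec take (k : int) (w : word) : word =
  match w with
  | c :: r when k > 0 -> c :: take (k - 1) r
  | _ -> []

(* patch d w applies the differential word d to w.
   It fails if d asks to drop more letters than w contains. *)
let patch ((n, s) : delta) (w : word) : word =
  let keep = List.length w - n in
  if keep < 0 then invalid_arg "patch: word too short"
  else take keep w @ s
```

Under these assumptions, `diff (explode "devena") (explode "deva")` evaluates to `(3, ['a'])`, and applying that pair to `devena` with `patch` gives back `deva`. Storing such a pair with an inflected form, rather than the full stem, is one way a dictionary of inflected forms could be inverted to recover stems, since many forms share the same small differential.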