Extracting and classifying Urdu multiword expressions

Authors:
Annette Hautli;Sebastian Sulger
Affiliations:
University of Konstanz, Germany;University of Konstanz, Germany
Venue:
HLT-SS '11 Proceedings of the ACL 2011 Student Session
Year:
2011

Abstract

This paper describes a method for automatically extracting and classifying multiword expressions (MWEs) for Urdu on the basis of a relatively small unannotated corpus (around 8.12 million tokens). The MWEs are extracted by an unsupervised method and classified into two distinct classes, namely locations and person names. The classification is based on simple heuristics that take the co-occurrence of MWEs with distinct postpositions into account. The resulting classes are evaluated against a hand-annotated gold standard and achieve F-scores of 0.5 for locations and 0.746 for person names. A target application is the Urdu ParGram grammar, where MWEs are needed to generate a more precise syntactic and semantic analysis.
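
To make the classification and evaluation steps concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes that each candidate MWE comes with the list of tokens observed directly after it in the corpus, and that the reported F-score is the balanced F1 measure. The romanized postposition sets and the names classify_mwe and f_score are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Assumed, illustrative postposition sets (romanized); not taken from the paper.
LOCATION_POSTPOSITIONS = {"meN", "par", "tak"}
PERSON_POSTPOSITIONS = {"ne", "ko"}


def classify_mwe(following_tokens):
    """Label a candidate MWE by the postpositions seen right after it."""
    counts = Counter(following_tokens)
    location_votes = sum(counts[p] for p in LOCATION_POSTPOSITIONS)
    person_votes = sum(counts[p] for p in PERSON_POSTPOSITIONS)
    if location_votes == person_votes:
        # No evidence, or a tie: leave the candidate unclassified.
        return "unknown"
    return "location" if location_votes > person_votes else "person"


def f_score(predicted, gold):
    """Balanced F-score (F1) of a predicted set against a gold-standard set."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A candidate followed mostly by tokens from the first set would be labelled a location, one followed mostly by tokens from the second set a person name. The predicted members of each class can then be compared with the hand-annotated gold standard by calling f_score once per class.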