Detecting complex predicates in Hindi using POS projection across parallel corpora

  • Authors:
  • Amitabha Mukerjee;Ankit Soni;Achla M. Raina

  • Affiliations:
  • Indian Institute of Technology Kanpur, Kanpur, India;Indian Institute of Technology Kanpur, Kanpur, India;Indian Institute of Technology Kanpur, Kanpur, India

  • Venue:
  • MWE '06 Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties
  • Year:
  • 2006

Abstract

Complex predicates (CPs) are multiword complexes that function as single verbal units. CPs are particularly pervasive in Hindi and other Indo-Aryan languages, but a usage account driven by corpus-based identification of these constructs has not been possible, since single-language systems based on rules and statistical approaches require reliable tools (POS taggers, parsers, etc.) that are unavailable for Hindi. This paper describes the development of the first such database, based on the simple idea of projecting POS tags across an English-Hindi parallel corpus. The CP types considered include adjective-verb (AV), noun-verb (NV), adverb-verb (Adv-V), and verb-verb (VV) composites. CPs are hypothesized where a verb in English is projected onto a multi-word sequence in Hindi. While this process misses some CPs, those that are detected appear to be reliable (83% precision, 46% recall). The resulting database lists usage instances of 1439 CPs in 4400 sentences.
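The projection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the English side is already POS-tagged with Penn Treebank style tags (verbs start with "VB"), that a word alignment is supplied as a mapping from English token indices to Hindi token indices, and that a candidate must cover a contiguous Hindi span. The function name `hypothesize_cps`, the data layout, and the transliterated example sentence are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def hypothesize_cps(
    en_tagged: List[Tuple[str, str]],
    hi_tokens: List[str],
    alignment: Dict[int, List[int]],
) -> List[Tuple[str, List[str]]]:
    """Return (English verb, Hindi token sequence) candidate CPs.

    A candidate is proposed when one English verb is aligned to two or
    more Hindi tokens. Requiring the Hindi tokens to be contiguous is an
    assumption of this sketch, not a claim about the paper's method.
    """
    candidates = []
    for en_index, (word, pos) in enumerate(en_tagged):
        if not pos.startswith("VB"):  # assumed Penn Treebank verb tags
            continue
        hi_indices = sorted(set(alignment.get(en_index, [])))
        is_multiword = len(hi_indices) >= 2
        is_contiguous = (
            is_multiword
            and hi_indices[-1] - hi_indices[0] == len(hi_indices) - 1
        )
        if is_contiguous:
            candidates.append((word, [hi_tokens[j] for j in hi_indices]))
    return candidates


# Illustrative sentence pair (transliterated Hindi, hand-made alignment).
en_tagged = [("He", "PRP"), ("helped", "VBD"), ("me", "PRP")]
hi_tokens = ["usne", "meri", "madad", "ki"]
alignment = {0: [0], 1: [2, 3], 2: [1]}

print(hypothesize_cps(en_tagged, hi_tokens, alignment))
```

In this toy pair the single English verb "helped" maps onto the two-token Hindi sequence "madad ki", so it is proposed as a noun-verb candidate. In practice such candidates would still need filtering, which is consistent with the abstract's report that precision is below 100%.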