Statistically-driven alignment-based multiword expression identification for technical domains

  • Authors:
  • Helena de Medeiros Caseli; Aline Villavicencio; André Machado; Maria José Finatto

  • Affiliations:
  • Federal University of São Carlos, Brazil; Federal University of Rio Grande do Sul, Brazil and Bath University, UK; Federal University of Rio Grande do Sul, Brazil; Federal University of Rio Grande do Sul, Brazil

  • Venue:
  • MWE '09 Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications
  • Year:
  • 2009

Abstract

Multiword Expressions (MWEs) are one of the stumbling blocks for more precise Natural Language Processing (NLP) systems. In particular, the lack of coverage of MWEs in resources can negatively affect the performance of tasks and applications, and can lead to loss of information or communication errors. This is especially problematic in technical domains, where a significant portion of the vocabulary is composed of MWEs. This paper investigates the use of a statistically-driven alignment-based approach to the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examine how a second language can provide relevant cues for this task. We report results obtained by a combination of statistical measures and linguistic information, and compare these to those reported in the literature. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates.