Statistically-driven alignment-based multiword expression identification for technical domains

Authors:
Helena de Medeiros Caseli;Aline Villavicencio;André Machado;Maria José Finatto
Affiliations:
Federal University of São Carlos, Brazil;Federal University of Rio Grande do Sul, Brazil and Bath University, UK;Federal University of Rio Grande do Sul, Brazil;Federal University of Rio Grande do Sul, Brazil
Venue:
MWE '09 Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications
Year:
2009

Abstract

Multiword Expressions (MWEs) are one of the stumbling blocks for more precise Natural Language Processing (NLP) systems. In particular, the lack of coverage of MWEs in resources can negatively affect the performance of tasks and applications, and can lead to loss of information or communication errors. This is especially problematic in technical domains, where a significant portion of the vocabulary is composed of MWEs. This paper investigates the use of a statistically-driven alignment-based approach to the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examine how a second language can provide relevant cues for this task. We report results obtained by a combination of statistical measures and linguistic information, and compare these with those reported in the literature. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates.
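
To make the idea of a second language providing cues more concrete, here is a minimal sketch of one way an alignment-based filter could work. It is not taken from the paper: the function name `mwe_candidates`, the `(source_index, target_index)` alignment format, and the rule "two or more adjacent source tokens aligned to the same target token" are all assumptions chosen for illustration, and the sentence pair is invented toy data.

```python
from collections import defaultdict


def mwe_candidates(source_tokens, alignment):
    """Return runs of >= 2 adjacent source tokens aligned to one target token.

    `alignment` is a hypothetical format: an iterable of
    (source_index, target_index) pairs for a single sentence pair.
    """
    # Group source positions by the target position they are aligned to.
    by_target = defaultdict(list)
    for src_idx, tgt_idx in alignment:
        by_target[tgt_idx].append(src_idx)

    candidates = []
    for tgt_idx in sorted(by_target):
        indices = sorted(set(by_target[tgt_idx]))
        run = [indices[0]]
        for idx in indices[1:]:
            if idx == run[-1] + 1:
                # Still contiguous: extend the current run.
                run.append(idx)
            else:
                # Gap found: keep the run only if it spans several tokens.
                if len(run) >= 2:
                    candidates.append(" ".join(source_tokens[i] for i in run))
                run = [idx]
        if len(run) >= 2:
            candidates.append(" ".join(source_tokens[i] for i in run))
    return candidates


# Toy example (invented): English "whooping cough" aligned to the single
# Portuguese token "coqueluche" at target position 3.
english = ["the", "child", "has", "whooping", "cough"]
links = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 3)]
print(mwe_candidates(english, links))
```

In a setup like this, the candidate list would still need the statistical and linguistic filtering the abstract mentions before being handed to a lexicographer.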