Comparing and combining a semantic tagger and a statistical tool for MWE extraction

Authors:
Scott Songlin Piao; Paul Rayson; Dawn Archer; Tony McEnery
Affiliations:
Department of Linguistics and Modern English Language, Lancaster University, Lancaster LA1 4YT, United Kingdom; Computing Department, Lancaster University, Lancaster LA1 4YT, United Kingdom; Department of Linguistics and Modern English Language, Lancaster University, Lancaster LA1 4YT, United Kingdom; Department of Linguistics and Modern English Language, Lancaster University, Lancaster LA1 4YT, United Kingdom
Venue:
Computer Speech and Language
Year:
2005

Citing 9

Identifying word correspondence in parallel texts
HLT '91 Proceedings of the workshop on Speech and Natural Language

Foundations of statistical natural language processing

REVERE: Support for Requirements Synthesis from Documents
Information Systems Frontiers

Stochastic inversion transduction grammars and bilingual parsing of parallel corpora
Computational Linguistics

Termight: identifying and translating technical terminology
ANLC '94 Proceedings of the fourth conference on Applied natural language processing

High-performance bilingual text alignment using statistical and dictionary information
ACL '96 Proceedings of the 34th annual meeting on Association for Computational Linguistics

Detecting novel compounds: the role of distributional evidence
EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1

Noun-noun compound machine translation: a feasibility study on shallow processing
MWE '03 Proceedings of the ACL 2003 workshop on Multiword expressions: analysis, acquisition and treatment - Volume 18

Multiword unit hybrid extraction
MWE '03 Proceedings of the ACL 2003 workshop on Multiword expressions: analysis, acquisition and treatment - Volume 18

Cited 7

A Hybrid Approach to Improve Bilingual Multiword Expression Extraction
PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining

Measuring MWE compositionality using semantic annotation
MWE '06 Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties

Improving statistical machine translation using domain bilingual multiword expressions
MWE '09 Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications

A hybrid framework to extract bilingual multiword expression from free text
Expert Systems with Applications: An International Journal

Extraction of multi-word expressions from small parallel corpora
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters

Identification of multi-word expressions by combining multiple linguistic information sources
EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing

Extraction of multi-word expressions from small parallel corpora
Natural Language Engineering

Abstract

Automatic extraction of multiword expressions (MWEs) presents a tough challenge for the NLP community and corpus linguistics. Although numerous knowledge-based symbolic approaches and statistically driven algorithms have been proposed, efficient MWE extraction remains an unsolved issue. In this paper, we evaluate the Lancaster UCREL Semantic Analysis System (henceforth USAS; Rayson, P., Archer, D., Piao, S., McEnery, T., 2004. The UCREL semantic analysis system. In: Proceedings of the LREC-04 Workshop, Beyond Named Entity Recognition: Semantic Labelling for NLP Tasks, Lisbon, Portugal, pp. 7-12) for MWE extraction, and explore the possibility of improving USAS by incorporating a statistical algorithm. Developed at Lancaster University, USAS automatically annotates English corpora with semantic category information. Employing a large-scale, semantically classified database of multiword expression templates, the system can also detect many MWEs and assign semantic field information to the MWEs it extracts. USAS therefore offers a unique tool for MWE extraction, allowing us both to extract and to semantically classify MWEs, but it can sometimes suffer from low recall. Consequently, we have compared USAS, which employs a symbolic approach, with a statistical tool based on collocational information, in order to determine the pros and cons of the two tools and, more importantly, to examine the possibility of improving MWE extraction by combining them. As we report in this paper, we have found a highly complementary relation between the tools: USAS missed many domain-specific MWEs (law/court terms in this case), while the statistical tool missed many commonly used MWEs that occur with low frequency (fewer than three occurrences in this case). Given this complementary relation, we propose that MWE coverage can be significantly increased by combining a lexicon-based symbolic approach with a collocation-based statistical approach.
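
The combination proposed in the abstract can be illustrated with a minimal sketch. This is not USAS and not the statistical tool evaluated in the paper: the function names, the toy lexicon, the sample sentence, and the use of pointwise mutual information (PMI) as the collocation score are all assumptions made for illustration. Only the frequency cut-off of three is taken from the abstract.

```python
import math
from collections import Counter


def lexicon_mwes(tokens, lexicon):
    """Symbolic step: keep bigrams listed in a hand-built MWE lexicon (hypothetical)."""
    return {bigram for bigram in zip(tokens, tokens[1:]) if bigram in lexicon}


def statistical_mwes(tokens, min_freq=3, min_pmi=1.0):
    """Statistical step: keep bigrams that pass a frequency cut-off and a PMI threshold.

    PMI is an assumed stand-in for the paper's collocation measure.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)  # used as an approximate sample size for both counts
    found = set()
    for (w1, w2), freq in bigrams.items():
        if freq < min_freq:
            continue  # low-frequency bigrams never reach the scoring step
        pmi = math.log2((freq * n) / (unigrams[w1] * unigrams[w2]))
        if pmi >= min_pmi:
            found.add((w1, w2))
    return found


# Toy data (assumed): a frequent legal term and a common MWE that occurs once.
text = (
    "the judge accepted the plea bargain of course but the plea bargain "
    "was late and the second plea bargain failed"
)
tokens = text.split()
lexicon = {("of", "course")}  # general-language entry; no legal terms

from_lexicon = lexicon_mwes(tokens, lexicon)
from_statistics = statistical_mwes(tokens)
combined = from_lexicon | from_statistics

print(sorted(combined))  # [('of', 'course'), ('plea', 'bargain')]
```

In this toy run the lexicon finds "of course" but has no entry for the legal term, while the statistical step finds "plea bargain" but discards "of course" because it occurs only once. Taking the union recovers both, which mirrors the complementary behaviour the abstract describes.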