Foundations of statistical natural language processing
Foundations of statistical natural language processing
Two-stage language models for information retrieval
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Multiword Expressions: A Pain in the Neck for NLP
CICLing '02 Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing
Retrieving collocations from text: Xtract
Computational Linguistics - Special issue on using large corpora: I
Automatic identification of non-compositional phrases
ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Combining the language model and inference network approaches to retrieval
Information Processing and Management: an International Journal - Special issue: Bayesian networks and information retrieval
Methods for the qualitative evaluation of lexical association measures
ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
An empirical model of multiword expression decomposability
MWE '03 Proceedings of the ACL 2003 workshop on Multiword expressions: analysis, acquisition and treatment - Volume 18
Automatic identification of non-compositional multi-word expressions using latent semantic analysis
MWE '06 Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties
MWE '07 Proceedings of the Workshop on a Broader Perspective on Multiword Expressions
Statistically-driven alignment-based multiword expression identification for technical domains
MWE '09 Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications
Verb noun construction MWE token supervised classification
MWE '09 Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications
Linguistic cues for distinguishing literal and non-literal usages
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
Developing a test collection for the evaluation of integrated search
ECIR'2010 Proceedings of the 32nd European conference on Advances in Information Retrieval
Unsupervised feature adaptation for cross-domain NLP with an application to compositionality grading
CICLing'13 Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part I
Hi-index | 0.00 |
Most NLP systems use tokenization as part of preprocessing. Generally, tokenizers are based on simple heuristics and do not recognize multi-word units (MWUs) like hot dog or black hole unless a precompiled list of MWUs is available. In this paper, we propose a new cascaded model for detecting MWUs of arbitrary length for tokenization, focusing on noun phrases in the physics domain. We adopt a classification approach because -- unlike other work on MWUs -- tokenization requires a completely automatic approach. We achieve an accuracy of 68% for recognizing non-compositional MWUs and show that our MWU recognizer improves retrieval performance when used as part of an information retrieval system.