Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features

Authors:
Chikara Hashimoto; Daisuke Kawahara
Affiliations:
Yamagata University, Yonezawa, Yamagata, Japan; National Institute of Information and Communications Technology, Sorakugun, Kyoto, Japan
Venue:
EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Year:
2008

Abstract

Some phrases can be interpreted either idiomatically (figuratively) or literally depending on context, and the precise identification of idioms is indispensable for full-fledged natural language processing (NLP). To this end, we have constructed an idiom corpus for Japanese. This paper reports on the corpus and on the results of an idiom identification experiment using it. The corpus targets 146 ambiguous idioms and consists of 102,846 sentences, each annotated with a literal/idiom label. For idiom identification, we targeted 90 of the 146 idioms and adopted a word sense disambiguation (WSD) method using both common WSD features and idiom-specific features. As far as we know, the corpus and the experiment are the largest of their kind. We found that a standard supervised WSD method works well for idiom identification, achieving accuracies of 89.25% and 88.86% with and without idiom-specific features, respectively. The most effective idiom-specific feature is the one involving the adjacency of idiom constituents.
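
The combination of common WSD features with an idiom-specific adjacency feature can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `extract_features`, the `window` parameter, the feature names, and the romanized example tokens are assumptions made for illustration, and the abstract does not specify the actual feature set or classifier. The sketch uses surrounding context words as a stand-in for common WSD features and a single binary flag for whether the idiom constituents appear adjacent to one another.

```python
from typing import Dict, List


def extract_features(tokens: List[str],
                     constituents: List[str],
                     window: int = 2) -> Dict[str, int]:
    """Build a sparse feature dict for one candidate idiom occurrence.

    Hypothetical helper: returns an empty dict when the constituents
    do not all occur, in order, in the sentence.
    """
    # Locate each constituent, searching left to right.
    positions: List[int] = []
    start = 0
    for constituent in constituents:
        try:
            idx = tokens.index(constituent, start)
        except ValueError:
            return {}
        positions.append(idx)
        start = idx + 1

    features: Dict[str, int] = {}

    # Stand-in for common WSD features: words within `window` tokens
    # of the idiom span, excluding the constituents themselves.
    lo = max(0, positions[0] - window)
    hi = min(len(tokens), positions[-1] + window + 1)
    occupied = set(positions)
    for i in range(lo, hi):
        if i not in occupied:
            features["ctx=" + tokens[i]] = 1

    # Idiom-specific feature (assumed form): 1 if every constituent
    # directly follows the previous one, 0 if something intervenes.
    adjacent = all(b - a == 1 for a, b in zip(positions, positions[1:]))
    features["adjacent"] = 1 if adjacent else 0
    return features


# Illustrative tokens for the idiom "hone o oru" (literally "break a bone").
idiom = ["hone", "o", "oru"]
contiguous = extract_features(["kare", "wa", "hone", "o", "oru"], idiom)
interrupted = extract_features(["hone", "o", "hidoku", "oru"], idiom)
```

Feature dicts of this shape could be passed to any standard supervised classifier, with one literal/idiom label per sentence as in the corpus described above.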