Measuring similarity of word meaning in context with lexical substitutes and translations

Authors:
Diana McCarthy
Affiliations:
Lexical Computing Ltd., Brighton
Venue:
CICLing'11 Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part I
Year:
2011




Abstract

The representation of word meaning has been a topic of considerable debate in computational linguistics, particularly in the subfield of word sense disambiguation. While word senses enumerated in manually produced inventories have been very useful as a starting point for research, we know that the inventory should be selected to suit the application. Unfortunately, we have no clear understanding of how to determine the appropriateness of an inventory for monolingual applications, or for cross-lingual applications when the target language is unknown. In this paper we examine datasets that have paraphrases or translations as alternative annotations of lexical meaning on the same underlying corpus data. We demonstrate that overlap in lexical paraphrases (substitutes) between two uses of the same lemma correlates with overlap in translations. We compare the degree of overlap with annotations of usage similarity on the same data and show that the overlaps in paraphrases or translations also correlate with the similarity judgements. This bodes well for using any of these methods to evaluate unsupervised representations of lexical semantics. We do, however, find that the relationship breaks down for some lemmas; this lemma-by-lemma behaviour itself correlates with low inter-tagger agreement and with higher proportions of mid-range points in a usage similarity dataset. Lemmas that have many inter-related usages might be predicted from such data.
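
The comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not state which overlap measure or correlation statistic the paper uses, so the sketch assumes Jaccard overlap between annotation sets and Spearman rank correlation; both are stand-ins chosen for illustration. The usage identifiers, substitutes and translations are invented toy data, not taken from any of the datasets, and the function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two annotation sets (an assumed overlap measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented toy annotations for four usages of one lemma (illustration only).
substitutes = {
    "u1": {"shore", "riverside", "edge"},
    "u2": {"shore", "edge", "side"},
    "u3": {"lender", "institution"},
    "u4": {"institution", "lender", "financier"},
}
translations = {
    "u1": {"orilla", "ribera"},
    "u2": {"orilla", "margen"},
    "u3": {"banco"},
    "u4": {"banco", "entidad"},
}

# One overlap score per pair of usages, for each annotation type.
pairs = list(combinations(sorted(substitutes), 2))
sub_overlap = [jaccard(substitutes[p], substitutes[q]) for p, q in pairs]
trans_overlap = [jaccard(translations[p], translations[q]) for p, q in pairs]

rho = spearman(sub_overlap, trans_overlap)
print(f"Spearman correlation over {len(pairs)} usage pairs: {rho:.3f}")
```

The same pairwise scores could be correlated against graded usage-similarity judgements for each pair, or computed separately per lemma to look for lemmas where the relationship is weak.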