Frequency estimates for statistical word similarity measures

Authors:
Egidio Terra; C. L. A. Clarke
Affiliations:
University of Waterloo; University of Waterloo
Venue:
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Year:
2003

Abstract

Statistical measures of word similarity have applications in many areas of natural language processing, such as language modeling and information retrieval. We report a comparative study of two methods for estimating the word co-occurrence frequencies required by word similarity measures. Our frequency estimates are generated from a terabyte-sized corpus of Web data, and we study the impact of corpus size on the effectiveness of the measures. We base the evaluation on one TOEFL question set and two practice question sets, each consisting of a number of multiple-choice questions seeking the best synonym for a given target word. For two of the question sets, a context for the target word is provided, and we examine a number of word similarity measures that exploit this context. Our best combination of similarity measure and frequency estimation method answers 6-8% more questions than the best results previously reported for the same question sets.
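
To make the evaluation setup concrete, here is a minimal sketch of how a co-occurrence-based similarity measure might pick the best synonym in a multiple-choice question. The abstract does not say which measures or frequency estimators were used, so pointwise mutual information (PMI) stands in here purely as an illustrative assumption. The function names `pmi` and `best_synonym`, the dictionary layout, and all counts are hypothetical toy values, not data from the paper.

```python
import math


def pmi(pair_count, count_a, count_b, total):
    """Pointwise mutual information (base 2) from raw counts.

    Returns -inf when a word or the pair was never observed,
    so unseen alternatives are ranked last.
    """
    if pair_count == 0 or count_a == 0 or count_b == 0:
        return float("-inf")
    return math.log2((pair_count * total) / (count_a * count_b))


def best_synonym(target, alternatives, word_counts, pair_counts, total):
    """Return the alternative with the highest PMI against the target word."""
    def score(alt):
        key = frozenset((target, alt))  # unordered pair, so order does not matter
        return pmi(
            pair_counts.get(key, 0),
            word_counts.get(target, 0),
            word_counts.get(alt, 0),
            total,
        )
    return max(alternatives, key=score)


# Toy counts (assumed, for illustration only): number of text windows
# containing each word, and each pair of words together.
word_counts = {"big": 100, "large": 80, "small": 90, "green": 50}
pair_counts = {
    frozenset(("big", "large")): 40,
    frozenset(("big", "small")): 20,
    frozenset(("big", "green")): 2,
}
total_windows = 10_000

answer = best_synonym(
    "big", ["large", "small", "green"], word_counts, pair_counts, total_windows
)
print(answer)
```

In this sketch the quality of the answer depends entirely on how `word_counts` and `pair_counts` are estimated, which is the part the paper compares across two estimation methods and across corpus sizes.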