Scaling distributional similarity to large corpora

Authors:
James Gorman;James R. Curran
Affiliations:
University of Sydney, NSW, Australia;University of Sydney, NSW, Australia
Venue:
ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Year:
2006

Citing 16
Cited 19

Sparse distributed memory and related models

Associative neural memories
Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming

Journal of the ACM (JACM)
Approximate nearest neighbors: towards removing the curse of dimensionality

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Data structures and algorithms for nearest neighbor search in general metric spaces

SODA '93 Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms
Some approaches to best-match file searching

Communications of the ACM
Similarity estimation techniques from rounding algorithms

STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
Explorations in Automatic Thesaurus Discovery

Explorations in Automatic Thesaurus Discovery
Discovering word senses from text

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
On the Resemblance and Containment of Documents

SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Navigating massive data sets via local clustering

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Fast Approximate Similarity Search in Extremely High-Dimensional Data Sets

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Automatic bilingual lexicon acquisition using random indexing of parallel corpora

Natural Language Engineering
Co-occurrence Retrieval: A Flexible Framework for Lexical Distributional Similarity

Computational Linguistics
Improvements in automatic thesaurus extraction

ULA '02 Proceedings of the ACL-02 workshop on Unsupervised lexical acquisition - Volume 9
Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Approximate searching for distributional similarity

DeepLA '05 Proceedings of the ACL-SIGLEX Workshop on Deep Lexical Acquisition

An efficient algorithm for building a distributional thesaurus (and other Sketch Engine developments)

ACL '07 Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions
Superior and efficient fully unsupervised pattern-based concept acquisition using an unsupervised parser

CoNLL '09 Proceedings of the Thirteenth Conference on Computational Natural Language Learning
Representing words as regions in vector space

CoNLL '09 Proceedings of the Thirteenth Conference on Computational Natural Language Learning
Translation and extension of concepts across languages

EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
Random indexing using statistical weight functions

EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing
Knowledge derived from wikipedia for computing semantic relatedness

Journal of Artificial Intelligence Research
The topology of synonymy and homonymy networks

CACLA '07 Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition
Comparing Different Properties Involved in Word Similarity Extraction

EPIA '09 Proceedings of the 14th Portuguese Conference on Artificial Intelligence: Progress in Artificial Intelligence
Geo-mining: discovery of road and transport networks using directional patterns

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1
Enhancement of lexical concepts using cross-lingual web mining

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2
Web-scale distributional similarity and entity set expansion

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2
Exemplar-based models for word meaning in context

ACLShort '10 Proceedings of the ACL 2010 Conference Short Papers
Distributional similarity vs. PU learning for entity set expansion

ACLShort '10 Proceedings of the ACL 2010 Conference Short Papers
From frequency to meaning: vector space models of semantics

Journal of Artificial Intelligence Research
A mixture model with sharing for lexical semantics

EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Automatically acquiring a semantic network of related concepts

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Evaluating a semantic network automatically constructed from lexical co-occurrence on a word sense disambiguation task

CoNLL '11 Proceedings of the Fifteenth Conference on Computational Natural Language Learning
Cross-cutting models of lexical semantics

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
SemEval-2012 task 4: evaluating Chinese word similarity

SemEval '12 Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation

Abstract

Accurately representing synonymy using distributional similarity requires large volumes of data to reliably represent infrequent words. However, the naïve nearest-neighbour approach to comparing context vectors extracted from large corpora scales poorly (O(n^2) in the vocabulary size). In this paper, we compare several existing approaches to approximating the nearest-neighbour search for distributional similarity. We investigate the trade-off between efficiency and accuracy, and find that SASH (Houle and Sakuma, 2005) provides the best balance.
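
To make the quadratic cost concrete, the following is a minimal sketch of the naïve nearest-neighbour baseline described in the abstract. It is not the authors' implementation: the cosine measure, the function names, and the toy context vectors are all assumptions chosen for illustration. It only shows why comparing every pair of vocabulary items needs n(n-1)/2 similarity computations.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse context vectors stored as dicts.

    Cosine is an assumption made for this sketch; the abstract does not
    say which similarity measure is used.
    """
    dot = sum(weight * v[ctx] for ctx, weight in u.items() if ctx in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def naive_nearest_neighbours(vectors, k=1):
    """Brute-force search: score every unordered pair of words once.

    Returns the k most similar words for each word, plus the number of
    pairwise comparisons performed (n * (n - 1) / 2 for n words).
    """
    scores = {word: [] for word in vectors}
    comparisons = 0
    for a, b in combinations(vectors, 2):
        sim = cosine(vectors[a], vectors[b])
        comparisons += 1
        scores[a].append((sim, b))
        scores[b].append((sim, a))
    neighbours = {
        word: [other for _, other in
               sorted(cands, key=lambda pair: pair[0], reverse=True)[:k]]
        for word, cands in scores.items()
    }
    return neighbours, comparisons


# Hypothetical toy context vectors (word -> {context: weight}).
VECTORS = {
    "car": {"drive": 3.0, "road": 2.0, "engine": 1.0},
    "automobile": {"drive": 2.0, "road": 1.0, "engine": 2.0},
    "banana": {"eat": 3.0, "peel": 2.0},
    "apple": {"eat": 2.0, "peel": 1.0, "tree": 1.0},
}

neighbours, comparisons = naive_nearest_neighbours(VECTORS, k=1)
print(neighbours)
print(comparisons)
```

With four words the loop runs six comparisons; doubling the vocabulary roughly quadruples that count, which is the scaling problem that approximate methods such as SASH aim to avoid.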