Approximate searching for distributional similarity

Authors:
James Gorman;James R. Curran
Affiliations:
University of Sydney, NSW, Australia;University of Sydney, NSW, Australia
Venue:
DeepLA '05 Proceedings of the ACL-SIGLEX Workshop on Deep Lexical Acquisition
Year:
2005

Citing 8

Class-based n-gram models of natural language
Computational Linguistics

Searching in metric spaces
ACM Computing Surveys (CSUR)

Explorations in Automatic Thesaurus Discovery
Explorations in Automatic Thesaurus Discovery

Navigating massive data sets via local clustering
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

Scaling context space
ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics

Class-based probability estimation using a semantic hierarchy
NAACL '01 Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies

Robust, applied morphological generation
INLG '00 Proceedings of the first international conference on Natural language generation - Volume 14

Improvements in automatic thesaurus extraction
ULA '02 Proceedings of the ACL-02 workshop on Unsupervised lexical acquisition - Volume 9

Cited 2

Scaling distributional similarity to large corpora
ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics

Automatically extracting nominal mentions of events with a bootstrapped probabilistic classifier
COLING-ACL '06 Proceedings of the COLING/ACL on Main conference poster sessions


Abstract

Distributional similarity requires large volumes of data to accurately represent infrequent words. However, the nearest-neighbour approach to finding synonyms suffers from poor scalability. The Spatial Approximation Sample Hierarchy (SASH), proposed by Houle (2003b), is a data structure for approximate nearest-neighbour queries that balances the efficiency/approximation trade-off. We have integrated this into an existing distributional similarity system, tripling efficiency with a minor accuracy penalty.
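
To illustrate why the nearest-neighbour approach scales poorly, the following is a minimal sketch of an exhaustive search over context vectors. It is not the authors' system and it is not SASH: the cosine measure, the toy vectors, and the names `cosine` and `nearest_neighbours` are all assumptions made for illustration. Each query compares the target word against every other word in the vocabulary, which is the per-query cost that an approximate structure such as SASH is meant to reduce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse context vectors (dicts)."""
    dot = sum(weight * v[ctx] for ctx, weight in u.items() if ctx in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbours(target, vectors, k=2):
    """Exhaustive k-nearest-neighbour search: one comparison per word."""
    scored = [
        (cosine(vectors[target], vec), word)
        for word, vec in vectors.items()
        if word != target
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:k]]


# Hypothetical toy data: context counts invented for this example only.
vectors = {
    "car": {"drive": 4, "park": 2, "road": 3},
    "truck": {"drive": 3, "road": 2, "load": 2},
    "banana": {"eat": 5, "peel": 2},
    "apple": {"eat": 4, "peel": 1, "tree": 2},
}

print(nearest_neighbours("car", vectors, k=1))
print(nearest_neighbours("banana", vectors, k=1))
```

Because every query touches every vector, extracting neighbours for a whole vocabulary takes a number of comparisons that grows quadratically with vocabulary size, which is the scalability problem the abstract refers to.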