Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering

Authors:
Deepak Ravichandran;Patrick Pantel;Eduard Hovy
Affiliations:
University of Southern California, Marina del Rey, CA;University of Southern California, Marina del Rey, CA;University of Southern California, Marina del Rey, CA
Venue:
ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Year:
2005

Citing 13
Cited 51

Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming

Journal of the ACM (JACM)
Approximate nearest neighbors: towards removing the curse of dimensionality

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Similarity estimation techniques from rounding algorithms

STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
Introduction to Modern Information Retrieval

Introduction to Modern Information Retrieval
Discovering word senses from text

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
On the Resemblance and Containment of Documents

SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Automatic retrieval and clustering of similar words

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 2
Word association norms, mutual information, and lexicography

ACL '89 Proceedings of the 27th annual meeting on Association for Computational Linguistics
Noun classification from predicate-argument structures

ACL '90 Proceedings of the 28th annual meeting on Association for Computational Linguistics
PRINCIPAR: an efficient, broad-coverage, principle-based parser

COLING '94 Proceedings of the 15th conference on Computational linguistics - Volume 1
Improved robustness of signature-based near-replica detection via lexicon randomization

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Mitigating the paucity-of-data problem: exploring the effect of training corpus size on classifier performance for natural language processing

HLT '01 Proceedings of the first international conference on Human language technology research
Scaling context space

ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics

Very sparse random projections

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Scaling distributional similarity to large corpora

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Semantic taxonomy induction from heterogenous evidence

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
A large-scale exploration of effective global features for a joint entity detection and tracking model

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Using sketches to estimate associations

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Googleology is Bad Science

Computational Linguistics
A topology-aware hierarchical structured overlay network based on locality sensitive hashing scheme

Proceedings of the second workshop on Use of P2P, GRID and agents for the development of content networks
Very sparse stable random projections for dimension reduction in lα (0

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
A Sketch Algorithm for Estimating Two-Way and Multi-Way Associations

Computational Linguistics
Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions

Communications of the ACM - 50th anniversary issue: 1958 - 2008
Locality sensitive hash functions based on concomitant rank order statistics

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Client-Friendly Classification over Random Hyperplane Hashes

ECML PKDD '08 Proceedings of the European conference on Machine Learning and Knowledge Discovery in Databases - Part II
An efficient algorithm for building a distributional thesaurus (and other Sketch Engine developments)

ACL '07 Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions
Acquiring paraphrases from text corpora

Proceedings of the fifth international conference on Knowledge capture
Streaming for large scale NLP: language modeling

NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Phonetic models for generating spelling variants

IJCAI'07 Proceedings of the 20th international joint conference on Artifical intelligence
Web-scale distributional similarity and entity set expansion

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2
TACO: tunable approximate computation of outliers in wireless sensor networks

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Similarity search and locality sensitive hashing using ternary content addressable memories

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Random hyperplane projection using derived dimensions

Proceedings of the Ninth ACM International Workshop on Data Engineering for Wireless and Mobile Access
Streaming first story detection with application to Twitter

HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Online generation of locality sensitive hash signatures

ACLShort '10 Proceedings of the ACL 2010 Conference Short Papers
From frequency to meaning: vector space models of semantics

Journal of Artificial Intelligence Research
Sketching techniques for large scale NLP

WAC-6 '10 Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop
Sketch techniques for scaling distributional similarity to the web

GEMS '10 Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics
Simple and efficient algorithm for approximate dictionary matching

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Generating phrasal and sentential paraphrases: A survey of data-driven methods

Computational Linguistics
Resources for Turkish morphological processing

Language Resources and Evaluation
Efficient online locality sensitive hashing via reservoir counting

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
No free lunch: brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Improving random projections using marginal information

COLT'06 Proceedings of the 19th annual conference on Learning Theory
Generating semantic orientation lexicon using large data and thesaurus

WASSA '11 Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis
Distributed similarity estimation using derived dimensions

The VLDB Journal — The International Journal on Very Large Data Bases
A minimally supervised approach for detecting and ranking document translation pairs

WMT '11 Proceedings of the Sixth Workshop on Statistical Machine Translation
Bayesian locality sensitive hashing for fast similarity search

Proceedings of the VLDB Endowment
Reranking bilingually extracted paraphrases using monolingual distributional similarity

GEMS '11 Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics
Approximate scalable bounded space sketch for large data NLP

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Unsupervised semantic role induction with graph partitioning

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Evaluation method for automated wordnet expansion

SIIS'11 Proceedings of the 2011 international conference on Security and Intelligent Information Systems
Space efficiencies in discourse modeling via conditional random sampling

NAACL HLT '12 Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Monolingual distributional similarity for text-to-text generation

SemEval '12 Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation
Coreference semantics from web features

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
Streaming analysis of discourse participants

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Fast large-scale approximate graph construction for NLP

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Sketch algorithms for estimating point queries in NLP

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Annotated Gigaword

AKBC-WEKEX '12 Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction
Content-based crowd retrieval on the real-time web

Proceedings of the 21st ACM international conference on Information and knowledge management
KORE: keyphrase overlap relatedness for entity disambiguation

Proceedings of the 21st ACM international conference on Information and knowledge management
Efficient Nearest-Neighbor Search in the Probability Simplex

Proceedings of the 2013 Conference on the Theory of Information Retrieval
In-network approximate computation of outliers with quality guarantees

Information Systems
Optimal Lower Bounds for Locality-Sensitive Hashing (Except When q is Tiny)

ACM Transactions on Computation Theory (TOCT)

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this paper, we explore the power of randomized algorithm to address the challenge of working with very large amounts of data. We apply these algorithms to generate noun similarity lists from 70 million pages. We reduce the running time from quadratic to practically linear in the number of elements to be computed.