Web-scale distributional similarity and entity set expansion

Authors:
Patrick Pantel;Eric Crestan;Arkady Borkovsky;Ana-Maria Popescu;Vishnu Vyas
Affiliations:
Yahoo! Labs, Sunnyvale, CA;Yahoo! Labs, Sunnyvale, CA;Yandex Labs, Burlingame, CA;Yahoo! Labs, Sunnyvale, CA;Yahoo! Labs, Sunnyvale, CA
Venue:
EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2
Year:
2009

Abstract

Computing the pairwise semantic similarity between all words on the Web is a computationally challenging task. Parallelization and optimizations are necessary. We propose a highly scalable implementation based on distributional similarity, implemented in the MapReduce framework and deployed over a 200 billion word crawl of the Web. The pairwise similarity between 500 million terms is computed in 50 hours using 200 quad-core nodes. We apply the learned similarity matrix to the task of automatic set expansion and present a large empirical study to quantify the effect on expansion performance of corpus size, corpus quality, seed composition and seed size. We make public an experimental testbed for set expansion analysis that includes a large collection of diverse entity sets extracted from Wikipedia.
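
As a rough illustration of the pipeline the abstract outlines, the single-machine sketch below mimics a map step (inverting term vectors into per-feature posting lists) and a reduce step (accumulating partial dot products for every pair of terms that share a feature), then uses the resulting similarities to expand a seed set. It is not the authors' implementation: the choice of cosine over raw feature weights, the function names `pairwise_cosine` and `expand`, the summed-similarity ranking, and the toy vectors are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def pairwise_cosine(vectors):
    """vectors: {term: {feature: weight}} -> {(a, b): cosine}, with a < b.

    Only pairs that share at least one feature are produced.
    """
    # "Map": invert term -> features into feature -> [(term, weight)].
    postings = defaultdict(list)
    for term, feats in vectors.items():
        for feat, weight in feats.items():
            postings[feat].append((term, weight))

    # "Reduce": each shared feature adds a partial dot product to the pair.
    dots = defaultdict(float)
    for plist in postings.values():
        for (a, wa), (b, wb) in combinations(sorted(plist), 2):
            dots[(a, b)] += wa * wb

    norms = {t: sqrt(sum(w * w for w in f.values())) for t, f in vectors.items()}
    return {
        (a, b): d / (norms[a] * norms[b])
        for (a, b), d in dots.items()
        if norms[a] and norms[b]  # skip all-zero vectors
    }


def expand(seeds, sims, k=5):
    """Rank non-seed terms by their summed similarity to the seeds (assumed scoring)."""
    scores = defaultdict(float)
    for (a, b), s in sims.items():
        if a in seeds and b not in seeds:
            scores[b] += s
        elif b in seeds and a not in seeds:
            scores[a] += s
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy context vectors (hypothetical data, for illustration only).
vectors = {
    "paris": {"capital of": 2.0, "flew to": 1.0},
    "rome": {"capital of": 2.0, "flew to": 1.0},
    "berlin": {"capital of": 1.0, "flew to": 2.0},
    "apple": {"ate an": 3.0},
}
sims = pairwise_cosine(vectors)
expansion = expand({"paris"}, sims)
```

Because pairs are generated only from shared posting lists, terms with no feature in common (here, "apple" and the city names) never produce a pair at all. The same property is what makes the computation natural to partition by feature across many workers.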