Towards the web of concepts: extracting concepts from large datasets

Authors:
Aditya Parameswaran;Hector Garcia-Molina;Anand Rajaraman
Affiliations:
Stanford University;Stanford University;Kosmix Corporation
Venue:
Proceedings of the VLDB Endowment
Year:
2010

Abstract

Concepts are sequences of words that represent real or imaginary entities or ideas that users are interested in. As a first step towards building a web of concepts that will form the backbone of the next generation of search technology, we develop a novel technique to extract concepts from large datasets. We approach the problem of concept extraction from corpora as a market-basket problem, adapting statistical measures of support and confidence. We evaluate our concept extraction algorithm on datasets containing data from a large number of users (e.g., the AOL query log data set), and we show that a high-precision concept set can be extracted.
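
The market-basket framing in the abstract can be illustrated with a small sketch. The abstract does not give the paper's actual definitions, so everything here is an assumption made for illustration: each query is treated as a basket, the support of a word sequence is the number of queries containing it, and confidence is the ratio of a sequence's support to the support of its prefix or suffix. The function names (`ngram_support`, `candidate_concepts`), the thresholds, and the toy queries are hypothetical and are not taken from the paper. Single words are skipped in this sketch because they have no shorter sub-sequence to compare against.

```python
from collections import Counter


def ngram_support(queries, max_len=3):
    """Count, for each word n-gram up to max_len, how many queries contain it."""
    support = Counter()
    for query in queries:
        words = query.lower().split()
        seen = set()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                seen.add(tuple(words[i:i + n]))
        # Each query contributes at most once per n-gram (basket semantics).
        support.update(seen)
    return support


def candidate_concepts(queries, min_support=2, min_confidence=0.5, max_len=3):
    """Return multi-word n-grams passing assumed support/confidence thresholds."""
    support = ngram_support(queries, max_len)
    concepts = []
    for gram, count in support.items():
        if len(gram) < 2 or count < min_support:
            continue
        # Assumed confidence: how often the prefix (or suffix) sub-sequence
        # appears as part of the full sequence.
        pre_conf = count / support[gram[:-1]]
        post_conf = count / support[gram[1:]]
        if min(pre_conf, post_conf) >= min_confidence:
            concepts.append((" ".join(gram), count, pre_conf, post_conf))
    # Highest support first, ties broken alphabetically.
    return sorted(concepts, key=lambda c: (-c[1], c[0]))


# Toy query log (made up for illustration).
toy_queries = [
    "new york hotels",
    "new york weather",
    "cheap hotels",
    "new york",
    "weather today",
]

for concept in candidate_concepts(toy_queries):
    print(concept)
```

On the toy log only "new york" survives: it occurs in three queries, and both "new" and "york" never occur without it. A sequence such as "york hotels" is rejected because it appears in a single query.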