Learning a concept-based document similarity measure

Authors:
Lan Huang; David Milne; Eibe Frank; Ian H. Witten
Affiliations:
Department of Computer Science, University of Waikato, Private Bag 3105, Hamilton 3240, New Zealand (all authors)
Venue:
Journal of the American Society for Information Science and Technology
Year:
2012

Abstract

Document similarity measures are crucial components of many text-analysis tasks, including information retrieval, document classification, and document clustering. Conventional measures are brittle: They estimate the surface overlap between documents based on the words they mention and ignore deeper semantic connections. We propose a new measure that assesses similarity at both the lexical and semantic levels, and learns from human judgments how to combine them by using machine-learning techniques. Experiments show that the new measure produces values for documents that are more consistent with people's judgments than people are with each other. We also use it to classify and cluster large document sets covering different genres and topics, and find that it improves both classification and clustering performance. © 2012 Wiley Periodicals, Inc.
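
The idea of learning how to combine lexical and semantic evidence from human judgments can be illustrated with a minimal sketch. This is not the authors' implementation: the word-to-concept map `CONCEPTS`, the training ratings, and the helper names `features`, `fit_linear`, and `predict` are all hypothetical, and plain linear regression fitted by gradient descent stands in for whatever learning technique the paper actually uses.

```python
import math
from collections import Counter

# Hypothetical word-to-concept map; a real system would draw on a concept resource.
CONCEPTS = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "doctor": "medicine", "physician": "medicine", "nurse": "medicine",
}

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def features(doc1, doc2):
    """Return [lexical similarity, concept similarity] for a document pair."""
    w1, w2 = doc1.lower().split(), doc2.lower().split()
    lexical = cosine(Counter(w1), Counter(w2))
    c1 = Counter(CONCEPTS[w] for w in w1 if w in CONCEPTS)
    c2 = Counter(CONCEPTS[w] for w in w2 if w in CONCEPTS)
    return [lexical, cosine(c1, c2)]

def fit_linear(X, y, lr=0.1, epochs=5000):
    """Fit y ~ w.x + b by batch gradient descent on mean squared error."""
    w, b = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        errs = [sum(wi * xi for wi, xi in zip(w, x)) + b - t
                for x, t in zip(X, y)]
        w = [wi - lr * sum(e * x[j] for e, x in zip(errs, X)) / n
             for j, wi in enumerate(w)]
        b -= lr * sum(errs) / n
    return w, b

def predict(model, doc1, doc2):
    """Predicted similarity rating for a document pair."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, features(doc1, doc2))) + b

# Invented ratings on a 1-5 scale, for illustration only.
TRAINING = [
    ("the car is fast", "the car is fast", 5.0),
    ("the car is fast", "an automobile moves quickly", 4.0),
    ("the doctor is busy", "a physician works hard", 4.0),
    ("the car is fast", "the doctor is busy", 1.5),
    ("a truck arrived", "my nurse called", 1.0),
]

X = [features(a, b) for a, b, _ in TRAINING]
y = [rating for _, _, rating in TRAINING]
model = fit_linear(X, y)

related = predict(model, "my automobile stalled", "a truck arrived")
unrelated = predict(model, "my automobile stalled", "my nurse called")
print(f"related pair: {related:.2f}, unrelated pair: {unrelated:.2f}")
```

On this toy data, the pair that shares a concept but no words is scored above the pair that shares a word but no concept, which is the kind of semantic connection a purely word-based measure would miss.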