Document similarity based on concept tree distance

Authors:
Praveen Lakkaraju;Susan Gauch;Mirco Speretta
Affiliations:
University of Kansas, Lawrence, KS, USA;University of Arkansas, Fayetteville, AR, USA;University of Arkansas, Fayetteville, AR, USA
Venue:
Proceedings of the nineteenth ACM conference on Hypertext and hypermedia
Year:
2008

Abstract

The Web is quickly moving from the era of search engines to the era of discovery engines. Whereas search engines help you find information you are looking for, discovery engines help you find things that you never knew existed. A common discovery technique is to automatically identify and display objects similar to ones previously viewed by the user. Core to this approach is an accurate method to identify similar documents. In this paper, we present a new approach to identifying similar documents based on a conceptual tree-similarity measure. We represent each document as a concept tree using the concept associations obtained from a classifier. Then, we employ a tree-similarity measure based on a tree edit distance to compute similarities between concept trees. Experiments on documents from the CiteSeer collection showed that our algorithm performed significantly better than document similarity based on the traditional vector space model.
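
The idea of comparing documents through a tree edit distance over their concept trees can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the cost model or how classifier weights enter the trees, so the sketch assumes unit costs for insert, delete, and relabel, treats trees as ordered, and uses a plain memoized forest recursion that is only practical for small trees. The concept labels, the helper names (`tree`, `tree_edit_distance`, `similarity`), and the normalization by total node count are all hypothetical choices made for illustration.

```python
from functools import lru_cache


def tree(label, *children):
    """Build a concept tree as a hashable (label, children) tuple."""
    return (label, tuple(children))


def size(t):
    """Number of nodes in a tree."""
    _, kids = t
    return 1 + sum(size(k) for k in kids)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests (tuples of trees).

    Removing the rightmost root of a forest promotes its children in place.
    """
    if not f1 and not f2:
        return 0
    if not f1:
        _, kids = f2[-1]
        return forest_distance(f1, f2[:-1] + kids) + 1  # insert a node
    if not f2:
        _, kids = f1[-1]
        return forest_distance(f1[:-1] + kids, f2) + 1  # delete a node
    (l1, k1), (l2, k2) = f1[-1], f2[-1]
    delete = forest_distance(f1[:-1] + k1, f2) + 1
    insert = forest_distance(f1, f2[:-1] + k2) + 1
    # Map the two rightmost roots onto each other (relabel if they differ),
    # then solve their child forests and the remaining forests separately.
    match = (forest_distance(k1, k2)
             + forest_distance(f1[:-1], f2[:-1])
             + (0 if l1 == l2 else 1))
    return min(delete, insert, match)


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


def similarity(t1, t2):
    """Assumed normalization: 1 means identical, 0 means nothing is shared."""
    return 1 - tree_edit_distance(t1, t2) / (size(t1) + size(t2))


# Hypothetical concept trees, as a classifier might assign them to documents.
doc_a = tree("Root", tree("Computer Science",
                          tree("Information Retrieval"), tree("Databases")))
doc_b = tree("Root", tree("Computer Science",
                          tree("Information Retrieval"), tree("Machine Learning")))
doc_c = tree("Root", tree("Biology", tree("Genetics")))

print(similarity(doc_a, doc_b))  # the trees differ by one relabeled leaf
print(similarity(doc_a, doc_c))  # the trees share only the root
```

In this toy setting, documents whose concept trees overlap in most branches score close to 1, while documents classified into unrelated branches score much lower. A production system would more likely use a polynomial-time algorithm such as Zhang and Shasha's and a cost model that reflects concept weights.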