Using Link-Based Content Analysis to Measure Document Similarity Effectively

Authors:
Pei Li;Zhixu Li;Hongyan Liu;Jun He;Xiaoyong Du
Affiliations:
Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China and School of Information, Renmin University of China, Beijing, China;Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China and School of Information, Renmin University of China, Beijing, China;Department of Management Science and Engineering, Tsinghua University, Beijing, China;Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China and School of Information, Renmin University of China, Beijing, China;Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China and School of Information, Renmin University of China, Beijing, China
Venue:
APWeb/WAIM '09 Proceedings of the Joint International Conferences on Advances in Data and Web Management
Year:
2009

Citing 14
Cited 1

Term-weighting approaches in automatic text retrieval

Information Processing and Management: an International Journal
A vector space model for automatic indexing

Communications of the ACM
Probabilistic combination of content and links

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
SimRank: a measure of structural-context similarity

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Phrase-based Document Similarity Based on an Index Graph Model

ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
An information-theoretic measure for document similarity

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Combining link-based and content-based methods for web document classification

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
A personalized collaborative digital library environment: a model and an application

Information Processing and Management: an International Journal - Special issue: An Asian digital libraries perspective
Scaling link-based similarity search

WWW '05 Proceedings of the 14th international conference on World Wide Web
SimFusion: measuring similarity using unified relationship matrix

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
The SMART Retrieval System—Experiments in Automatic Document Processing

The SMART Retrieval System—Experiments in Automatic Document Processing
Link mining: a survey

ACM SIGKDD Explorations Newsletter
LinkClus: efficient clustering via heterogeneous semantic links

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Combining content and link for classification using matrix factorization

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval

A Neighborhood Search Method for Link-Based Tag Clustering

ADMA '09 Proceedings of the 5th International Conference on Advanced Data Mining and Applications

Quantified Score

Hi-index	0.02

Visualization

Abstract

Along with a massive amount of information being placed online, it is a challenge to exploit the internal and external information of documents when assessing similarity between them. A variety of approaches have been proposed to model the document similarity based on different foundations, but usually they are not applicable for combining internal and external information. In this paper, we introduce a link-based method into content analysis, which is based on random walk on graphs. By defining similarity as the meeting probability of two random surfers, we propose a computational model for content analysis, which can also be integrated with external information of documents. Empirical study shows that our method achieves good accuracy, acceptable performance and fast convergent rate in multi-relational document similarity measuring.