Beyond topical similarity: a structural similarity measure for retrieving highly similar documents

Authors:
Xiaojun Wan
Affiliations:
Peking University, Institute of Computer Science and Technology, 100871, Beijing, China
Venue:
Knowledge and Information Systems
Year:
2008


Abstract

Accurately measuring document similarity is important for many text applications, such as document similarity search and document recommendation. Most traditional similarity measures rely only on the “bag of words” representation of documents and evaluate topical similarity well. In this paper, we propose the notion of document structural similarity, which further evaluates document similarity by comparing the subtopic structures of documents. Three related factors (the optimal matching factor, the text order factor, and the disturbing factor) are proposed and combined to evaluate document structural similarity; the optimal matching factor plays the key role, and the other two factors rely on its results. Experimental results show that the optimal matching factor performs well at evaluating document topical similarity, matching or outperforming most popular measures. A user study shows that the proposed overall measure, which combines all three factors, can pick out highly similar documents from among topically similar ones, and does so much better than the popular measures and other baseline structural similarity measures.
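
The abstract does not say how subtopic blocks are obtained or how the optimal matching is computed, so the following is only an illustrative sketch of the general idea behind an optimal matching factor, not the paper's method. It assumes each document is already split into subtopic blocks, scores block pairs with bag-of-words cosine similarity, finds the best one-to-one assignment by brute force, and normalizes by the larger block count. The names `cosine` and `optimal_matching_score` and the normalization are hypothetical choices; the text order factor and the disturbing factor are not modeled.

```python
from collections import Counter
from itertools import permutations
from math import sqrt
from typing import Sequence


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def optimal_matching_score(doc_a: Sequence[str], doc_b: Sequence[str]) -> float:
    """Best one-to-one matching of subtopic blocks between two documents.

    Assumption (not from the paper): the summed similarity of the best
    matching is divided by the larger number of blocks, so unmatched
    blocks lower the score.
    """
    if not doc_a or not doc_b:
        return 0.0
    bags_a = [Counter(block.lower().split()) for block in doc_a]
    bags_b = [Counter(block.lower().split()) for block in doc_b]
    # Keep the shorter document on the left so each of its blocks gets a partner.
    if len(bags_a) > len(bags_b):
        bags_a, bags_b = bags_b, bags_a
    best = 0.0
    # Brute force over all assignments: only practical for a handful of blocks.
    for perm in permutations(range(len(bags_b)), len(bags_a)):
        total = sum(cosine(bags_a[i], bags_b[j]) for i, j in enumerate(perm))
        best = max(best, total)
    return best / len(bags_b)


# Same blocks in a different order: the matching still pairs them up.
doc_a = ["cats chase mice", "stock prices fell"]
doc_b = ["stock prices fell", "cats chase mice"]
print(optimal_matching_score(doc_a, doc_b))
```

Because the matching ignores block order, the two reordered documents above score close to 1.0 here; telling them apart is the kind of distinction an additional order-sensitive factor would have to supply. For documents with many blocks, a polynomial-time assignment algorithm would replace the brute-force loop.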