Accurately measuring document similarity is important for many text applications, such as document similarity search and document recommendation. Most traditional similarity measures rely only on the “bag of words” representation of documents and evaluate topical similarity well. In this paper, we propose the notion of document structural similarity, which refines the evaluation of document similarity by comparing the subtopic structures of documents. Three related factors (the optimal matching factor, the text order factor, and the disturbing factor) are proposed and combined to evaluate document structural similarity; the optimal matching factor plays the key role, and the other two factors rely on its results. Experimental results show that the optimal matching factor alone evaluates document topical similarity as well as or better than most popular measures. A user study shows that the overall measure combining all three factors is much better than the popular measures and other baseline structural similarity measures at identifying highly similar documents among topically similar ones.
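To make the idea of an optimal matching between subtopic structures more concrete, the following is a minimal sketch and not the paper's actual formulation, which the abstract does not spell out. It assumes each document has already been split into subtopic segments, scores segment pairs with plain term-frequency cosine similarity, finds the one-to-one matching with the highest total similarity by brute force, and normalizes by the larger segment count. The function names (`cosine`, `optimal_matching_score`), the similarity weighting, and the normalization are all illustrative assumptions. The text order and disturbing factors are not modeled here.

```python
from collections import Counter
from itertools import permutations
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency bags (assumed weighting)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def optimal_matching_score(doc_a, doc_b):
    """Return (score, pairs) for two documents given as lists of segment strings.

    pairs is a list of (index_in_doc_a, index_in_doc_b) for the best
    one-to-one matching. Brute force: only suitable for a handful of segments.
    """
    bags_a = [Counter(s.lower().split()) for s in doc_a]
    bags_b = [Counter(s.lower().split()) for s in doc_b]
    if not bags_a or not bags_b:
        return 0.0, []

    # Iterate over assignments of the shorter side into the longer side.
    swapped = len(bags_a) > len(bags_b)
    if swapped:
        bags_a, bags_b = bags_b, bags_a

    best_total, best_pairs = -1.0, []
    for perm in permutations(range(len(bags_b)), len(bags_a)):
        total = sum(cosine(bags_a[i], bags_b[j]) for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(perm))

    if swapped:
        best_pairs = [(j, i) for i, j in best_pairs]

    # Assumed normalization: unmatched segments on the longer side lower the score.
    return best_total / len(bags_b), best_pairs


if __name__ == "__main__":
    a = ["cats purr softly", "stocks fell sharply"]
    b = ["stocks fell sharply", "cats purr softly"]
    print(optimal_matching_score(a, b))
```

For documents with many segments, the brute-force search could be replaced by a polynomial-time assignment algorithm such as the Hungarian (Kuhn-Munkres) method. The matched pairs are returned alongside the score because, according to the abstract, the other two factors build on the matching result.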