A novel document similarity measure based on earth mover's distance

Authors:
Xiaojun Wan
Affiliations:
Institute of Computer Science and Technology, Peking University, Beijing 100871, China
Venue:
Information Sciences: an International Journal
Year:
2007

Abstract

In this paper we propose a novel measure based on the earth mover's distance (EMD) that evaluates document similarity by allowing many-to-many matching between subtopics. Each document is first decomposed into a set of subtopics; the EMD is then used to evaluate the similarity between the two documents' subtopic sets by solving the transportation problem. The proposed measure improves on the previous optimal matching (OM) based measure, which allows only one-to-one matching between subtopics. Experiments on the TDT3 dataset comparing existing similarity measures show that the EMD-based measure outperforms the OM-based measure and all other measures evaluated. In addition to the TextTiling algorithm, a sentence clustering algorithm is adopted for document decomposition, and the experimental results show that the proposed EMD-based measure does not rely on the document decomposition algorithm and is thus more robust than the OM-based measure.
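
The idea of comparing two subtopic sets by solving a transportation problem can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the subtopic weights, the ground distance, or how the EMD is turned into a similarity, so the code below assumes (hypothetically) that each subtopic is a non-negative term vector with a weight, that the ground distance is one minus cosine similarity, and that similarity is one minus the EMD. The function names `cosine_distance_matrix`, `emd`, and `emd_similarity` are illustrative, and the linear program is solved with SciPy's `linprog`.

```python
import numpy as np
from scipy.optimize import linprog


def cosine_distance_matrix(vecs_a, vecs_b):
    """Pairwise (1 - cosine similarity) between two sets of subtopic vectors.

    Assumption: vectors are non-negative term weights, so distances lie in [0, 1].
    """
    xa = np.asarray(vecs_a, dtype=float)
    xb = np.asarray(vecs_b, dtype=float)
    na = np.linalg.norm(xa, axis=1, keepdims=True)   # shape (m, 1)
    nb = np.linalg.norm(xb, axis=1, keepdims=True)   # shape (n, 1)
    sim = (xa @ xb.T) / np.maximum(na * nb.T, 1e-12)
    return 1.0 - np.clip(sim, 0.0, 1.0)


def emd(weights_a, weights_b, dist):
    """Earth mover's distance, posed as a transportation linear program."""
    wa = np.asarray(weights_a, dtype=float)
    wb = np.asarray(weights_b, dtype=float)
    d = np.asarray(dist, dtype=float)
    m, n = d.shape
    total = min(wa.sum(), wb.sum())
    if total <= 0:
        raise ValueError("weights must have positive total mass")

    # Flow variables f[i, j] >= 0 are flattened row-major: index = i * n + j.
    # Each subtopic i of document A ships at most wa[i].
    rows = np.zeros((m, m * n))
    for i in range(m):
        rows[i, i * n:(i + 1) * n] = 1.0
    # Each subtopic j of document B receives at most wb[j].
    cols = np.zeros((n, m * n))
    for j in range(n):
        cols[j, j::n] = 1.0

    res = linprog(
        c=d.ravel(),
        A_ub=np.vstack([rows, cols]),
        b_ub=np.concatenate([wa, wb]),
        A_eq=np.ones((1, m * n)),   # total flow equals the smaller total weight
        b_eq=[total],
        bounds=(0, None),
        method="highs",
    )
    if not res.success:
        raise RuntimeError(res.message)
    return res.fun / total          # normalize the cost by the total flow


def emd_similarity(vecs_a, weights_a, vecs_b, weights_b):
    """Similarity in [0, 1], assuming similarity = 1 - EMD (an illustrative choice)."""
    dist = cosine_distance_matrix(vecs_a, vecs_b)
    return 1.0 - emd(weights_a, weights_b, dist)
```

Under these assumptions, a document with two equally weighted, orthogonal subtopics compared against a document with a single subtopic matching the first one, `emd_similarity([[1, 0], [0, 1]], [0.5, 0.5], [[1, 0]], [1.0])`, should come out at about 0.5: both subtopics of the first document send flow to the single subtopic of the second, which a one-to-one matching cannot express.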