Improving the accuracy of similarity measures by using link information

Authors:
Tijn Witsenburg; Hendrik Blockeel
Affiliations:
Leiden Institute of Advanced Computer Science, Universiteit Leiden, Leiden, The Netherlands; Leiden Institute of Advanced Computer Science, Universiteit Leiden, Leiden, The Netherlands and Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium
Venue:
ISMIS'11 Proceedings of the 19th international conference on Foundations of intelligent systems
Year:
2011


Abstract

The notion of similarity is crucial to a number of tasks and methods in machine learning and data mining, including clustering and nearest neighbor classification. In many contexts, a natural (but not necessarily optimal) similarity measure is defined on the objects to be clustered or classified, and there is also information about which objects are linked together. This raises the question of to what extent the information contained in the links can be used to obtain a more relevant similarity measure. Earlier research has shown empirically that including such link information yields more accurate results, but it did not analyze why this is the case. In this paper we provide such an analysis. We relate the extent to which improved results can be obtained to the notions of homophily in the network, transitivity of similarity, and content variability of objects. We explore this relationship using randomly generated datasets in which we vary the amount of homophily and content variability. The results show that, within a fairly wide range of values for these parameters, including link information in the similarity measure indeed yields improved results compared to computing the similarity of objects directly from their content.
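
The abstract does not state the exact form of the link-aware similarity measure, so the following is only a minimal sketch of the general idea, under stated assumptions: objects are numeric content vectors, content similarity is cosine similarity, and link information is folded in by averaging the content similarity between one object and the linked neighbors of the other. The names `cosine`, `neighbor_similarity`, `hybrid_similarity`, and the mixing weight `alpha` are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)


def neighbor_similarity(i, j, content, links):
    """Mean content similarity between object i and the linked neighbors of j.

    Falls back to the direct content similarity when j has no neighbors.
    (Illustrative definition, not the paper's.)
    """
    neighbors = links.get(j, [])
    if not neighbors:
        return cosine(content[i], content[j])
    total = sum(cosine(content[i], content[n]) for n in neighbors)
    return total / len(neighbors)


def hybrid_similarity(i, j, content, links, alpha=0.5):
    """Blend direct content similarity with a symmetric neighbor-based term.

    alpha is an assumed mixing weight: 1.0 uses content only,
    0.0 uses link-derived similarity only.
    """
    direct = cosine(content[i], content[j])
    via_links = 0.5 * (
        neighbor_similarity(i, j, content, links)
        + neighbor_similarity(j, i, content, links)
    )
    return alpha * direct + (1 - alpha) * via_links


# Toy data: "b" has atypical content (content variability), but its
# neighbor "d" looks like "a", so the links carry the missing signal.
content = {"a": [1, 0], "b": [0, 1], "c": [1, 0], "d": [1, 0]}
links = {"a": ["c"], "b": ["d"], "c": ["a"], "d": ["b"]}

direct_ab = cosine(content["a"], content["b"])
hybrid_ab = hybrid_similarity("a", "b", content, links)
print(direct_ab, hybrid_ab)
```

In this toy case the direct content similarity between `a` and `b` is zero, while the blended score is positive because one of the two neighbor-based terms picks up the resemblance between `a` and `d`. Whether such a gain appears in practice depends, as the abstract notes, on the amount of homophily and content variability in the data.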