Exploiting Dataset Similarity for Distributed Mining

Authors:
Srinivasan Parthasarathy;Mitsunori Ogihara
Affiliations:
-;-
Venue:
IPDPS '00 Proceedings of the 15 IPDPS 2000 Workshops on Parallel and Distributed Processing
Year:
2000

Citing 9
Cited 2

Similarity-based queries

PODS '95 Proceedings of the fourteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
The KDD process for extracting useful knowledge from volumes of data

Communications of the ACM
Fast discovery of association rules

Advances in knowledge discovery and data mining
Efficient Similarity Search In Sequence Databases

FODO '93 Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms
Online Generation of Association Rules

ICDE '98 Proceedings of the Fourteenth International Conference on Data Engineering
Proximity Search in Databases

VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Fast Algorithms for Mining Association Rules in Large Databases

VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
Sampling Large Databases for Association Rules

VLDB '96 Proceedings of the 22th International Conference on Very Large Data Bases
New Algorithms for Fast Discovery of Association Rules

New Algorithms for Fast Discovery of Association Rules

Association-based similarity testing and its applications

Intelligent Data Analysis
Cross domain similarity mining: research issues and potential applications including supporting research by analogy

ACM SIGKDD Explorations Newsletter

Quantified Score

Hi-index	0.00

Visualization

Abstract

The notion of similarity is an important one in data mining. It can be used to provide useful structural information on data as well as enable clustering. In this paper we present an elegant method for measuring the similarity between homogeneous datasets. The algorithm presented is efficient in storage and scale, has the ability to adjust to time constraints. and can provide the user with likely causes of similarity or dis-similarity. One potential application of our similarity measure is in the distributed data mining domain. Using the notion of similarity across databases as a distance metric one cangenerate clusters of similar datasets. Once similar datasets are clustered, each cluster can be independently mined to generate the appropriate rules for a given cluster. The similarity measure is evaluated on a dataset from the Census Bureau, and synthetic datasets from IBM.