Algorithms for clustering data
Selective Sampling Using the Query by Committee Algorithm
Machine Learning
Refining Initial Points for K-Means Clustering
ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning
Sampling Large Databases for Association Rules
VLDB '96 Proceedings of the 22nd International Conference on Very Large Data Bases
A new two-phase sampling based algorithm for discovering association rules
Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Efficient Progressive Sampling for Association Rules
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Data Sets
IEEE Transactions on Knowledge and Data Engineering
The learning-curve sampling method applied to model-based clustering
The Journal of Machine Learning Research
Automatic integration of Web search interfaces with WISE-Integrator
The VLDB Journal — The International Journal on Very Large Data Bases
Fast and Exact Out-of-Core K-Means Clustering
ICDM '04 Proceedings of the Fourth IEEE International Conference on Data Mining
Outlier detection by active learning
Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Outlier detection by sampling with accuracy guarantees
Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
A random walk approach to sampling hidden databases
Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data
Scalable multi-query optimization for exploratory queries over federated scientific databases
Proceedings of the VLDB Endowment
Mining search engine query logs via suggestion sampling
Proceedings of the VLDB Endowment
Optimization of multi-domain queries on the web
Proceedings of the VLDB Endowment
Learning to create data-integrating queries
Proceedings of the VLDB Endowment
A Randomized Approach for Approximating the Number of Frequent Sets
ICDM '08 Proceedings of the 2008 Eighth IEEE International Conference on Data Mining
Leveraging COUNT Information in Sampling Hidden Databases
ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Stability of k-means clustering
COLT'07 Proceedings of the 20th annual conference on Learning theory
Stratified Sampling for Data Mining on the Deep Web
ICDM '10 Proceedings of the 2010 IEEE International Conference on Data Mining
Active learning based frequent itemset mining over the deep web
ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Least squares quantization in PCM
IEEE Transactions on Information Theory
This paper focuses on the problem of clustering data from a {\em hidden} or deep web data source. A key characteristic of deep web data sources is that their data can only be accessed through the limited query interface they support. Because the underlying data set cannot be accessed directly, data mining must be performed on samples of the data, and these samples can only be obtained by querying the deep web database with specific inputs. We have developed a new stratified k-means clustering method that addresses this problem. In our approach, the space of input attributes of a deep web data source is stratified to capture the relationship between the input and the output attributes, and the space of output attributes is partitioned into sub-spaces. Three representative sampling methods are developed in this paper, with the goal of accurately estimating statistics, including proportions and centers, within the sub-spaces of the output attributes. We have evaluated our methods using two synthetic and two real datasets. Our comparison shows significant gains in estimation accuracy from both novel aspects of our work, i.e., stratification (5%-55%) and our representative sampling methods (up to 54%).
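The core idea of stratified estimation described above can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name, the fixed per-stratum sample size, and the toy strata below are all hypothetical, and the example estimates only a single weighted mean (a stand-in for the per-sub-space statistics such as proportions and centers), assuming each stratum of the input-attribute space can be sampled independently through the query interface.

```python
import random

def stratified_mean_estimate(strata, samples_per_stratum=50, seed=0):
    """Estimate the overall mean of an output attribute by sampling
    each stratum separately and combining the per-stratum sample
    means weighted by each stratum's share of the data.

    strata: list of (weight, population) pairs, where weight is the
    stratum's fraction of the data source and population stands in
    for the output values reachable through that stratum's queries.
    """
    rng = random.Random(seed)
    estimate = 0.0
    for weight, population in strata:
        # In a real deep web setting, each draw would be a query with
        # inputs chosen from this stratum; here we sample a toy list.
        sample = [rng.choice(population) for _ in range(samples_per_stratum)]
        estimate += weight * (sum(sample) / len(sample))
    return estimate

# Hypothetical source: two strata with different output distributions.
strata = [
    (0.7, [10.0] * 90 + [12.0] * 10),   # stratum A, weight 0.7
    (0.3, [50.0] * 50 + [54.0] * 50),   # stratum B, weight 0.3
]
print(stratified_mean_estimate(strata))
```

Because the two strata have very different output distributions, sampling them separately and reweighting yields a far more stable estimate than uniform sampling with the same total number of queries, which is the intuition behind stratifying the input space before clustering.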