Algorithms for clustering data
Selective Sampling Using the Query by Committee Algorithm
Machine Learning
Refining Initial Points for K-Means Clustering
ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning
Sampling Large Databases for Association Rules
VLDB '96 Proceedings of the 22nd International Conference on Very Large Data Bases
A new two-phase sampling based algorithm for discovering association rules
Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Efficient Progressive Sampling for Association Rules
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Data Sets
IEEE Transactions on Knowledge and Data Engineering
The learning-curve sampling method applied to model-based clustering
The Journal of Machine Learning Research
Automatic integration of Web search interfaces with WISE-Integrator
The VLDB Journal — The International Journal on Very Large Data Bases
Fast and Exact Out-of-Core K-Means Clustering
ICDM '04 Proceedings of the Fourth IEEE International Conference on Data Mining
Outlier detection by active learning
Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Outlier detection by sampling with accuracy guarantees
Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
A random walk approach to sampling hidden databases
Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data
Scalable multi-query optimization for exploratory queries over federated scientific databases
Proceedings of the VLDB Endowment
Mining search engine query logs via suggestion sampling
Proceedings of the VLDB Endowment
Optimization of multi-domain queries on the web
Proceedings of the VLDB Endowment
Learning to create data-integrating queries
Proceedings of the VLDB Endowment
A Randomized Approach for Approximating the Number of Frequent Sets
ICDM '08 Proceedings of the 2008 Eighth IEEE International Conference on Data Mining
Leveraging COUNT Information in Sampling Hidden Databases
ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Stability of k-means clustering
COLT'07 Proceedings of the 20th annual conference on Learning theory
Stratified Sampling for Data Mining on the Deep Web
ICDM '10 Proceedings of the 2010 IEEE International Conference on Data Mining
Active learning based frequent itemset mining over the deep web
ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Least squares quantization in PCM
IEEE Transactions on Information Theory
This paper focuses on the problem of clustering data from a {\em hidden} or deep web data source. A key characteristic of deep web data sources is that their data can only be accessed through the limited query interface they support. Because the underlying data set cannot be accessed directly, data mining must be performed on samples of the data, and these samples can only be obtained by querying the deep web database with specific inputs. We have developed a new stratified k-means clustering method that addresses this problem. In our approach, the space of input attributes of a deep web data source is stratified to capture the relationship between the input and the output attributes, and the space of output attributes is partitioned into sub-spaces. Three representative sampling methods are developed in this paper, with the goal of accurately estimating statistics, including proportions and centers, within the sub-spaces of the output attributes. We have evaluated our methods using two synthetic and two real datasets. Our comparison shows significant gains in estimation accuracy from both novel aspects of our work, i.e., stratification (5%-55%) and our representative sampling methods (up to 54%).
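The core idea of stratified estimation described above can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name, the fixed per-stratum sample size, and the toy strata below are all hypothetical, and the example estimates only a single weighted mean (a stand-in for the per-sub-space statistics such as proportions and centers), assuming each stratum of the input-attribute space can be sampled independently through the query interface.

```python
import random

def stratified_mean_estimate(strata, samples_per_stratum=50, seed=0):
    """Estimate the overall mean of an output attribute by sampling
    each stratum separately and combining the per-stratum sample
    means weighted by each stratum's share of the data.

    strata: list of (weight, population) pairs, where weight is the
    stratum's fraction of the data source and population stands in
    for the output values reachable through that stratum's queries.
    """
    rng = random.Random(seed)
    estimate = 0.0
    for weight, population in strata:
        # In a real deep web setting, each draw would be a query with
        # inputs chosen from this stratum; here we sample a toy list.
        sample = [rng.choice(population) for _ in range(samples_per_stratum)]
        estimate += weight * (sum(sample) / len(sample))
    return estimate

# Hypothetical source: two strata with different output distributions.
strata = [
    (0.7, [10.0] * 90 + [12.0] * 10),   # stratum A, weight 0.7
    (0.3, [50.0] * 50 + [54.0] * 50),   # stratum B, weight 0.3
]
print(stratified_mean_estimate(strata))
```

Because the two strata have very different output distributions, sampling them separately and reweighting yields a far more stable estimate than uniform sampling with the same total number of queries, which is the intuition behind stratifying the input space before clustering.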