In recent years, the Web has been rapidly "deepened" by the prevalence of online databases. On this deep Web, many sources are structured: they provide structured query interfaces and results. Organizing such structured sources into a domain hierarchy is one of the critical steps toward the integration of heterogeneous Web sources. We observe that, for structured Web sources, query schemas (i.e., attributes in query interfaces) are discriminative representatives of the sources and thus can be exploited for source characterization. In particular, by viewing query schemas as a type of categorical data, we abstract the problem of source organization into the clustering of categorical data. Our approach hypothesizes that "homogeneous sources" are characterized by the same hidden generative models for their schemas. To find clusters governed by such statistical distributions, we propose a new objective function, model-differentiation, which employs principled hypothesis testing to maximize statistical heterogeneity among clusters. Our evaluation over hundreds of real sources indicates that (1) the schema-based clustering accurately organizes sources by object domains (e.g., Books, Movies), and (2) on clustering Web query schemas, the model-differentiation function outperforms existing ones, such as likelihood, entropy, and context linkages, when used with the hierarchical agglomerative clustering algorithm.
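As a rough illustration of the idea above, the following minimal sketch treats each query schema as a bag of categorical attributes and runs hierarchical agglomerative clustering, merging at each step the two clusters whose attribute distributions look least different under a Pearson chi-square homogeneity statistic. It is not the paper's model-differentiation function: the abstract does not give its definition, so the statistic, its normalization by the total attribute count, the function names (`chi_square`, `dissimilarity`, `cluster_schemas`), and the toy schemas are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def chi_square(counts_a, counts_b):
    """Pearson chi-square statistic for homogeneity of two attribute-count tables."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    total = total_a + total_b
    stat = 0.0
    for attr in set(counts_a) | set(counts_b):
        column_total = counts_a[attr] + counts_b[attr]
        for observed, row_total in ((counts_a[attr], total_a), (counts_b[attr], total_b)):
            expected = row_total * column_total / total
            stat += (observed - expected) ** 2 / expected
    return stat


def dissimilarity(counts_a, counts_b):
    """Chi-square divided by the total count (an assumed normalization, in [0, 1])."""
    total = sum(counts_a.values()) + sum(counts_b.values())
    return chi_square(counts_a, counts_b) / total


def cluster_schemas(schemas, k):
    """Agglomerate non-empty schemas (name -> attribute list) into k clusters."""
    clusters = [([name], Counter(attrs)) for name, attrs in schemas.items()]
    while len(clusters) > k:
        # Merge the pair with the weakest evidence of coming from different models.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: dissimilarity(clusters[pair[0]][1], clusters[pair[1]][1]),
        )
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return [sorted(names) for names, _ in clusters]


# Hypothetical toy sources; the attribute names are invented for this example.
example_schemas = {
    "books_a": ["title", "author", "isbn"],
    "books_b": ["title", "author", "publisher"],
    "movies_a": ["title", "director", "actor"],
    "movies_b": ["director", "actor", "genre"],
}
print(sorted(cluster_schemas(example_schemas, 2)))
```

On these toy schemas the two book sources end up in one cluster and the two movie sources in the other, even though "title" appears in both domains. A real system would need a principled significance test and stopping criterion rather than a fixed `k`.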