Organizing structured web sources by query schemas: a clustering approach

Authors:
Bin He;Tao Tao;Kevin Chen-Chuan Chang
Affiliations:
University of Illinois at Urbana-Champaign, IL;University of Illinois at Urbana-Champaign, IL;University of Illinois at Urbana-Champaign, IL
Venue:
Proceedings of the thirteenth ACM international conference on Information and knowledge management
Year:
2004

Abstract

In recent years, the Web has been rapidly "deepened" by the prevalence of online databases. On this deep Web, many sources are structured: they provide structured query interfaces and results. Organizing such structured sources into a domain hierarchy is one of the critical steps toward the integration of heterogeneous Web sources. We observe that, for structured Web sources, query schemas (i.e., attributes in query interfaces) are discriminative representatives of the sources and thus can be exploited for source characterization. In particular, by viewing query schemas as a type of categorical data, we abstract the problem of source organization into the clustering of categorical data. Our approach hypothesizes that "homogeneous sources" are characterized by the same hidden generative models for their schemas. To find clusters governed by such statistical distributions, we propose a new objective function, model-differentiation, which employs principled hypothesis testing to maximize statistical heterogeneity among clusters. Our evaluation over hundreds of real sources indicates that (1) the schema-based clustering accurately organizes sources by object domains (e.g., Books, Movies), and (2) on clustering Web query schemas, the model-differentiation function outperforms existing ones, such as likelihood, entropy, and context linkages, when used with the hierarchical agglomerative clustering algorithm.
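
The clustering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the model-differentiation function, so the sketch assumes Pearson's chi-square statistic for homogeneity as a stand-in hypothesis test, and it uses a greedy agglomerative loop that repeatedly merges the two clusters whose attribute distributions differ least. The function names `chi_square` and `cluster_schemas` and the toy schemas are hypothetical, and every schema is assumed to be non-empty.

```python
from collections import Counter
from itertools import combinations


def chi_square(a: Counter, b: Counter) -> float:
    """Pearson chi-square statistic for homogeneity of two attribute-count
    vectors (an assumed stand-in for the paper's test, not its definition)."""
    total_a, total_b = sum(a.values()), sum(b.values())
    total = total_a + total_b
    stat = 0.0
    for attr in set(a) | set(b):
        column = a[attr] + b[attr]  # Counter returns 0 for missing attributes
        for observed, row_total in ((a[attr], total_a), (b[attr], total_b)):
            expected = row_total * column / total
            stat += (observed - expected) ** 2 / expected
    return stat


def cluster_schemas(schemas, k):
    """Greedy agglomerative clustering: start with one cluster per source and
    merge the statistically most similar pair until k clusters remain."""
    # Each cluster is (list of source indices, pooled attribute counts).
    clusters = [([i], Counter(schema)) for i, schema in enumerate(schemas)]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: chi_square(clusters[pair[0]][1], clusters[pair[1]][1]),
        )
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return [sorted(members) for members, _ in clusters]


# Toy query schemas (hypothetical): three book sources, three movie sources.
schemas = [
    ["title", "author", "isbn"],
    ["title", "author", "publisher"],
    ["author", "isbn", "title"],
    ["title", "director", "actor"],
    ["director", "actor", "genre"],
    ["title", "actor", "director", "genre"],
]

if __name__ == "__main__":
    print(sorted(cluster_schemas(schemas, k=2)))
```

Comparing raw chi-square values ignores differences in degrees of freedom and sample size between cluster pairs, so this sketch only conveys the general shape of test-driven merging; a faithful version would follow the normalization given in the paper itself.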