QA-Pagelet: Data Preparation Techniques for Large-Scale Data Analysis of the Deep Web

Authors:
James Caverlee;Ling Liu
Affiliations:
IEEE;IEEE Computer Society
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2005

Citing 21
Cited 3

NoDoSE—a tool for semi-automatically extracting structured and semistructured data from text documents

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
Web document clustering: a feasibility demonstration

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Improved algorithms for topic distillation in a hyperlinked environment

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Trawling the Web for emerging cyber-communities

WWW '99 Proceedings of the eighth international conference on World Wide Web
Recognizing structure in Web pages using similarity queries

AAAI '99/IAAI '99 Proceedings of the sixteenth national conference on Artificial intelligence and the eleventh Innovative applications of artificial intelligence conference innovative applications of artificial intelligence
Authoritative sources in a hyperlinked environment

Journal of the ACM (JACM)
Agglomerative clustering of a search engine query log

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Concept decompositions for large sparse text data using clustering

Machine Learning
Template detection via data mining and its applications

Proceedings of the 11th international conference on World Wide Web
Modern Information Retrieval

Modern Information Retrieval
Clustering validity checking methods: part II

ACM SIGMOD Record
QProber: A system for automatic classification of hidden-Web databases

ACM Transactions on Information Systems (TOIS)
Crawling the Hidden Web

Proceedings of the 27th International Conference on Very Large Data Bases
XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources

ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Extracting structured data from Web pages

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
A Probabilistic Approach to Metasearching with Adaptive Probing

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Probe, Cluster, and Discover: Focused Extraction of QA-Pagelets from the Deep Web

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
An interactive clustering-based approach to integrating source query interfaces on the deep Web

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Understanding Web query interfaces: best-effort parsing with hidden syntax

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Instance-based schema matching for web databases by domain-specific query probing

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30

A tool for link-based web page classification

CAEPIA'11 Proceedings of the 14th international conference on Advances in artificial intelligence: spanish association for artificial intelligence
Searching and browsing Linked Data with SWSE: The Semantic Web Search Engine

Web Semantics: Science, Services and Agents on the World Wide Web
Intelligent web navigation

FDIA'09 Proceedings of the Third BCS-IRSG conference on Future Directions in Information Access

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper presents the QA-Pagelet as a fundamental data preparation technique for large-scale data analysis of the Deep Web. To support QA-Pagelet extraction, we present the Thor framework for sampling, locating, and partioning the QA-Pagelets from the Deep Web. Two unique features of the Thor framework are 1) the novel page clustering for grouping pages from a Deep Web source into distinct clusters of control-flow dependent pages and 2) the novel subtree filtering algorithm that exploits the structural and content similarity at subtree level to identify the QA-Pagelets within highly ranked page clusters. We evaluate the effectiveness of the Thor framework through experiments using both simulation and real data sets. We show that Thor performs well over millions of Deep Web pages and over a wide range of sources, including e-Commerce sites, general and specialized search engines, corporate Web sites, medical and legal resources, and several others. Our experiments also show that the proposed page clustering algorithm achieves low-entropy clusters, and the subtree filtering algorithm identifies QA-Pagelets with excellent precision and recall.