Probe, Cluster, and Discover: Focused Extraction of QA-Pagelets from the Deep Web

Authors:
James Caverlee; Ling Liu; David Buttler
Venue:
ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Year:
2004


Abstract

In this paper, we introduce the concept of a QA-Pagelet to refer to the content region in a dynamic page that contains query matches. We present THOR, a scalable and efficient mining system for discovering and extracting QA-Pagelets from the Deep Web. A unique feature of THOR is its two-phase extraction framework. In the first phase, pages from a deep web site are grouped into distinct clusters of structurally-similar pages. In the second phase, pages from each page cluster are examined through a subtree filtering algorithm that exploits the structural and content similarity at subtree level to identify the QA-Pagelets.
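
The two-phase idea can be illustrated with a toy sketch. This is not THOR's algorithm, which the abstract does not specify in detail. As assumptions made only for illustration, the sketch uses cosine similarity over tag counts as the structural signature in phase one, and in phase two it reports the smallest subtree that encloses all text that differs across the pages of a cluster. The names `cluster_pages`, `varying_region`, and the 0.9 threshold are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
from math import sqrt

VOID = {"br", "hr", "img", "input", "meta", "link"}  # tags with no end tag


class PathText(HTMLParser):
    """Collect tag counts and the text found directly inside each element path."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # path of currently open elements
        self.child_counts = [Counter()]  # per-depth sibling counters
        self.tags = Counter()            # tag frequencies (structural signature)
        self.text = {}                   # element path -> direct text

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag in VOID:
            return
        siblings = self.child_counts[-1]
        siblings[tag] += 1
        self.stack.append(f"{tag}[{siblings[tag]}]")
        self.child_counts.append(Counter())
        self.text.setdefault("/".join(self.stack), "")

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1].startswith(tag + "["):
            self.stack.pop()
            self.child_counts.pop()

    def handle_data(self, data):
        if self.stack and data.strip():
            self.text["/".join(self.stack)] += data.strip()


def parse(html):
    parser = PathText()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.text


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_pages(pages, threshold=0.9):
    """Phase 1 (assumed variant): greedy clustering on tag-count similarity."""
    signatures = [parse(page)[0] for page in pages]
    clusters = []  # each cluster is a list of page indices
    for i, sig in enumerate(signatures):
        for cluster in clusters:
            if cosine(sig, signatures[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def varying_region(pages):
    """Phase 2 (assumed variant): smallest subtree enclosing all varying text."""
    texts = [parse(page)[1] for page in pages]
    all_paths = set().union(*texts)
    varying = [p for p in all_paths if len({t.get(p) for t in texts}) > 1]
    if not varying:
        return None
    prefix = []
    for parts in zip(*(p.split("/") for p in varying)):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return "/".join(prefix)


r1 = "<html><body><h1>Shop</h1><div><ul><li>red shoe</li><li>blue shoe</li></ul></div><p>Contact</p></body></html>"
r2 = "<html><body><h1>Shop</h1><div><ul><li>green hat</li><li>wool hat</li><li>sun hat</li></ul></div><p>Contact</p></body></html>"
no_match = "<html><body><h1>Shop</h1><p>No results found</p><p>Contact</p></body></html>"

clusters = cluster_pages([r1, r2, no_match])
region = varying_region([r1, r2])
print(clusters, region)
```

On these three hand-written pages, the two result pages land in one cluster and the "no results" page in another, and the list element holding the query matches is reported as the varying region, while the static heading and footer are ignored. A real system would need a more robust structural signature and would also compare subtree shape, not only text.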