HW-STALKER: a machine learning-based system for transforming QURE-Pagelets to XML

Authors:
Vladimir Kovalev;Sourav S. Bhowmick;Sanjay Madria
Affiliations:
School of Computer Engineering, Division of Information Systems, Nanyang Technological University, Singapore 639798, Singapore;School of Computer Engineering, Division of Information Systems, Nanyang Technological University, Singapore 639798, Singapore;Department of Computer Science, University of Missouri-Rolla, Rolla
Venue:
Data & Knowledge Engineering
Year:
2005

Abstract

In this paper, we address the problem of extracting dynamically generated, hyperlinked hidden web query results and transforming them to XML. Our approach builds on STALKER. Because STALKER was designed to extract data from a single web page, it cannot handle a set of hyperlinked pages. We propose an algorithm called HW-Transform that uses machine learning to transform hidden web query results (also called QURE-Pagelets) to XML, extending STALKER to handle hyperlinked hidden web pages. A key feature of our approach is that we identify the key attributes of query results and transform them into XML attributes. These key attributes facilitate applications such as change detection and data integration by making it efficient to identify related or identical results. Based on the proposed algorithm, we have implemented a prototype system in Java called HW-STALKER. Our experiments demonstrate that HW-Transform achieves acceptable performance in transforming QURE-Pagelets to XML.
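
The idea of emitting key attributes as XML attributes can be illustrated with a minimal Java sketch. This is not the HW-Transform algorithm or any part of HW-STALKER: the class name `KeyAttributeXml`, the method `toXml`, the `result` element name, and the sample field names are all hypothetical, and the sketch assumes the fields have already been extracted and that field names are valid XML names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not taken from the HW-STALKER prototype.
public class KeyAttributeXml {

    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /**
     * Serializes one already-extracted query result. Fields listed in keyFields
     * become XML attributes of the <result> element; every other field becomes
     * a child element. Field names are assumed to be valid XML names.
     */
    static String toXml(Map<String, String> fields, Set<String> keyFields) {
        StringBuilder attrs = new StringBuilder();
        StringBuilder children = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            String name = e.getKey();
            String value = escape(e.getValue());
            if (keyFields.contains(name)) {
                attrs.append(' ').append(name).append("=\"").append(value).append('"');
            } else {
                children.append('<').append(name).append('>')
                        .append(value)
                        .append("</").append(name).append('>');
            }
        }
        return "<result" + attrs + ">" + children + "</result>";
    }

    public static void main(String[] args) {
        // Sample record with made-up values; LinkedHashMap keeps insertion order.
        Map<String, String> record = new LinkedHashMap<>();
        record.put("id", "42");
        record.put("title", "Data & Knowledge");
        record.put("price", "10.00");
        System.out.println(toXml(record, Set.of("id")));
    }
}
```

With the key on the element itself, an application such as change detection can pair results from two query runs by comparing attribute values alone, without inspecting the child elements.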