Datarover: a taxonomy based crawler for automated data extraction from data-intensive websites

Authors: H. Davulcu, S. Koduri, S. Nagarajan
Affiliation: Arizona State University, Tempe, AZ (all authors)
Venue: WIDM '03 Proceedings of the 5th ACM international workshop on Web information and data management
Year: 2003

Abstract

The advent of e-commerce has brought thousands of catalogs online. Most of these websites are "taxonomy-directed": a website is taxonomy-directed if it contains at least one taxonomy for organizing its contents and presents the instances belonging to a category in a regular fashion. This paper describes the DataRover system, which automatically crawls taxonomy-directed online catalogs and extracts their products. DataRover uses heuristic rules to discover the structural regularities among taxonomy segments, list-of-product pages, and single-product pages, and it uses these regularities to turn online catalogs into a database of categorized products without user interaction or the burden of wrapper maintenance. We provide experimental results that demonstrate the efficacy of DataRover and point to its current limitations.