DataRover: a taxonomy based crawler for automated data extraction from data-intensive websites

  • Authors:
  • H. Davulcu; S. Koduri; S. Nagarajan

  • Affiliations:
  • Arizona State University, Tempe, AZ; Arizona State University, Tempe, AZ; Arizona State University, Tempe, AZ

  • Venue:
  • WIDM '03 Proceedings of the 5th ACM international workshop on Web information and data management
  • Year:
  • 2003

Abstract

The advent of e-commerce has brought thousands of catalogs online. Most of these websites are "taxonomy-directed": a website is taxonomy-directed if it contains at least one taxonomy for organizing its contents and presents the instances belonging to a category in a regular fashion. This paper describes the DataRover system, which automatically crawls taxonomy-directed online catalogs and extracts products from them. DataRover uses heuristic rules to discover structural regularities among taxonomy segments, list-of-product pages, and single-product pages, and it uses these regularities to turn online catalogs into a database of categorized products, without user interaction or the burden of wrapper maintenance. We provide experimental results that demonstrate the efficacy of DataRover and point out its current limitations.
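
To make the idea of a taxonomy-directed crawl concrete, here is a minimal Python sketch. It is not DataRover's implementation: the abstract does not spell out its heuristic rules, so the sketch substitutes one simple assumption, namely that a page with no outgoing links is a single-product page. The toy `SITE` dictionary, the tag-path strings, and the `crawl` function are all hypothetical names introduced for illustration, and the site is modeled in memory rather than fetched over the network.

```python
from collections import deque

# Hypothetical in-memory catalog. Each page is a list of
# (tag_path, text, href) items; href is None for non-link content.
SITE = {
    "/": [("ul/li/a", "Laptops", "/laptops"), ("ul/li/a", "Phones", "/phones")],
    "/laptops": [
        ("table/tr/td/a", "Laptop A", "/p/1"),
        ("table/tr/td/a", "Laptop B", "/p/2"),
    ],
    "/phones": [("table/tr/td/a", "Phone X", "/p/3")],
    "/p/1": [("h1", "Laptop A", None), ("span.price", "$900", None)],
    "/p/2": [("h1", "Laptop B", None), ("span.price", "$1200", None)],
    "/p/3": [("h1", "Phone X", None), ("span.price", "$500", None)],
}


def crawl(site, root="/"):
    """Breadth-first walk that carries the category path down to leaf pages."""
    products = []
    queue = deque([(root, ())])  # (url, link texts followed so far)
    seen = {root}
    while queue:
        url, path = queue.popleft()
        items = site[url]
        links = [(text, href) for _, text, href in items if href]
        if not links:
            # Assumed heuristic: a page without links is a single-product page.
            # Its fields are keyed by tag path, which is identical across
            # product pages that share the same layout.
            record = {tag: text for tag, text, _ in items}
            # The last link text is the product's own name, so drop it
            # to obtain the category path.
            products.append({"category": path[:-1], "url": url, **record})
            continue
        for text, href in links:
            if href not in seen:
                seen.add(href)
                queue.append((href, path + (text,)))
    return products


products = crawl(SITE)
for row in products:
    print(row)
```

Each resulting row pairs a product's fields with the category path under which it was reached, which is the kind of categorized product database the abstract describes. A real system would need far more robust page classification than the leaf-page assumption used here.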