The advent of e-commerce has brought thousands of catalogs online. Most of these Web sites are "taxonomy-directed": a Web site is taxonomy-directed if it contains at least one taxonomy for organizing its contents and presents the instances belonging to each category in a regular fashion. This paper describes the DataRover system, which automatically crawls taxonomy-directed online catalogs and extracts their products. DataRover uses heuristic rules to discover the structural regularities among taxonomy segments, list-of-product pages, and single-product pages, and it uses these regularities to turn online catalogs into a database of categorized products, without user interaction and without the burden of wrapper maintenance. We provide experimental results that demonstrate the efficacy of DataRover and point to its current limitations.
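To make the notion of "structural regularity" concrete, the following is a minimal sketch of one generic way such regularity might be detected: counting how often the same tag path leads to visible text on a page. It is not DataRover's actual rule set, which the abstract does not specify. The names `TagPathCollector`, `repeated_paths`, and the `min_repeats` threshold are hypothetical and chosen only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag; they are kept off the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Counts text nodes by the path of open tags that leads to them."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags, outermost first
        self.paths = Counter()   # tag path -> number of text nodes seen

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate sloppy markup: unwind to the matching open tag, if any.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip():
            self.paths["/".join(self.stack)] += 1


def repeated_paths(html, min_repeats=3):
    """Return tag paths that reach text at least `min_repeats` times.

    Illustrative assumption: a path that repeats often suggests a
    regularly presented list, such as products within a category.
    """
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return {p: n for p, n in collector.paths.items() if n >= min_repeats}


page = (
    "<html><body><h1>Lamps</h1><ul>"
    "<li><a>Desk lamp</a></li>"
    "<li><a>Floor lamp</a></li>"
    "<li><a>Wall lamp</a></li>"
    "</ul></body></html>"
)
print(repeated_paths(page))  # {'html/body/ul/li/a': 3}
```

In this toy page the heading occurs once and is ignored, while the three product links share one tag path and are reported together. A real catalog would need far more robust heuristics than this single signal.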