U-REST: an unsupervised record extraction system

Authors:
Yuan Kui Shen;David R. Karger
Affiliations:
MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA;MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA
Venue:
Proceedings of the 16th international conference on World Wide Web
Year:
2007

Citing 3
Cited 2

A Fully Automated Object Extraction System for the World Wide Web

ICDCS '01 Proceedings of the The 21st International Conference on Distributed Computing Systems
Web data extraction based on partial tree alignment

WWW '05 Proceedings of the 14th international conference on World Wide Web
Thresher: automating the unwrapping of semantic content from the World Wide Web

WWW '05 Proceedings of the 14th international conference on World Wide Web

Information extraction for search engines using fast heuristic techniques

Data & Knowledge Engineering
TEX: An efficient and effective unsupervised Web information extractor

Knowledge-Based Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this paper, we describe a system that can extract recordstructures from web pages with no direct human supervision.Records are commonly occurring HTML-embedded data tuples that describe people, offered courses, products,company profiles, etc. We present a simplified frameworkfor studying the problem of unsupervised record extraction. one which separates the algorithms from the feature engineering.Our system, U-REST formalizes an approach tothe problem of unsupervised record extraction using a simple two-stage machine learning framework. The first stage involves clustering, where structurally similar regions are discovered, and the second stage involves classification, where discovered groupings (clusters of regions) are ranked by their likelihood of being records. In our work, we describe, and summarize the results of an extensive survey of features for both stages. We conclude by comparing U-REST to related systems. The results of our empirical evaluation show encouraging improvements in extraction accuracy.