OLERA: Semisupervised Web-Data Extraction with Visual Support

  • Authors: Chia-Hui Chang; Shih-Chien Kuo
  • Affiliations: National Central University, Taiwan; Trend Micro, Taiwan
  • Venue: IEEE Intelligent Systems
  • Year: 2004

Abstract

Extracting information from semistructured Web documents is an important task for many information agents. Over the past few years, researchers have developed an extensive family of generic information extraction (IE) techniques based on supervised approaches that learn extraction rules from user-labeled training examples. However, annotating training data can be expensive when thousands of data sources must be wrapped. OLERA, a semisupervised IE system, produces extraction rules without detailed annotation of the training documents. Instead, it requires only a rough segment, containing everything that needs to be extracted from one record, as an example. OLERA is designed with visualization support: it displays the discovered records in a spreadsheet-like table for schema assignment. Experiments show that OLERA performs well on program-generated Web pages with very few training pages and little user intervention.