CCWrapper: adaptive predefined schema guided web extraction

  • Authors:
  • Jun Gao;Dongqing Yang;Tengjiao Wang

  • Affiliations:
  • Department of Computer Science and Technology, Peking University, Beijing, China;Department of Computer Science and Technology, Peking University, Beijing, China;Department of Computer Science and Technology, Peking University, Beijing, China

  • Venue:
  • WAIM '06 Proceedings of the 7th international conference on Advances in Web-Age Information Management
  • Year:
  • 2006

Abstract

In this paper, we propose CCWrapper (Classification-Cluster), a method for extracting target data items from web pages under the guidance of a predefined schema. CCWrapper combines different features of HTML nodes, including style, structure, thesaurus, and data type attributes, into one unified model and, in the training step, generates extraction rules using Bayes classification. When a new HTML page is processed, CCWrapper computes, for each HTML node, the probability that it is a target element, and then clusters the HTML nodes for extraction based on their intra-document relationships in the HTML document tree. Preliminary experimental results on real-life web sites indicate that CCWrapper is a promising extraction method.
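
The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how features are encoded, how probabilities are smoothed, or which clustering criterion is used. The sketch assumes four categorical features per node (the names `style`, `structure`, `thesaurus`, `dtype` are hypothetical), a naive Bayes classifier with Laplace smoothing standing in for the "Bayes classification" step, and grouping by shared parent path standing in for clustering on intra-document relationships. The schema labels `title`, `price`, and `other`, and all sample data, are invented for illustration.

```python
from collections import Counter, defaultdict
import math

# Hypothetical feature names; the paper's actual feature encoding is not given.
FEATURES = ("style", "structure", "thesaurus", "dtype")


def train(samples):
    """Count label and feature-value frequencies from (features, label) pairs."""
    label_counts = Counter()
    feat_counts = defaultdict(Counter)  # (label, feature) -> value counts
    values = defaultdict(set)           # feature -> distinct values seen
    for feats, label in samples:
        label_counts[label] += 1
        for f in FEATURES:
            feat_counts[(label, f)][feats[f]] += 1
            values[f].add(feats[f])
    return label_counts, feat_counts, values


def posterior(model, feats):
    """Return {label: P(label | feats)} under a naive Bayes assumption."""
    label_counts, feat_counts, values = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, n in label_counts.items():
        lp = math.log(n / total)
        for f in FEATURES:
            count = feat_counts[(label, f)][feats[f]]
            # Laplace smoothing; the extra +1 reserves mass for unseen values.
            lp += math.log((count + 1) / (n + len(values[f]) + 1))
        log_scores[label] = lp
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}


def cluster_by_parent(nodes, model, threshold=0.5):
    """Group confidently labelled nodes that share the same parent path.

    Assumption: siblings under one parent form one record. This is a
    simple stand-in for the paper's clustering step.
    """
    records = defaultdict(dict)
    for path, feats in nodes:
        probs = posterior(model, feats)
        label, p = max(probs.items(), key=lambda kv: kv[1])
        if label != "other" and p >= threshold:
            records[path[:-1]][label] = path
    return dict(records)


# Invented training data: two labelled examples per schema element.
TITLE = {"style": "bold", "structure": "td", "thesaurus": "none", "dtype": "text"}
PRICE = {"style": "plain", "structure": "td", "thesaurus": "price", "dtype": "number"}
OTHER = {"style": "plain", "structure": "div", "thesaurus": "none", "dtype": "text"}
model = train([(TITLE, "title"), (TITLE, "title"),
               (PRICE, "price"), (PRICE, "price"),
               (OTHER, "other"), (OTHER, "other")])

# Invented page: each node is (path from the root, features).
nodes = [
    (("table", "tr[1]", "td[1]"), TITLE),
    (("table", "tr[1]", "td[2]"), PRICE),
    (("table", "tr[2]", "td[1]"), TITLE),
    (("table", "tr[2]", "td[2]"), PRICE),
    (("div[1]",), OTHER),
]
records = cluster_by_parent(nodes, model)
```

With this toy data, `records` maps each table row's path to the nodes labelled as schema elements within it, while the `div` node is discarded as `other`. A real wrapper would need richer features and a clustering criterion that tolerates irregular page structure.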