Stylistic and lexical co-training for web block classification

  • Authors: Chee How Lee; Min-Yen Kan; Sandra Lai
  • Affiliations: National University of Singapore (all authors)
  • Venue: Proceedings of the 6th annual ACM international workshop on Web information and data management
  • Year: 2004

Abstract

Many applications that use web data extract information from only a limited number of regions on a web page. As a result, dividing a web page into blocks and then classifying those blocks has become a common preprocessing step. We introduce PARCELS, an open-source, co-trained approach that performs classification based on separate stylistic and lexical views of the web page. Unlike previous work, PARCELS performs classification on fine-grained blocks. In addition to table-based layouts, the system handles real-world pages whose layout is based on divisions and spans, and it performs stylistic inference for pages that use Cascading Style Sheets (CSS). Our evaluation shows that the co-training process reduces the error rate by 28.5% compared with a single-view classifier and that our approach is comparable to other state-of-the-art systems.
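To make the idea of co-training over two views concrete, the following is a minimal sketch and not the PARCELS implementation. It assumes each block is described by two small numeric feature vectors (a hypothetical "stylistic" view and a hypothetical "lexical" view), and it swaps in a simple nearest-centroid classifier for whatever learners the system actually uses. The feature values, the labels `nav` and `content`, and all function names are illustrative assumptions.

```python
from collections import defaultdict


def centroids(examples):
    """Mean feature vector per label; examples is a list of (vector, label)."""
    sums, counts = {}, defaultdict(int)
    for vec, label in examples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] += 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(cents, vec):
    """Return (label, margin): the nearest centroid and its lead over the runner-up."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(c, vec)), lab) for lab, c in cents.items()
    )
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
    return dists[0][1], margin


def co_train(labeled, unlabeled, rounds=5, per_round=1):
    """Grow the labeled set by letting each view label its most confident blocks.

    labeled:   list of ((style_vec, lex_vec), label)
    unlabeled: list of (style_vec, lex_vec)
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):  # 0 = stylistic view, 1 = lexical view
            if not pool:
                break
            # Train this view's classifier on everything labeled so far.
            cents = centroids([(views[view], lab) for views, lab in labeled])
            # Rank unlabeled blocks by this view's confidence (margin).
            scored = sorted(
                ((predict(cents, views[view]), idx) for idx, views in enumerate(pool)),
                key=lambda item: item[0][1],
                reverse=True,
            )
            chosen = scored[:per_round]
            # The pseudo-labeled blocks become training data for BOTH views.
            for (label, _margin), idx in chosen:
                labeled.append((pool[idx], label))
            taken = {idx for _, idx in chosen}
            pool = [v for i, v in enumerate(pool) if i not in taken]
    return labeled


# Hypothetical toy data: (stylistic features, lexical features) per block.
seed = [
    (((0.9, 0.1), (0.1, 0.2)), "nav"),
    (((0.1, 0.9), (0.9, 0.8)), "content"),
]
pool = [
    ((0.8, 0.2), (0.2, 0.1)),
    ((0.2, 0.8), (0.8, 0.9)),
    ((0.85, 0.15), (0.15, 0.15)),
    ((0.15, 0.85), (0.85, 0.85)),
]
result = co_train(seed, pool)
```

The key design point in this sketch is that a block labeled confidently by one view is added to the shared training set, so the other view can learn from it in the next step. A real system would also need a stopping rule and a guard against low-confidence pseudo-labels, which this toy version omits.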