Leveraging Web 2.0 Sources for Web Content Classification

  • Authors:
  • Somnath Banerjee;Martin Scholz

  • Affiliations:
  • -;-

  • Venue:
  • WI-IAT '08 Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

This paper addresses practical aspects of web page classification not captured by the classical text mining framework. Classifiers are supposed to perform well on a broad variety of pages. We argue that constructing training corpora is a bottleneck for building such classifiers, and that care has to be taken if the goal is to generalize to previously unseen kinds of pages on the web. We study techniques for building training corpora automatically from publicly available web resources, quantify the discrepancy between them, and demonstrate that encouraging agreement between classifiers given such diverse sources drastically outperforms methods that ignore the different natures of data sources on the web.