Noise reduction through summarization for Web-page classification

  • Authors: Dou Shen; Qiang Yang; Zheng Chen

  • Affiliations: Department of Computer Science and Technology, Hong Kong University of Science and Technology, Hong Kong, PR China; Department of Computer Science and Technology, Hong Kong University of Science and Technology, Hong Kong, PR China; Microsoft Research Asia, Beijing, PR China

  • Venue: Information Processing and Management: an International Journal
  • Year: 2007

Abstract

Because Web pages embed a large variety of noisy information, Web-page classification is much more difficult than pure-text classification. In this paper, we propose to improve Web-page classification performance by removing this noise through summarization techniques. We first give empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms. We then put forward a new Web-page summarization algorithm based on Web-page layout and evaluate it, along with several other state-of-the-art text summarization algorithms, on the LookSmart Web directory. Experimental results show that classification algorithms (naive Bayes, NB, or support vector machines, SVM) augmented by any of these summarization approaches achieve an improvement of more than 5.0% over pure-text-based classification algorithms. We further introduce an ensemble method that combines the different summarization algorithms. The ensemble summarization method achieves an improvement of more than 12.0% over pure-text-based methods.
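
The summarize-then-classify idea in the abstract can be illustrated with a minimal sketch. This is not the layout-based algorithm or the ensemble method of the paper, whose details the abstract does not give. It assumes a simple Luhn-style word-frequency summarizer as a stand-in and a small multinomial naive Bayes classifier with Laplace smoothing. The names `tokenize`, `summarize`, and `NaiveBayes`, the stopword list, and the toy training data are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "on", "is", "our", "here"}

def tokenize(text):
    """Return lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def summarize(sentences, k=2):
    """Keep the k sentences whose words are, on average, most frequent on the page.

    Stand-in summarizer (Luhn-style frequency heuristic), not the paper's method.
    """
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # restore original page order

class NaiveBayes:
    """Multinomial naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.doc_counts = Counter(labels)
        self.word_counts = {}
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = tokenize(doc)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, doc):
        tokens = tokenize(doc)
        total_docs = sum(self.doc_counts.values())
        best_label, best_logp = None, -math.inf
        for label, counts in self.word_counts.items():
            logp = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for w in tokens:
                logp += math.log((counts[w] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

# Toy data, invented for illustration only.
model = NaiveBayes().fit(
    ["football match goal team league", "team player score match",
     "stock market shares bank profit", "bank interest loan market"],
    ["sports", "sports", "finance", "finance"],
)
page = [
    "Click here to subscribe to our newsletter.",   # navigation noise
    "The team won the match with a late goal.",
    "The match put the team top of the league.",
    "Copyright notice and privacy policy.",         # footer noise
]
summary = summarize(page, k=2)
print(summary)
print(model.predict(" ".join(summary)))
```

On this toy page the summarizer drops the navigation and footer sentences, so the classifier sees only the two content sentences. The example shows the shape of the pipeline only and says nothing about the size of the improvement reported in the abstract.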