Innovating web page classification through reducing noise

  • Authors:
  • Li Xiaoli;Shi Zhongzhi

  • Affiliations:
  • Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, The Chinese Academy of Sciences, Beijing 100080, P.R. China and School of Computing, National University of ...;Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, The Chinese Academy of Sciences, Beijing 100080, P.R. China

  • Venue:
  • Journal of Computer Science and Technology
  • Year:
  • 2002

Abstract

This paper presents a new method for reducing noise in Web page classification. It first describes a representation of a Web page based on HTML tags. A novel distance formula then removes noise from the similarity measure. Based on a careful analysis of Web pages, we design an algorithm that distinguishes related hyperlinks from noisy ones. The non-noisy hyperlinks are then used to improve the performance of Web page classification (the CAWN algorithm): any page can be classified using the text and categories of the neighbor pages related to it. The experimental results show that our approach improves classification accuracy.
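
As a rough illustration of the idea of classifying a page from its related neighbors, the sketch below filters out neighbors that look unrelated and lets the remaining ones vote on a category. It is not the CAWN algorithm: the abstract gives neither the distance formula nor the hyperlink-filtering rule, so cosine similarity over word counts is used as a stand-in, and the names `bag`, `cosine`, `related_neighbors`, `classify` and the `threshold` value are all hypothetical.

```python
import math
from collections import Counter


def bag(text: str) -> Counter:
    """Bag-of-words counts; a stand-in for the paper's HTML-tag-based representation."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (assumed measure, not the paper's formula)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def related_neighbors(page_text, neighbors, threshold=0.2):
    """Keep (text, label) neighbors similar enough to the page; the rest are treated as noisy links."""
    page = bag(page_text)
    return [(text, label) for text, label in neighbors
            if cosine(page, bag(text)) >= threshold]


def classify(page_text, neighbors, threshold=0.2):
    """Return the category with the largest similarity-weighted vote, or None if no neighbor is kept."""
    page = bag(page_text)
    votes = Counter()
    for text, label in related_neighbors(page_text, neighbors, threshold):
        votes[label] += cosine(page, bag(text))
    return votes.most_common(1)[0][0] if votes else None


page = "python code tutorial for sorting algorithms"
neighbors = [
    ("sorting algorithms in python code", "programming"),
    ("buy cheap flights and hotels today", "travel"),
    ("tutorial on python data structures", "programming"),
]
print(classify(page, neighbors))
```

In this toy input the travel page shares no words with the target page, so it falls below the threshold and contributes no vote. A real system would need a better-founded noise criterion than a fixed similarity cutoff.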