Learning outliers to refine a corpus for Chinese webpage categorization

  • Authors:
  • Dingsheng Luo;Xinhao Wang;Xihong Wu;Huisheng Chi

  • Affiliations:
  • National Laboratory on Machine Perception, School of Electronics Engineering & Computer Science, Peking University, Beijing, China (all four authors)

  • Venue:
  • ICNC'05 Proceedings of the First international conference on Advances in Natural Computation - Volume Part I
  • Year:
  • 2005

Abstract

Webpage categorization has become an important topic in recent years. Text is usually the main content of a webpage, so automatic text categorization (ATC) is the key technique for this task. For Chinese text categorization, as well as Chinese webpage categorization, one of the basic and urgent problems is the construction of a good benchmark corpus. This study presents a machine learning approach to refining a corpus for Chinese webpage categorization, in which the AdaBoost algorithm is used to identify outliers in the corpus. The webpage categorization system itself is built with the standard k-nearest-neighbor (kNN) algorithm under a vector space model (VSM). Simulation results, together with manual inspection of the identified outliers, indicate that the presented method works well.
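
To make the kNN-under-VSM setup mentioned in the abstract concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not state the term weighting, similarity measure, or value of k, so raw term frequency, cosine similarity, and k=3 are assumptions chosen for illustration. The function names (`tf_vector`, `cosine`, `knn_classify`) and the toy corpus are hypothetical, and the documents are assumed to be already tokenized (Chinese text would first need word segmentation).

```python
import math
from collections import Counter


def tf_vector(tokens):
    """Represent a tokenized document as a sparse term-frequency vector."""
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts of term -> weight)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(query_tokens, corpus, k=3):
    """Return the majority label among the k training documents most similar
    to the query. `corpus` is a list of (tokens, label) pairs."""
    q = tf_vector(query_tokens)
    scored = sorted(
        ((cosine(q, tf_vector(tokens)), label) for tokens, label in corpus),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy, hypothetical corpus; English tokens stand in for segmented Chinese words.
corpus = [
    (["football", "match", "goal"], "sports"),
    (["team", "goal", "league"], "sports"),
    (["stock", "market", "bank"], "finance"),
    (["bank", "loan", "interest"], "finance"),
]

predicted = knn_classify(["goal", "league", "match"], corpus, k=3)
print(predicted)
```

A classifier of this kind is sensitive to mislabeled or atypical training pages, because every stored document can vote. That sensitivity is one reason why removing outliers from the corpus, as the abstract proposes, could be expected to help.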