Learning outliers to refine a corpus for Chinese webpage categorization

Authors:
Dingsheng Luo; Xinhao Wang; Xihong Wu; Huisheng Chi
Affiliations:
National Laboratory on Machine Perception, School of Electronics Engineering & Computer Science, Peking University, Beijing, China (all four authors)
Venue:
ICNC'05 Proceedings of the First international conference on Advances in Natural Computation - Volume Part I
Year:
2005

Abstract

Webpage categorization has become an important topic in recent years. Because text is usually the main content of a webpage, automatic text categorization (ATC) is the key technique for this task. For Chinese text categorization as well as Chinese webpage categorization, one of the basic and urgent problems is the construction of a good benchmark corpus. This study presents a machine learning approach to refining a corpus for Chinese webpage categorization, in which the AdaBoost algorithm is used to identify outliers in the corpus. The webpage categorization system itself is built with the standard k-nearest neighbor (kNN) algorithm under a vector space model (VSM). Simulation results, together with manual inspection of the identified outliers, indicate that the presented method works well.
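
The abstract does not say how AdaBoost is used to flag outliers. One common heuristic, assumed here rather than taken from the paper, is that samples AdaBoost keeps misclassifying accumulate large weights, so a high average weight across boosting rounds marks a sample as an outlier candidate for manual review. The sketch below illustrates that heuristic with discrete AdaBoost over decision stumps on binary labels (+1 / -1). The function names, the toy one-dimensional data, the averaging of weights, and the `rounds` and `top_k` parameters are all hypothetical.

```python
import math

def best_stump(X, y, w):
    """Exhaustively pick the (error, feature, threshold, polarity) stump
    with the lowest weighted error. A stump predicts `polarity` when
    x[feature] >= threshold and `-polarity` otherwise."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (pol if row[f] >= thr else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best

def adaboost_mean_weights(X, y, rounds=3):
    """Run discrete AdaBoost and return each sample's weight averaged
    over all rounds (assumption: used here as an outlier score)."""
    n = len(X)
    w = [1.0 / n] * n
    totals = [0.0] * n
    for _ in range(rounds):
        err, f, thr, pol = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        preds = [pol if row[f] >= thr else -pol for row in X]
        # Misclassified samples (y * pred == -1) are up-weighted.
        w = [wi * math.exp(-alpha * yi * pi)
             for wi, yi, pi in zip(w, y, preds)]
        z = sum(w)
        w = [wi / z for wi in w]
        totals = [t + wi for t, wi in zip(totals, w)]
    return [t / rounds for t in totals]

def outlier_candidates(X, y, rounds=3, top_k=1):
    """Return indices of the top_k samples with the highest mean weight."""
    scores = adaboost_mean_weights(X, y, rounds)
    order = sorted(range(len(X)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]

# Toy data (hypothetical): one feature per sample; the last sample sits
# among the negatives but carries a positive label.
X = [[0], [1], [2], [3], [6], [7], [8], [9], [1.5]]
y = [-1, -1, -1, -1, 1, 1, 1, 1, 1]
print(outlier_candidates(X, y, rounds=3, top_k=1))
```

In a real corpus each row of `X` would be a VSM term-weight vector and the multi-class category labels would need a one-vs-rest (or similar) reduction, which this sketch leaves out. Flagged samples are only candidates: as in the abstract, they would still need manual inspection before being removed or relabeled.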