A Comparative Study on Statistical Machine Learning Algorithms and Thresholding Strategies for Automatic Text Categorization

  • Authors:
  • Kang Hyuk Lee;Judy Kay;Byeong Ho Kang;Uwe Rosebrock

  • Venue:
  • PRICAI '02 Proceedings of the 7th Pacific Rim International Conference on Artificial Intelligence: Trends in Artificial Intelligence
  • Year:
  • 2002

Abstract

Two main research areas in statistical text categorization are similarity-based learning algorithms and associated thresholding strategies. The combination of these techniques significantly influences the overall performance of text categorization. After investigating two similarity-based classifiers (k-NN and Rocchio) and three common thresholding techniques (RCut, PCut, and SCut), we describe a new learning algorithm known as the keyword association network (KAN) and a new thresholding strategy (RinSCut) to improve performance over existing techniques. Extensive experiments have been conducted on the Reuters-21578 and 20-Newsgroups data sets. The experimental results show that our new approaches give better results for both micro-averaged F1 and macro-averaged F1 scores.
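
The abstract names the thresholding strategies and evaluation measures without defining them, so a small illustration may help. The Python sketch below follows the commonly used definitions: RCut keeps the top-ranked categories for each document, SCut applies a per-category score threshold, and micro- and macro-averaged F1 pool or average the per-category counts. It is not the authors' implementation. The function names, category labels, scores, and thresholds are hypothetical, and PCut, RinSCut, and KAN are left out because the abstract gives no detail on them.

```python
from typing import Dict, List, Set, Tuple


def rcut(scores: Dict[str, float], t: int = 1) -> Set[str]:
    """Rank-based cut: keep the t highest-scoring categories for one document."""
    ranked = sorted(scores, key=lambda c: scores[c], reverse=True)
    return set(ranked[:t])


def scut(scores: Dict[str, float], thresholds: Dict[str, float]) -> Set[str]:
    """Score-based cut: keep each category whose score reaches its own threshold.

    In practice the thresholds would be tuned per category on held-out data;
    here they are simply passed in (an assumption for illustration).
    """
    return {c for c, s in scores.items() if s >= thresholds[c]}


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there are no counts at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    gold: List[Set[str]], pred: List[Set[str]], categories: List[str]
) -> Tuple[float, float]:
    """Return (micro-averaged F1, macro-averaged F1) over the given categories."""
    counts = {c: [0, 0, 0] for c in categories}  # [tp, fp, fn] per category
    for g, p in zip(gold, pred):
        for c in categories:
            if c in p and c in g:
                counts[c][0] += 1
            elif c in p:
                counts[c][1] += 1
            elif c in g:
                counts[c][2] += 1
    # Macro: average the per-category F1 values (every category weighs the same).
    macro = sum(f1(*counts[c]) for c in categories) / len(categories)
    # Micro: pool the counts first (frequent categories dominate).
    tp = sum(v[0] for v in counts.values())
    fp = sum(v[1] for v in counts.values())
    fn = sum(v[2] for v in counts.values())
    return f1(tp, fp, fn), macro


if __name__ == "__main__":
    # Hypothetical classifier scores for a single document.
    doc_scores = {"grain": 0.9, "trade": 0.4, "crude": 0.7}
    print(rcut(doc_scores, t=2))
    print(scut(doc_scores, {"grain": 0.5, "trade": 0.5, "crude": 0.8}))

    # Toy gold and predicted label sets for four documents.
    gold = [{"grain"}, {"grain"}, {"trade"}, {"grain"}]
    pred = [{"grain"}, {"grain"}, {"grain"}, {"grain"}]
    print(micro_macro_f1(gold, pred, ["grain", "trade"]))
```

In the toy data the rare category is never predicted, so the macro average falls well below the micro average. This is the usual reason for reporting both measures, as the abstract does.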