An Evaluation of Statistical Approaches to Text Categorization

  • Authors: Yiming Yang
  • Affiliations: School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-3702, USA. yiming@cs.cmu.edu
  • Venue: Information Retrieval
  • Year: 1999

Abstract

This paper focuses on a comparative evaluation of a wide range of text categorization methods, including previously published results on the Reuters corpus and new results from additional experiments. A controlled study using three classifiers, kNN, LLSF and WORD, was conducted to examine the impact of configuration variations in five versions of Reuters on the observed performance of classifiers. Analysis and empirical evidence suggest that the evaluation results on some versions of Reuters were significantly affected by the inclusion of a large portion of unlabelled documents, making those results difficult to interpret and leading to considerable confusion in the literature. Using the results evaluated on the other versions of Reuters, which exclude the unlabelled documents, the performance of twelve methods is compared directly or indirectly. For indirect comparisons, kNN, LLSF and WORD were used as baselines, since they were evaluated on all versions of Reuters that exclude the unlabelled documents. As a global observation, kNN, LLSF and a neural network method had the best performance; except for a Naive Bayes approach, the other learning algorithms also performed relatively well.
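
For readers unfamiliar with the classifiers named in the abstract, the following is a minimal sketch of two baselines in the spirit of those compared: kNN (assign categories by vote of the nearest training documents) and LLSF (learn a linear least squares mapping from term vectors to category vectors). It assumes scikit-learn and NumPy; the toy documents and category names are invented for illustration and this is not the paper's implementation or its Reuters preprocessing.

```python
# Hypothetical sketch of kNN and LLSF text categorization baselines.
# Toy data only; not the paper's implementation.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Invented toy documents with Reuters-style category labels.
train_docs = [
    "company reports quarterly profit rise",
    "net earnings per share increased",
    "crude oil prices climbed on supply fears",
    "opec output cut lifts oil futures",
]
train_labels = ["earn", "earn", "crude", "crude"]
test_docs = [
    "oil supply worries push prices higher",
    "firm posts higher quarterly earnings",
]

# Shared tf-idf term weighting, a common choice in this literature.
vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

# --- kNN: categorize by majority vote of the nearest neighbours
# under cosine similarity between document vectors. ---
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X_train, train_labels)
print("kNN: ", knn.predict(X_test))

# --- LLSF: solve a least squares problem mapping term vectors to
# one-hot category vectors, then score test documents with the
# learned matrix and pick the highest-scoring category. ---
cats = sorted(set(train_labels))
Y = np.array([[1.0 if c == lbl else 0.0 for c in cats]
              for lbl in train_labels])
W, *_ = np.linalg.lstsq(X_train.toarray(), Y, rcond=None)
scores = X_test.toarray() @ W          # shape: (n_test, n_categories)
print("LLSF:", [cats[i] for i in scores.argmax(axis=1)])
```

The LLSF half illustrates why the method scales to many categories: one matrix solve yields scores for all categories at once, whereas kNN defers all work to query time.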