Text Document Categorization by Term Association

  • Authors: Maria-Luiza Antonie; Osmar R. Zaïane
  • Venue: ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
  • Year: 2002

Abstract

A good text classifier is a classifier that efficiently categorizes large sets of text documents in a reasonable time frame and with an acceptable accuracy, and that provides classification rules that are human readable for possible fine-tuning. If the training of the classifier is also quick, this could become in some application domains a good asset for the classifier. Many techniques and algorithms for automatic text categorization have been devised. According to published literature, some are more accurate than others, and some provide more interpretable classification models than others. However, none can combine all the beneficial properties enumerated above. In this paper, we present a novel approach for automatic text categorization that borrows from market basket analysis techniques using association rule mining in the data-mining field. We focus on two major problems: (1) finding the best term association rules in a textual database by generating and pruning; and (2) using the rules to build a text classifier. Our text categorization method proves to be efficient and effective, and experiments on well-known collections show that the classifier performs well. In addition, training as well as classification are both fast and the generated rules are human readable.
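
To make the two steps in the abstract concrete, here is a minimal Python sketch of the general idea: generate rules of the form "set of terms => category", prune them by support and confidence, then classify a new document by the rules it matches. It is not the authors' algorithm. The function names (mine_rules, classify), the per-category definition of support, the thresholds, the limit on term-set size, the confidence-sum scoring, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def mine_rules(docs, min_support=0.6, min_confidence=0.7, max_len=2):
    """Generate and prune term-association rules (illustrative only).

    docs: list of (terms, category) pairs, where terms is a set of strings.
    Returns a list of (termset, category, support, confidence) tuples.
    Assumption: support is measured inside each category, confidence over
    the whole training set.
    """
    docs = [(frozenset(terms), cat) for terms, cat in docs]
    by_cat = defaultdict(list)
    for terms, cat in docs:
        by_cat[cat].append(terms)

    rules = []
    for cat, cat_docs in by_cat.items():
        # Generation: count every term set of size 1..max_len per category.
        counts = Counter()
        for terms in cat_docs:
            for k in range(1, max_len + 1):
                for combo in combinations(sorted(terms), k):
                    counts[frozenset(combo)] += 1
        # Pruning: drop rules that are too rare or too ambiguous.
        for termset, n in counts.items():
            support = n / len(cat_docs)
            if support < min_support:
                continue
            total = sum(1 for terms, _ in docs if termset <= terms)
            confidence = n / total
            if confidence >= min_confidence:
                rules.append((termset, cat, support, confidence))
    return rules


def classify(rules, terms):
    """Score each category by the summed confidence of the matching rules."""
    terms = frozenset(terms)
    scores = defaultdict(float)
    for termset, cat, _support, confidence in rules:
        if termset <= terms:
            scores[cat] += confidence
    return max(scores, key=scores.get) if scores else None


# Toy training set (hypothetical data, not from the paper's collections).
training = [
    ({"wheat", "grain", "export"}, "grain"),
    ({"wheat", "harvest"}, "grain"),
    ({"oil", "crude", "export"}, "crude"),
    ({"oil", "barrel"}, "crude"),
]
rules = mine_rules(training)
for termset, cat, support, confidence in rules:
    print(sorted(termset), "=>", cat, support, confidence)
print(classify(rules, {"wheat", "price"}))
```

Because each surviving rule is just a small set of terms pointing at a category, the model can be read and adjusted by hand, which is the readability property the abstract emphasizes. In this toy run the term "export" yields no rule, since it occurs in only half of the documents of each category and so falls below the assumed support threshold.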