Classifying High-Dimensional Text and Web Data Using Very Short Patterns

  • Authors:
  • Hassan H. Malik;John R. Kender

  • Venue:
  • ICDM '08 Proceedings of the 2008 Eighth IEEE International Conference on Data Mining
  • Year:
  • 2008

Abstract

In this paper, we propose the "Democratic Classifier", a simple pattern-based classification algorithm that uses very short patterns for classification and does not rely on a minimum support threshold. Borrowing ideas from democracy, our training phase allows each training instance to vote for an equal number of candidate size-2 patterns. The training instances select patterns by effectively balancing the local, class, and global significance of patterns. The selected patterns are simultaneously added to the model for all applicable classes, and a novel power-law-based weighting scheme adjusts their weights with respect to each class. Results of experiments performed on 121 common text and web datasets show that our algorithm almost always outperforms state-of-the-art classification algorithms, without any parameter tuning. On 100 real-life web datasets, the average absolute classification accuracy improvement was as great as 9.4% over SVM, Harmony, C4.5 and KNN. Also, our algorithm ran about 3.5 times faster than the fastest existing pattern-based classification algorithm.
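
The voting idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the significance measures or the exact form of the power-law weighting, so everything below is an assumption made for illustration. The function names (`train`, `classify`), the parameters `votes_per_doc` and `alpha`, the score (local term frequency times class support times the ratio of class support to global support), and the weight (class support raised to the power `alpha`) are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def pairs_of(doc):
    """All size-2 patterns (sorted term pairs) occurring in one document."""
    return list(combinations(sorted(doc), 2))


def train(docs, labels, votes_per_doc=2, alpha=2.0):
    """docs: list of {term: frequency} dicts; labels: parallel class labels.

    Assumed scoring and weighting, not the formulas from the paper.
    """
    class_sizes = Counter(labels)
    n_docs = len(docs)

    # pair -> class -> number of documents of that class containing the pair
    pair_class_count = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        for pair in pairs_of(doc):
            pair_class_count[pair][label] += 1

    # Voting: every training instance selects the same number of patterns.
    selected = set()
    for doc, label in zip(docs, labels):
        def score(pair):
            local = min(doc[pair[0]], doc[pair[1]])              # local
            class_support = pair_class_count[pair][label] / class_sizes[label]
            global_support = sum(pair_class_count[pair].values()) / n_docs
            return local * class_support * class_support / global_support

        ranked = sorted(pairs_of(doc), key=score, reverse=True)
        selected.update(ranked[:votes_per_doc])

    # A selected pattern is added for every class in which it occurs; an
    # assumed power-law weight favors patterns frequent within that class.
    model = defaultdict(dict)
    for pair in selected:
        for label, count in pair_class_count[pair].items():
            model[label][pair] = (count / class_sizes[label]) ** alpha
    return model


def classify(model, doc):
    """Return the class whose matching patterns have the largest total weight."""
    present = set(pairs_of(doc))
    totals = {
        label: sum(w for pair, w in patterns.items() if pair in present)
        for label, patterns in model.items()
    }
    return max(totals, key=totals.get)


# Toy data, invented for this example only.
docs = [
    {"goal": 2, "match": 1, "team": 1},
    {"goal": 1, "team": 2, "coach": 1},
    {"vote": 2, "party": 1, "team": 1},
    {"vote": 1, "party": 2, "election": 1},
]
labels = ["sport", "sport", "politics", "politics"]
model = train(docs, labels)
print(classify(model, {"goal": 1, "team": 1}))
```

Note that the sketch uses no minimum support threshold: the number of patterns in the model is bounded by the number of training instances times `votes_per_doc`, which is one plausible reading of how equal voting replaces support-based pruning.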