Fast supervised feature extraction by term discrimination information pooling

  • Authors:
  • Amara Tariq; Asim Karim

  • Affiliations:
  • LUMS School of Science and Engineering, Lahore, Pakistan; LUMS School of Science and Engineering, Lahore, Pakistan

  • Venue:
  • Proceedings of the 20th ACM international conference on Information and knowledge management
  • Year:
  • 2011

Abstract

Dimensionality reduction (DR) through feature extraction (FE) is desirable for efficient and effective processing of text documents. Many text FE techniques produce features that are not readily interpretable and require super-linear computation time. In this paper, we present a fast supervised DR/FE technique, named FEDIP, that is motivated by the notion of relatedness of terms to topics or contexts. This relatedness is quantified by the discrimination information that a term provides for a topic in a labeled document collection. Features are constructed by pooling the discrimination information of highly related terms for each topic. FEDIP's time complexity is linear in the size of the vocabulary and of the document collection. FEDIP is evaluated for document classification with SVM and naive Bayes classifiers on six text data sets. The results show that FEDIP produces low-dimensional feature spaces that yield higher classification accuracy than LDA and LSI. FEDIP is also found to be significantly faster than these techniques on our evaluation data sets.
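
The pooling idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' FEDIP implementation: the abstract does not give the exact discrimination measure, so the sketch assumes a smoothed log ratio of a term's document frequency inside a topic versus outside it. The function names fit_term_weights and transform, the smoothing parameter alpha, and the rule of assigning each term to its single best topic are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def fit_term_weights(docs, labels, alpha=1.0):
    """Assign each term to its most related topic with a discrimination weight.

    docs: list of token lists; labels: one topic label per document.
    ASSUMPTION: relatedness is a smoothed log ratio of in-topic to
    out-of-topic document frequency; the paper may use a different measure.
    """
    topics = sorted(set(labels))
    n_docs = Counter(labels)              # number of documents per topic
    df = {t: Counter() for t in topics}   # per-topic document frequency
    for tokens, y in zip(docs, labels):
        df[y].update(set(tokens))
    vocab = set().union(*(df[t].keys() for t in topics))
    total_docs = len(docs)

    weights = {}
    for term in vocab:
        total_df = sum(df[t][term] for t in topics)
        best_topic, best_score = None, 0.0
        for t in topics:
            p_in = (df[t][term] + alpha) / (n_docs[t] + 2 * alpha)
            p_out = (total_df - df[t][term] + alpha) / (
                total_docs - n_docs[t] + 2 * alpha
            )
            score = math.log(p_in / p_out)
            # Keep only terms that positively discriminate for some topic.
            if score > best_score:
                best_topic, best_score = t, score
        if best_topic is not None:
            weights[term] = (best_topic, best_score)
    return topics, weights


def transform(tokens, topics, weights):
    """Map a tokenized document to one pooled feature per topic."""
    pooled = dict.fromkeys(topics, 0.0)
    for tok in tokens:
        if tok in weights:
            topic, w = weights[tok]
            pooled[topic] += w   # pool the weights of terms tied to this topic
    return [pooled[t] for t in topics]
```

Under these assumptions, fitting makes one pass over the documents and one pass over the vocabulary per topic, and the resulting feature space has one dimension per topic, so each feature can be read as the pooled evidence for that topic. The feature vectors returned by transform could then be passed to any standard classifier.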