Hybrid DIAAF/RS: statistical textual feature selection for language-independent text classification

  • Authors: Yanbo J. Wang; Fan Li; Frans Coenen; Robert Sanderson; Qin Xin

  • Affiliations: Information Management Center, China Minsheng Banking Corp., Ltd., Beijing, China; Information Management Center, China Minsheng Banking Corp., Ltd., Beijing, China; Department of Computer Science, University of Liverpool, Liverpool, UK; Los Alamos National Laboratory, Los Alamos, New Mexico; Simula Research Laboratory, Oslo, Norway

  • Venue: ICDM'10 Proceedings of the 10th industrial conference on Advances in data mining: applications and theoretical aspects
  • Year: 2010

Abstract

Textual Feature Selection (TFS) is an important phase in the process of text classification. It aims to identify the most significant textual features (i.e., key words and/or phrases) in a textual dataset that serve to distinguish between text categories. Basic TFS techniques can be divided into two groups: linguistic and statistical. Because the goal is to build a language-independent text classifier, the study reported here is concerned with statistical TFS only. In this paper, we propose a novel statistical TFS approach that hybridizes the ideas of two existing techniques, DIAAF (Darmstadt Indexing Approach Association Factor) and RS (Relevancy Score). With respect to associative (text) classification, the experimental results demonstrate that the proposed approach can produce greater classification accuracy than alternative approaches.
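
To make the two named building blocks more concrete, the following is a minimal Python sketch of DIAAF and RS as they are commonly formulated in the text classification literature: DIAAF as the fraction of documents containing a term that belong to a given class, and RS as the log ratio of damped in-class and out-of-class document frequencies. It does not reproduce the hybrid measure proposed in the paper, which the abstract does not specify. The function names, the toy dataset `DOCS`, and the damping constant `d=1/6` are assumptions chosen for illustration.

```python
import math

# Toy labeled dataset (hypothetical): each document is a set of terms plus a class label.
DOCS = [
    ({"goal", "match"}, "sport"),
    ({"goal", "team"}, "sport"),
    ({"goal", "budget"}, "finance"),
    ({"bank", "budget"}, "finance"),
]


def diaaf(term, label, docs):
    """DIA association factor, taken here as an estimate of P(class | term):
    among documents that contain the term, the fraction belonging to the class."""
    labels_with_term = [lab for terms, lab in docs if term in terms]
    if not labels_with_term:
        return 0.0
    return sum(1 for lab in labels_with_term if lab == label) / len(labels_with_term)


def relevancy_score(term, label, docs, d=1 / 6):
    """Relevancy score, taken here as log((P(term | class) + d) / (P(term | not class) + d)).
    The damping constant d is an assumption; it avoids division by zero and log(0)."""
    in_class = [terms for terms, lab in docs if lab == label]
    out_class = [terms for terms, lab in docs if lab != label]
    p_in = sum(term in terms for terms in in_class) / len(in_class) if in_class else 0.0
    p_out = sum(term in terms for terms in out_class) / len(out_class) if out_class else 0.0
    return math.log((p_in + d) / (p_out + d))


def rank_terms(label, docs, score):
    """Rank every term in the dataset for one class, highest score first."""
    vocabulary = set().union(*(terms for terms, _ in docs))
    return sorted(vocabulary, key=lambda t: score(t, label, docs), reverse=True)


if __name__ == "__main__":
    for name, fn in (("DIAAF", diaaf), ("RS", relevancy_score)):
        print(name, rank_terms("sport", DOCS, fn))
```

Both scores rely only on document counts, not on any linguistic analysis of the terms, which is what makes statistical TFS of this kind applicable regardless of the language of the text.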