Unsupervised feature selection for text data

  • Authors:
  • Nirmalie Wiratunga; Rob Lothian; Stewart Massie

  • Affiliations:
  • School of Computing, The Robert Gordon University, Aberdeen, Scotland, UK (all authors)

  • Venue:
  • ECCBR'06 Proceedings of the 8th European conference on Advances in Case-Based Reasoning
  • Year:
  • 2006

Abstract

Feature selection for unsupervised tasks is particularly challenging, especially with text data. The growth of online documents and email communication creates a need for tools that can operate without user supervision. In this paper we present novel feature selection techniques that address this need. A distributional similarity measure from information theory is applied to measure feature utility. This utility informs the search for both representative and diverse features in two complementary ways: Cluster partitions the entire feature space and then selects one feature to represent each cluster, while Greedy grows the feature subset one greedily selected feature at a time. In particular, we found that Greedy's local search is suited to learning smaller feature subsets, whereas Cluster improves the global quality of larger feature sets. Experiments with four email data sets show significant improvements in retrieval accuracy with nearest-neighbour search methods compared to an existing frequency-based method. Importantly, both Greedy and Cluster make significant progress towards the upper-bound performance set by a standard supervised feature selection method.
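
The abstract only outlines the two strategies, so the Python sketch below is an illustrative reconstruction rather than the authors' implementation. It assumes Jensen-Shannon divergence as the distributional similarity measure, treats each term's normalised occurrence profile over the documents as its distribution, and renders Cluster as medoid selection over an agglomerative clustering of features and Greedy as max-min diverse subset growth. The function names (feature_distributions, cluster_select, greedy_select), the toy data, and the numpy/scipy dependencies are all assumptions for illustration.

  # A minimal sketch, assuming Jensen-Shannon divergence as the distributional
  # similarity measure; not the authors' implementation.
  import numpy as np
  from scipy.spatial.distance import jensenshannon, pdist, squareform
  from scipy.cluster.hierarchy import linkage, fcluster


  def feature_distributions(term_doc_counts):
      """Normalise each row (term) into a distribution over documents."""
      counts = np.asarray(term_doc_counts, dtype=float)
      return counts / counts.sum(axis=1, keepdims=True)


  def cluster_select(dists, k):
      """'Cluster' strategy: partition all features into k clusters, then keep
      the medoid (most central member) of each cluster as its representative."""
      cond = pdist(dists, metric=jensenshannon)           # condensed pairwise JS distances
      d = squareform(cond)
      labels = fcluster(linkage(cond, method="average"), k, criterion="maxclust")
      selected = []
      for c in np.unique(labels):
          members = np.flatnonzero(labels == c)
          medoid = members[np.argmin(d[np.ix_(members, members)].sum(axis=1))]
          selected.append(int(medoid))
      return sorted(selected)


  def greedy_select(dists, k):
      """'Greedy' strategy (assumed max-min variant): grow the subset one
      feature at a time, adding the feature farthest from those already chosen."""
      d = squareform(pdist(dists, metric=jensenshannon))
      selected = [int(np.argmax(d.sum(axis=1)))]          # seed with the most distinctive feature
      while len(selected) < k:
          remaining = [i for i in range(d.shape[0]) if i not in selected]
          selected.append(max(remaining, key=lambda i: d[i, selected].min()))
      return sorted(selected)


  if __name__ == "__main__":
      rng = np.random.default_rng(0)
      term_doc = rng.integers(1, 6, size=(30, 100))       # toy 30-term x 100-document counts
      p = feature_distributions(term_doc)
      print("Cluster picks:", cluster_select(p, 5))
      print("Greedy picks: ", greedy_select(p, 5))

The split mirrors the behaviour reported in the abstract: the greedy routine makes a locally optimal choice at each step and so is cheap for small subset sizes, while the clustering routine considers the whole feature space at once and is better placed to control the global make-up of larger subsets.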