Unsupervised feature selection for text data

  • Authors:
  • Nirmalie Wiratunga; Rob Lothian; Stewart Massie

  • Affiliations:
  • School of Computing, The Robert Gordon University, Aberdeen, Scotland, UK (all authors)

  • Venue:
  • ECCBR'06 Proceedings of the 8th European conference on Advances in Case-Based Reasoning
  • Year:
  • 2006

Abstract

Feature selection for unsupervised tasks is particularly challenging, especially with text data. The growth of online documents and email communication creates a need for tools that can operate without user supervision. In this paper we present novel feature selection techniques that address this need. A distributional similarity measure from information theory is applied to measure feature utility. This utility informs the search for both representative and diverse features in two complementary ways: Cluster partitions the entire feature space and then selects one feature to represent each cluster, while Greedy grows the feature subset one greedily selected feature at a time. In particular, we found that Greedy's local search is suited to learning smaller feature subsets, whereas Cluster improves the global quality of larger feature sets. Experiments with four email data sets show significant improvements in retrieval accuracy with nearest-neighbour search methods compared to an existing frequency-based method. Importantly, both Greedy and Cluster make significant progress towards the upper-bound performance set by a standard supervised feature selection method.
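
The abstract only outlines the two strategies, so the Python sketch below is an illustrative reconstruction rather than the authors' implementation. It assumes Jensen-Shannon divergence as the distributional similarity measure, treats each term's normalised occurrence profile over the documents as its distribution, and renders Cluster as medoid selection over an agglomerative clustering of features and Greedy as max-min diverse subset growth. The function names (feature_distributions, cluster_select, greedy_select), the toy data, and the numpy/scipy dependencies are all assumptions for illustration.

  # A minimal sketch, assuming Jensen-Shannon divergence as the distributional
  # similarity measure; not the authors' implementation.
  import numpy as np
  from scipy.spatial.distance import jensenshannon, pdist, squareform
  from scipy.cluster.hierarchy import linkage, fcluster


  def feature_distributions(term_doc_counts):
      """Normalise each row (term) into a distribution over documents."""
      counts = np.asarray(term_doc_counts, dtype=float)
      return counts / counts.sum(axis=1, keepdims=True)


  def cluster_select(dists, k):
      """'Cluster' strategy: partition all features into k clusters, then keep
      the medoid (most central member) of each cluster as its representative."""
      cond = pdist(dists, metric=jensenshannon)           # condensed pairwise JS distances
      d = squareform(cond)
      labels = fcluster(linkage(cond, method="average"), k, criterion="maxclust")
      selected = []
      for c in np.unique(labels):
          members = np.flatnonzero(labels == c)
          medoid = members[np.argmin(d[np.ix_(members, members)].sum(axis=1))]
          selected.append(int(medoid))
      return sorted(selected)


  def greedy_select(dists, k):
      """'Greedy' strategy (assumed max-min variant): grow the subset one
      feature at a time, adding the feature farthest from those already chosen."""
      d = squareform(pdist(dists, metric=jensenshannon))
      selected = [int(np.argmax(d.sum(axis=1)))]          # seed with the most distinctive feature
      while len(selected) < k:
          remaining = [i for i in range(d.shape[0]) if i not in selected]
          selected.append(max(remaining, key=lambda i: d[i, selected].min()))
      return sorted(selected)


  if __name__ == "__main__":
      rng = np.random.default_rng(0)
      term_doc = rng.integers(1, 6, size=(30, 100))       # toy 30-term x 100-document counts
      p = feature_distributions(term_doc)
      print("Cluster picks:", cluster_select(p, 5))
      print("Greedy picks: ", greedy_select(p, 5))

The split mirrors the behaviour reported in the abstract: the greedy routine makes a locally optimal choice at each step and so is cheap for small subset sizes, while the clustering routine considers the whole feature space at once and is better placed to control the global make-up of larger subsets.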