Feature selection for unsupervised tasks is particularly challenging, especially when dealing with text data. The growth in online documents and email communication creates a need for tools that can operate without user supervision. In this paper we look at novel feature selection techniques that address this need. A distributional similarity measure from information theory is applied to measure feature utility. This utility informs the search for features that are both representative and diverse, in two complementary ways: Cluster first partitions the entire feature space and then selects one feature to represent each cluster, whereas Greedy grows the feature subset incrementally, adding one greedily selected feature at a time. In particular, we found that Greedy's local search is suited to learning smaller feature subsets, while Cluster is able to improve the global quality of larger feature sets. Experiments with four email data sets show significant improvement in retrieval accuracy with nearest-neighbour-based search methods compared to an existing frequency-based method. Importantly, both Greedy and Cluster make significant progress towards the upper-bound performance set by a standard supervised feature selection method.
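To make the Greedy idea more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not name the distributional similarity measure or the exact selection criterion, so the sketch assumes Jensen-Shannon divergence between per-feature probability distributions and a max-min rule that favours diversity only. The names `js_divergence` and `greedy_select`, and the toy distributions, are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Assumption: this stands in for the unspecified "distributional
    similarity measure from information theory" in the abstract.
    """
    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 whenever a > 0 here,
        # because b is always the mixture distribution m.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def greedy_select(dists, k):
    """Grow a feature subset one feature at a time (hypothetical criterion).

    dists maps a feature name to its probability distribution, for example
    over co-occurring terms. Each step adds the feature whose smallest
    divergence to the already selected features is largest.
    """
    names = sorted(dists)
    if k <= 0 or not names:
        return []

    # Seed with the feature that is, in total, most distant from the others.
    seed = max(
        names,
        key=lambda f: sum(js_divergence(dists[f], dists[g]) for g in names),
    )
    selected = [seed]

    while len(selected) < min(k, len(names)):
        remaining = [f for f in names if f not in selected]
        best = max(
            remaining,
            key=lambda f: min(js_divergence(dists[f], dists[s]) for s in selected),
        )
        selected.append(best)
    return selected


# Toy data: two groups of features with similar distributions.
toy = {
    "meeting": [0.9, 0.1, 0.0],
    "agenda": [0.8, 0.2, 0.0],
    "invoice": [0.0, 0.1, 0.9],
    "payment": [0.0, 0.2, 0.8],
}
chosen = greedy_select(toy, 2)
print(chosen)
```

With two features requested, the sketch picks one feature from each group of similar distributions, which is the kind of diversity the abstract describes. A Cluster-style variant would instead partition all features first and then keep one representative per partition.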