This paper addresses the problem of automatically structuring heterogeneous document collections by using clustering methods. In contrast to traditional clustering, we study restrictive methods and ensemble-based meta methods that may decide to leave out some documents rather than assigning them to inappropriate clusters with low confidence. These techniques yield higher cluster purity and better overall accuracy, and they make unsupervised self-organization more robust. Our comprehensive experimental studies on three different real-world data collections demonstrate these benefits. The proposed methods seem particularly suitable for automatically substructuring personal email folders or personal Web directories that are populated by focused crawlers, and they can be combined with supervised classification techniques.
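The idea of a restrictive method that abstains on low-confidence documents can be illustrated with a minimal sketch. It is not the paper's algorithm: the confidence criterion used here (the gap between the best and second-best cosine similarity to a cluster centroid), the function names `cosine` and `restrictive_assign`, and the `min_margin` parameter are all assumptions made for illustration, and the sketch assumes at least two centroids are given.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def restrictive_assign(docs, centroids, min_margin=0.2):
    """Return one cluster index per document, or None when abstaining.

    Hypothetical confidence rule: a document is assigned only if its best
    centroid beats the runner-up by at least min_margin in cosine similarity.
    """
    labels = []
    for doc in docs:
        # Rank centroids by similarity to this document, best first.
        sims = sorted(
            ((cosine(doc, c), i) for i, c in enumerate(centroids)),
            reverse=True,
        )
        (best, idx), (second, _) = sims[0], sims[1]
        # Leave the document out when the decision is too close to call.
        labels.append(idx if best - second >= min_margin else None)
    return labels


docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
labels = restrictive_assign(docs, centroids)
```

In this toy input the third document is equally similar to both centroids, so it is left unassigned instead of being forced into a cluster. Raising `min_margin` trades coverage (fewer documents assigned) for purity among the documents that are assigned.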