Restrictive clustering and metaclustering for self-organizing document collections

Authors:
Stefan Siersdorfer; Sergej Sizov
Affiliations:
Max Planck Institut fuer Informatik, Saarbruecken, Germany; Max Planck Institut fuer Informatik, Saarbruecken, Germany
Venue:
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2004



Abstract

This paper addresses the problem of automatically structuring heterogeneous document collections with clustering methods. In contrast to traditional clustering, we study restrictive methods and ensemble-based meta methods that may leave some documents unassigned rather than placing them in inappropriate clusters with low confidence. These techniques yield higher cluster purity and better overall accuracy, and they make unsupervised self-organization more robust. Our comprehensive experimental studies on three different real-world data collections demonstrate these benefits. The proposed methods seem particularly suitable for automatically substructuring personal email folders or personal Web directories that are populated by focused crawlers, and they can be combined with supervised classification techniques.
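
The idea of abstaining instead of forcing a low-confidence assignment can be illustrated with a minimal sketch. It is not the authors' algorithm: the function names `restrictive_assign` and `meta_assign`, the relative distance margin used as a confidence measure, and the thresholds `min_margin` and `min_agreement` are all assumptions made for illustration. The meta step also assumes that cluster labels from the individual methods have already been aligned, and that at least two centroids are given.

```python
import math
from collections import Counter


def restrictive_assign(doc, centroids, min_margin=0.2):
    """Return the index of the nearest centroid, or None to abstain.

    Hypothetical confidence measure: the relative gap between the
    distances to the nearest and second-nearest centroid.
    Requires at least two centroids.
    """
    # Euclidean distance from the document vector to every centroid.
    dists = sorted((math.dist(doc, c), i) for i, c in enumerate(centroids))
    (d1, best), (d2, _) = dists[0], dists[1]
    # Margin lies in [0, 1]: 0 is a tie, 1 means doc sits on a centroid.
    margin = (d2 - d1) / d2 if d2 > 0 else 0.0
    return best if margin >= min_margin else None


def meta_assign(votes, min_agreement=1.0):
    """Combine labels from several clusterings; None is an abstention.

    Assumes labels are already aligned across the clusterings.
    """
    cast = [v for v in votes if v is not None]
    if not cast:
        return None
    label, count = Counter(cast).most_common(1)[0]
    # Agreement is measured against all methods, so abstentions lower it.
    return label if count / len(votes) >= min_agreement else None


centroids = [(0.0, 0.0), (10.0, 0.0)]
docs = [(1.0, 0.0), (5.0, 0.0), (9.0, 1.0)]
labels = [restrictive_assign(d, centroids) for d in docs]
```

With these toy vectors, the document halfway between the two centroids is left out while the other two are assigned. Raising `min_margin` or `min_agreement` trades coverage for purity, which is the trade-off the abstract describes.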