This paper addresses the problem of supervised classification on document collections that also contain junk documents. By "junk documents" we mean documents that do not belong to any of the topic categories (classes) we are interested in. Documents of this type typically cannot be covered by the training set; nevertheless, in many real-world applications (e.g. classification of web or intranet content, focused crawling) they occur quite often, and a classifier has to make a decision about them. We tackle this problem with restrictive methods and ensemble-based meta methods that may decide to leave some documents unclassified rather than assign them to inappropriate classes with low confidence. Our experiments with four different data sets show that the proposed techniques can eliminate a relatively large fraction of junk documents while dismissing only a significantly smaller fraction of potentially interesting documents.
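
The idea of abstaining can be illustrated with a minimal sketch. It is not the paper's implementation: the confidence threshold, the unanimity rule for the ensemble, and the toy keyword classifiers (`keyword_classifier`, `clf_a`, `clf_b`) are all assumptions made here for illustration. A restrictive classifier returns no label when its confidence is low, and a meta classifier returns a label only when all base classifiers agree.

```python
from typing import Callable, Dict, Optional, Sequence, Set, Tuple

# Hypothetical interface: a base classifier maps a document to (label, confidence).
Classifier = Callable[[str], Tuple[str, float]]


def restrictive_predict(clf: Classifier, doc: str, threshold: float) -> Optional[str]:
    """Return the predicted label, or None (abstain) if confidence is below threshold."""
    label, confidence = clf(doc)
    return label if confidence >= threshold else None


def unanimous_meta_predict(
    classifiers: Sequence[Classifier], doc: str, threshold: float = 0.0
) -> Optional[str]:
    """Return a label only if every base classifier is confident and all agree."""
    votes = [restrictive_predict(clf, doc, threshold) for clf in classifiers]
    if not votes or votes[0] is None:
        return None
    return votes[0] if all(vote == votes[0] for vote in votes) else None


def keyword_classifier(keywords: Dict[str, Set[str]]) -> Classifier:
    """Toy stand-in for a trained model: confidence is the share of matching words."""
    def classify(doc: str) -> Tuple[str, float]:
        words = doc.lower().split()
        counts = {label: sum(w in kws for w in words) for label, kws in keywords.items()}
        best = max(counts, key=lambda label: counts[label])
        confidence = counts[best] / len(words) if words else 0.0
        return best, confidence
    return classify


# Two toy base classifiers with different (assumed) vocabularies.
clf_a = keyword_classifier({"sports": {"goal", "match"}, "finance": {"stock", "market"}})
clf_b = keyword_classifier({"sports": {"team"}, "finance": {"rally", "stock"}})
```

With these toy models, a document such as "stock market rally" is assigned to a class by both classifiers, whereas a document with no topical words falls below the threshold and is left out. Raising the threshold or requiring unanimity trades coverage of interesting documents for stronger rejection of junk, which is the trade-off the abstract describes.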