An analysis of the relative hardness of Reuters-21578 subsets: Research Articles

Authors:
Franca Debole;Fabrizio Sebastiani
Affiliations:
Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Via Giuseppe Moruzzi, 1, 56124 Pisa, Italy;Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Via Giuseppe Moruzzi, 1, 56124 Pisa, Italy
Venue:
Journal of the American Society for Information Science and Technology
Year:
2005

Citing 29
Cited 28

An evaluation of phrasal and clustered representations on a text categorization task

SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
Representation and learning in information retrieval

Representation and learning in information retrieval
Improving text retrieval for the routing problem using latent semantic indexing

SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Evaluating and optimizing autonomous text classification systems

SIGIR '95 Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval
Training algorithms for linear text classifiers

SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
Inductive learning algorithms and representations for text categorization

Proceedings of the seventh international conference on Information and knowledge management
Distributional clustering of words for text classification

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Making large-scale support vector machine learning practical

Advances in kernel methods
A re-examination of text categorization methods

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Text classification using ESC-based stochastic decision lists

Proceedings of the eighth international conference on Information and knowledge management
Text Classification from Labeled and Unlabeled Documents using EM

Machine Learning - Special issue on information retrieval
An improved boosting algorithm and its application to text categorization

Proceedings of the ninth international conference on Information and knowledge management
A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization

Text databases & document management
A study of thresholding strategies for text categorization

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
A meta-learning approach for text categorization

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Text classification in a hierarchical mixture model for small training sets

Proceedings of the tenth international conference on Information and knowledge management
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
Bayesian online classifiers for text classification and filtering

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
A new family of online algorithms for category ranking

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Probabilistic combination of text classifiers using reliability indicators: models and results

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Integrating External Knowledge to Supplement Training Data in Semi-Supervised Learning for Text Categorization

Information Retrieval
Text Categorization with Support Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
CONSTRUE/TIS: A System for Content-Based Indexing of a Database of News Stories

IAAI '90 Proceedings of the Second Conference on Innovative Applications of Artificial Intelligence
Using asymmetric distributions to improve text classifier probability estimates

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval
A maximal figure-of-merit learning approach to text categorization

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval
Support vector machine active learning with applications to text classification

The Journal of Machine Learning Research
Supervised term weighting for automated text categorization

Proceedings of the 2003 ACM symposium on Applied computing
RCV1: A New Benchmark Collection for Text Categorization Research

The Journal of Machine Learning Research
Discretizing continuous attributes in AdaBoost for text categorization

ECIR'03 Proceedings of the 25th European conference on IR research

Semi-supervised single-label text categorization using centroid-based classifiers

Proceedings of the 2007 ACM symposium on Applied computing
Towards a synthetic analysis of user's information need for more effective personalized filtering services

Proceedings of the 2007 ACM symposium on Applied computing
A study of local and global thresholding techniques in text categorization

AusDM '06 Proceedings of the fifth Australasian conference on Data mining and analytics - Volume 61
Evolving Lucene search queries for text classification

Proceedings of the 9th annual conference on Genetic and evolutionary computation
Semantic mapping and K-means applied to hybrid SOM-based document organization system construction

Proceedings of the 2008 ACM symposium on Applied computing
A quickly trainable hybrid SOM-based document organization system

Neurocomputing
Using Wavelets to Classify Documents

WI-IAT '08 Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
Incremental data-driven learning of a novelty detection model for one-class classification with application to high-dimensional noisy data

Machine Learning
Exploiting Category Information and Document Information to Improve Term Weighting for Text Categorization

CICLing '07 Proceedings of the 8th International Conference on Computational Linguistics and Intelligent Text Processing
Clinical text classification under the Open and Closed Topic Assumptions

International Journal of Data Mining and Bioinformatics
Immune Learning in a Dynamic Information Environment

ICARIS '09 Proceedings of the 8th International Conference on Artificial Immune Systems
Semi-supervised Text Classification Using RBF Networks

IDA '09 Proceedings of the 8th International Symposium on Intelligent Data Analysis: Advances in Intelligent Data Analysis VIII
An effective and robust method for short text classification

AAAI'08 Proceedings of the 23rd national conference on Artificial intelligence - Volume 3
Document clustering using unsupervised learning method: topology-preserving map

Proceedings of the International Conference and Workshop on Emerging Trends in Technology
On the relative hardness of clustering corpora

TSD'07 Proceedings of the 10th international conference on Text, speech and dialogue
Text categorization based on topic model

RSKT'08 Proceedings of the 3rd international conference on Rough sets and knowledge technology
A text categorization method based on local document frequency

FSKD'09 Proceedings of the 6th international conference on Fuzzy systems and knowledge discovery - Volume 7
Analytical evaluation of term weighting schemes for text categorization

Pattern Recognition Letters
Semantic Space models for classification of consumer webpages on metadata attributes

Journal of Biomedical Informatics
Exploiting word cluster information for unsupervised feature selection

PRICAI'10 Proceedings of the 11th Pacific Rim international conference on Trends in artificial intelligence
A multiclass/multilabel document categorization system: Combining multiple classifiers in a reduced dimension

Applied Soft Computing
A new nearest neighbor rule for text categorization

CIARP'06 Proceedings of the 11th Iberoamerican conference on Progress in Pattern Recognition, Image Analysis and Applications
Comparison of term frequency and document frequency based feature selection metrics in text categorization

Expert Systems with Applications: An International Journal
Using the absolute difference of term occurrence probabilities in binary text categorization

Applied Intelligence
On the assessment of text corpora

NLDB'09 Proceedings of the 14th international conference on Applications of Natural Language to Information Systems
A term association translation model for naive bayes text classification

PAKDD'12 Proceedings of the 16th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part I
Learning to classify service data with latent semantics

RSKT'12 Proceedings of the 7th international conference on Rough Sets and Knowledge Technology
Nonlinear transformation of term frequencies for term weighting in text categorization

Engineering Applications of Artificial Intelligence

Abstract

The existence, public availability, and widespread acceptance of a standard benchmark for a given information retrieval (IR) task are beneficial to research on this task, because they allow different researchers to experimentally compare their own systems by comparing the results they have obtained on this benchmark. The Reuters-21578 test collection, together with its earlier variants, has been such a standard benchmark for the text categorization (TC) task throughout the last 10 years. However, the benefits that this has brought about have been somewhat limited by the fact that different researchers have “carved” different subsets out of this collection and tested their systems on one of these subsets only; systems that have been tested on different Reuters-21578 subsets are thus not readily comparable. In this article, we present a systematic, comparative experimental study of the three subsets of Reuters-21578 that have been most popular among TC researchers. The results we obtain allow us to determine the relative hardness of these subsets, thus establishing an indirect means for comparing TC systems that have been, or will be, tested on these different subsets. © 2005 Wiley Periodicals, Inc.