Preferential text classification: learning algorithms and evaluation measures

Authors:
Fabio Aiolli;Riccardo Cardin;Fabrizio Sebastiani;Alessandro Sperduti
Affiliations:
Dipartimento di Matematica Pura e Applicata, Università di Padova, Padova, Italy 63-35121;Dipartimento di Matematica Pura e Applicata, Università di Padova, Padova, Italy 63-35121;Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy 1-56124;Dipartimento di Matematica Pura e Applicata, Università di Padova, Padova, Italy 63-35121
Venue:
Information Retrieval
Year:
2009

Abstract

In many application contexts in which textual documents are labelled with thematic categories, a distinction is made between the primary categories of a document, which represent the topics that are central to it, and its secondary categories, which represent topics that the document only touches upon. We contend that this distinction, so far neglected in text categorization research, is important and deserves to be explicitly tackled. The contribution of this paper is threefold. First, we propose an evaluation measure for this preferential text categorization task, whereby different kinds of misclassifications involving either primary or secondary categories have a different impact on effectiveness. Second, we establish several baseline results for this task on a well-known benchmark for patent classification in which the distinction between primary and secondary categories is present; these results are obtained by reformulating the preferential text categorization task in terms of well-established classification problems, such as single-label and/or multi-label multiclass classification, using state-of-the-art learning technology such as SVMs and kernel-based methods. Third, we improve on these results by using a recently proposed class of algorithms explicitly devised for learning from training data expressed in preferential form, i.e., in the form "for document d_i, category c' is preferred to category c''"; this allows us to distinguish between primary and secondary categories not only in the classification phase but also in the learning phase, thus differentiating their impact on the classifiers to be generated.
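To make the preferential form of the training data concrete, the following minimal sketch shows one way the statement "category c' is preferred to category c''" could be turned into category pairs for a single document. It is an illustration under stated assumptions, not the paper's algorithm or its evaluation measure: the function names `preference_pairs` and `violated_fraction` are hypothetical, the ordering "primary over secondary over all other categories" is an assumed reading of the task, and the fraction of violated pairs is a generic pairwise ranking error used here only as a stand-in.

```python
from itertools import product


def preference_pairs(primary, secondary, all_categories):
    """Build (preferred, less_preferred) category pairs for one document.

    Assumption (not taken from the paper): primary categories are preferred
    to secondary ones, and both are preferred to every remaining category.
    """
    others = [c for c in all_categories
              if c not in primary and c not in secondary]
    pairs = list(product(primary, secondary))   # primary over secondary
    pairs += list(product(primary, others))     # primary over non-categories
    pairs += list(product(secondary, others))   # secondary over non-categories
    return pairs


def violated_fraction(scores, pairs):
    """Fraction of preference pairs that a category scoring gets wrong.

    A pair (hi, lo) counts as violated when `hi` does not score strictly
    higher than `lo`. This is a generic pairwise error, used as a stand-in.
    """
    if not pairs:
        return 0.0
    violated = sum(1 for hi, lo in pairs if scores[hi] <= scores[lo])
    return violated / len(pairs)


# Hypothetical document with one primary and one secondary category.
categories = ["A", "B", "C", "D"]
pairs = preference_pairs(primary=["A"], secondary=["B"],
                         all_categories=categories)

# Hypothetical classifier scores: "C" is wrongly ranked above secondary "B".
scores = {"A": 0.9, "B": 0.2, "C": 0.4, "D": 0.1}
error = violated_fraction(scores, pairs)
```

In this toy case the document yields five preference pairs, and only the pair ("B", "C") is violated. A measure in the spirit of the abstract would additionally weight violations differently depending on whether they involve a primary or a secondary category; the sketch treats all pairs equally.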