Large-scale hierarchical text classification without labelled data

Authors:
Viet Ha-Thuc;Jean-Michel Renders
Affiliations:
The University of Iowa, Iowa City, IA, USA;Xerox Research Centre Europe, Meylan, France
Venue:
Proceedings of the fourth ACM international conference on Web search and data mining
Year:
2011

Abstract

Traditional machine learning approaches to text classification usually require labelled data to train classifiers. In large-scale classification involving thousands of categories, however, creating such labelled data is extremely expensive because the data is typically labelled manually. Motivated by this, we propose a novel approach to large-scale hierarchical text classification that does not require any labelled data. We explore a perspective in which the meaning of a category is defined not by human-labelled documents but by its description and, more importantly, its relationships with other categories (e.g. its ancestors and descendants). Specifically, we take advantage of this ontological knowledge in every phase of the process: when retrieving pseudo-labelled documents, when iteratively training the category models, and when categorizing test documents. Our experiments use a taxonomy of 1131 categories that is widely adopted in the news industry as a standard for the NewsML framework; they demonstrate the effectiveness of our approach in each of these phases, both qualitatively and quantitatively. In particular, we emphasize that using only the simple ontological knowledge defined in the category hierarchy, we could automatically build a large-scale hierarchical classifier with a reasonable performance of 67% in terms of the hierarchy-based F1 measure.
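
The abstract reports its result as a hierarchy-based F1 score but does not define it. As a rough illustration, the sketch below implements one common variant: each true and predicted category is expanded with its ancestors, and precision and recall are computed over the pooled sets. This is an assumption about the measure, not necessarily the definition used in the paper. The toy taxonomy `PARENT` and the function names are hypothetical and stand in for the 1131-category hierarchy.

```python
# Hypothetical toy taxonomy: child -> parent (None marks a top-level category).
PARENT = {
    "sport": None,
    "soccer": "sport",
    "tennis": "sport",
    "politics": None,
    "election": "politics",
}


def with_ancestors(category, parent):
    """Return the category together with all of its ancestors."""
    expanded = set()
    while category is not None:
        expanded.add(category)
        category = parent.get(category)
    return expanded


def hierarchical_f1(true_labels, predicted_labels, parent):
    """Micro-averaged hierarchical F1 over ancestor-augmented label sets.

    Assumed variant: a prediction earns partial credit for every
    ancestor it shares with the true category.
    """
    overlap = predicted_total = true_total = 0
    for true_cat, pred_cat in zip(true_labels, predicted_labels):
        true_set = with_ancestors(true_cat, parent)
        pred_set = with_ancestors(pred_cat, parent)
        overlap += len(true_set & pred_set)
        predicted_total += len(pred_set)
        true_total += len(true_set)
    if overlap == 0:
        return 0.0
    precision = overlap / predicted_total
    recall = overlap / true_total
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    truth = ["soccer", "election"]
    predicted = ["tennis", "election"]  # first document lands in a sibling category
    print(hierarchical_f1(truth, predicted, PARENT))
```

Under this variant, predicting a sibling category ("tennis" instead of "soccer") still earns credit for the shared parent "sport", whereas flat F1 would count it as a complete miss. That partial credit is the usual motivation for hierarchy-aware evaluation on large taxonomies.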