A Probabilistic Framework for the Hierarchic Organisation and Classification of Document Collections

Authors:
Alexei Vinokourov; Mark Girolami
Affiliations:
Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, TW20 0EX, UK. alexei@cs.rhul.ac.uk; School of Communication and Information Technologies, University of Paisley, High Street, Paisley, PA1 2BE, UK. mark.girolami@paisley.ac.uk
Venue:
Journal of Intelligent Information Systems
Year:
2002

Abstract

This paper presents a probabilistic mixture modeling framework for the hierarchic organisation of document collections. It is demonstrated that the probabilistic corpus model which emerges from the automatic (unsupervised) hierarchical organisation of a document collection can be further exploited to create a kernel that boosts the performance of state-of-the-art support vector machine document classifiers. The performance of such a classifier improves further when the kernel is derived from an appropriate hierarchic mixture model of the corpus rather than from a flat, non-hierarchic mixture model. This has important implications for document classification when a hierarchic ordering of topics exists. The approach can be viewed as an effective combination of unlabeled documents (those with no topic or class labels), labeled documents, and prior domain knowledge (the known hierarchic structure) to enhance document classification performance.