Hierarchical classification of web documents by stratified discriminant analysis

Authors:
Juan Carlos Gomez; Marie-Francine Moens
Affiliations:
Department of Computer Science, Katholieke Universiteit Leuven, Heverlee, Belgium;Department of Computer Science, Katholieke Universiteit Leuven, Heverlee, Belgium
Venue:
IRFC'12 Proceedings of the 5th conference on Multidisciplinary Information Retrieval
Year:
2012

Abstract

In this work we present and evaluate a methodology for classifying web documents into a predefined hierarchy using the textual content of the documents. Hierarchical classification with taxonomies of thousands of categories is a hard task because training data are scarce. It is one of the rare situations where a large amount of available data does not help: as more documents become available, more classes are also added to the hierarchy. This leaves most categories with too few training examples, which produces poor individual classification models and tends to bias the classification toward dense categories. Here we propose a novel feature extraction technique called Stratified Discriminant Analysis (sDA) that reduces the dimensions of the text-content features of the web documents along the different levels of the hierarchy. The sDA model is intended to reduce the effects of data scarcity by better grouping and identifying the categories with few training examples, leading to more robust classification models for those categories. The results of classifying web pages from the Kids&Teens branch of the DMOZ directory show that our model extracts features that are well suited for grouping web pages by category and for representing categories with few training examples.
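
The scarcity problem described in the abstract can be made concrete with a small sketch. The code below is not the sDA algorithm, whose details are not given here; it only counts training documents per category at each level of a hierarchy, to show how sparse leaf categories become better populated once they are grouped under their ancestors. The function name `counts_by_level` and the toy category paths are hypothetical and were chosen for illustration only.

```python
from collections import Counter, defaultdict


def counts_by_level(paths):
    """Count training documents per category prefix at every hierarchy level.

    Each path is a slash-separated category label; level 1 is the root.
    """
    levels = defaultdict(Counter)
    for path in paths:
        parts = path.split("/")
        for depth in range(1, len(parts) + 1):
            levels[depth]["/".join(parts[:depth])] += 1
    return dict(levels)


# Hypothetical toy labels, not taken from the paper's data.
docs = [
    "Kids_and_Teens/School_Time/Science",
    "Kids_and_Teens/School_Time/Science",
    "Kids_and_Teens/School_Time/Math",
    "Kids_and_Teens/Games/Puzzles",
]

levels = counts_by_level(docs)
for depth in sorted(levels):
    print(depth, dict(levels[depth]))
```

In this toy data, a leaf category with a single document shares a level-2 parent with better-populated siblings, which is the kind of grouping along levels that a stratified approach could exploit.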