In this work we present and evaluate a methodology for classifying web documents into a predefined hierarchy using their textual content. Hierarchical classification over taxonomies with thousands of categories is hard because training data are scarce. It is one of the rare settings where more data does not solve the problem: as more documents become available, more classes are also added to the hierarchy. Most categories are therefore left with few training examples, which produces poor individual classification models and tends to bias classification toward dense categories. Here we propose a novel feature extraction technique called Stratified Discriminant Analysis (sDA) that reduces the dimensionality of the text-content features of web documents along the different levels of the hierarchy. The sDA model is intended to reduce the effects of data scarcity by better grouping and identifying the categories with few training examples, leading to more robust classification models for those categories. Results from classifying web pages in the Kids&Teens branch of the DMOZ directory show that our model extracts features that are well suited to grouping web pages by category and to representing categories with few training examples.
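The scarcity problem described above can be made concrete with a small sketch. It is not the sDA method itself; it only illustrates, on a hypothetical toy corpus, how the number of training documents per category shrinks as labels are taken from deeper levels of a hierarchy. The category paths and the helper names `labels_at_level` and `class_sizes_per_level` are assumptions made for illustration.

```python
from collections import Counter


def labels_at_level(path, level):
    """Truncate a category path to its first `level` components."""
    parts = path.split("/")
    return "/".join(parts[:level])


def class_sizes_per_level(paths, max_level):
    """Count training documents per category at each hierarchy level."""
    sizes = {}
    for level in range(1, max_level + 1):
        # Documents filed above this level have no label this deep, so skip them.
        sizes[level] = Counter(
            labels_at_level(p, level)
            for p in paths
            if len(p.split("/")) >= level
        )
    return sizes


# Hypothetical toy corpus: one category path per document.
docs = [
    "School_Time/Science/Astronomy",
    "School_Time/Science/Biology",
    "School_Time/Math",
    "Games/Puzzles",
    "Games/Puzzles",
    "Games/Online",
]

sizes = class_sizes_per_level(docs, max_level=3)
for level, counts in sizes.items():
    print(level, dict(counts))
```

In this toy data the top level has two categories with three documents each, while the third level has categories with a single document. Grouping sparse categories under their ancestors, level by level, is one way to see why a feature extraction step organized along the hierarchy might help.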