Classifying web documents in a hierarchy of categories: a comprehensive study

Authors:
Michelangelo Ceci;Donato Malerba
Affiliations:
Dipartimento di Informatica, Università degli Studi di Bari, Bari, Italy 70126;Dipartimento di Informatica, Università degli Studi di Bari, Bari, Italy 70126
Venue:
Journal of Intelligent Information Systems
Year:
2007



Abstract

Most research on text categorization has focused on classifying text documents into a set of categories with no structural relationships among them (flat classification). However, in many information repositories documents are organized in a hierarchy of categories to support thematic search by browsing topics of interest. Taking the hierarchical relationships among categories into account raises several additional issues in the development of methods for automated document classification. These issues concern the representation of documents, the learning process, the classification process and the evaluation criteria for experimental results. They are systematically investigated in this paper, whose main contribution is a general hierarchical text categorization framework in which the hierarchy of categories is involved in all phases of automated document classification, namely feature selection, learning and classification of a new document. An automated threshold determination method for classification scores is embedded in the proposed framework; it can be applied to any classifier that returns a degree of membership of a document in a category. In this work three learning methods are considered for the construction of document classifiers, namely centroid-based, naïve Bayes and SVM. The proposed framework has been implemented in the system WebClassIII and has been tested on three datasets (Yahoo, DMOZ, RCV1) which present a variety of situations in terms of hierarchical structure. Experimental results are reported and several conclusions are drawn from the comparison of the flat vs. the hierarchical approach as well as from the comparison of different hierarchical classifiers. The paper concludes with a review of related work and a discussion of previous findings vs. our findings.
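
To make the idea of threshold-based hierarchical classification concrete, the following is a minimal illustrative sketch, not the WebClassIII implementation. It assumes a centroid-based scorer using cosine similarity, a greedy top-down descent that follows the best-scoring child, and hand-set thresholds (the paper determines thresholds automatically; that procedure is not reproduced here). The names `Category`, `cosine`, and `classify`, as well as the toy hierarchy and term weights, are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Category:
    name: str
    centroid: Dict[str, float] = field(default_factory=dict)  # term -> weight
    threshold: float = 0.0  # hand-set here; an assumption of this sketch
    children: List["Category"] = field(default_factory=list)


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(doc: Dict[str, float], root: Category) -> List[str]:
    """Descend from the root, moving to the best-scoring child only while
    its score reaches that child's threshold; otherwise stop at the
    current (possibly internal) category."""
    node, path = root, [root.name]
    while node.children:
        best = max(node.children, key=lambda c: cosine(doc, c.centroid))
        if cosine(doc, best.centroid) < best.threshold:
            break
        node = best
        path.append(node.name)
    return path


# Toy hierarchy with made-up centroids and thresholds.
root = Category("root", children=[
    Category("science", {"physics": 1.0, "biology": 1.0}, 0.3, [
        Category("physics", {"physics": 1.0, "quantum": 1.0}, 0.6),
        Category("biology", {"biology": 1.0, "cell": 1.0}, 0.6),
    ]),
    Category("sports", {"football": 1.0, "tennis": 1.0}, 0.3),
])

specific_doc = {"quantum": 2.0, "physics": 1.0}
general_doc = {"physics": 1.0, "biology": 1.0}

print(classify(specific_doc, root))  # ['root', 'science', 'physics']
print(classify(general_doc, root))   # ['root', 'science']
```

In this sketch the second document stops at the internal category `science` because neither child scores above its threshold, which illustrates how thresholds let a hierarchical classifier assign documents to internal nodes rather than forcing them down to a leaf.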