Clustering documents in a web directory

Authors:
Giordano Adami; Paolo Avesani; Diego Sona
Affiliations:
ITC-irst, Povo, Italy; ITC-irst, Povo, Italy; ITC-irst, Povo, Italy
Venue:
WIDM '03 Proceedings of the 5th ACM international workshop on Web information and data management
Year:
2003

Abstract

Hierarchical categorization of documents is receiving growing interest because of the proliferation of topic hierarchies for text documents. The main drawback of hierarchical supervised classifiers is their high demand for labeled examples, the number of which is related to the number of topics in the taxonomy. Bootstrapping a large hierarchy with a suitable set of labeled examples is therefore a critical issue. In this paper, we propose several solutions to the bootstrapping problem that use the taxonomy definition implicitly or explicitly: a baseline approach, in which documents are classified according to the class labels, and two clustering approaches, in which training is constrained by a priori knowledge of the taxonomy structure at both the terminological and the topological level. In particular, we propose the TaxSOM model, which clusters a set of documents into a predefined hierarchy of classes, directly exploiting knowledge of both their topological organization and their lexical description. The experimental evaluation was performed on a set of taxonomies taken from the Google Web directory.
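
The abstract does not spell out how the baseline works, so the following is only a minimal sketch of one plausible reading of "documents are classified according to class labels": each document goes to the taxonomy node whose label words it matches best under cosine similarity of raw term counts. The toy taxonomy, the tokenizer, and the names `tokenize`, `cosine`, and `classify_by_labels` are all hypothetical and are not taken from the paper. The sketch does not implement TaxSOM and ignores the topological constraints the clustering approaches rely on.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on any non-letter character."""
    tokens, word = [], []
    for ch in text.lower():
        if ch.isalpha():
            word.append(ch)
        elif word:
            tokens.append("".join(word))
            word = []
    if word:
        tokens.append("".join(word))
    return tokens


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_by_labels(document, taxonomy):
    """Return the node whose label keywords best match the document.

    `taxonomy` maps a node path to a string of label keywords (an
    assumed, simplified stand-in for a real directory's lexical
    description). Returns None when no label word occurs in the document.
    """
    doc_vec = Counter(tokenize(document))
    scores = {
        node: cosine(doc_vec, Counter(tokenize(label)))
        for node, label in taxonomy.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical toy taxonomy; a real web directory would be far larger.
TAXONOMY = {
    "Science/Biology": "science biology genetics cells",
    "Science/Physics": "science physics quantum particles",
    "Sports/Soccer": "sports soccer football league",
}

print(classify_by_labels("Quantum particles and physics experiments", TAXONOMY))
```

Documents that share no word with any label stay unassigned (`None`) rather than being forced into a class. This illustrates why a purely label-based baseline can leave much of a collection unlabeled.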