A probabilistic description-oriented approach for categorizing web documents

Authors:
Norbert Gövert;Mounia Lalmas;Norbert Fuhr
Affiliations:
University of Dortmund;Department of Computer Science, Queen Mary & Westfield College, University of London and University of Dortmund;University of Dortmund
Venue:
Proceedings of the eighth international conference on Information and knowledge management
Year:
1999

Abstract

The automatic categorisation of web documents is becoming crucial for organising the huge amount of information available on the Internet. This poses a new challenge because web documents have a rich structure and are highly heterogeneous. Two ways to respond to this challenge are (1) using a representation of the content of web documents that captures these two characteristics and (2) using more effective classifiers.

Our categorisation approach is based on a probabilistic description-oriented representation of web documents, and a probabilistic interpretation of the k-nearest neighbour classifier. With the former, we provide an enhanced document representation that incorporates the structural and heterogeneous nature of web documents. With the latter, we provide a theoretically sound justification for the various parameters of the k-nearest neighbour classifier.

Experimental results show that (1) using an enhanced representation of web documents is crucial for effective categorisation, and (2) a theoretical interpretation of the k-nearest neighbour classifier yields an improvement over the standard k-nearest neighbour classifier.
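
To make the idea of a k-nearest neighbour classifier with probability-like outputs concrete, the following is a minimal sketch of a generic similarity-weighted variant. It is not the formulation from the paper: the cosine similarity, the normalisation of votes by the total similarity of the k neighbours, and the names `cosine` and `knn_category_scores` are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_category_scores(query, training, k=3):
    """Score each category by similarity-weighted votes of the k nearest neighbours.

    training: list of (term_weights, set_of_categories) pairs.
    Each vote is divided by the summed similarity of the k neighbours,
    so every score lies in [0, 1] (an illustrative choice, not the paper's).
    """
    # Rank all training documents by similarity to the query, keep the top k.
    neighbours = sorted(
        ((cosine(query, doc), cats) for doc, cats in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0.0:
        return {}  # no neighbour shares any term with the query
    scores = defaultdict(float)
    for sim, cats in neighbours:
        for category in cats:
            scores[category] += sim / total
    return dict(scores)
```

A document would then be assigned to every category whose score exceeds a chosen threshold. Both `k` and the threshold are free parameters in this sketch, which is the kind of choice a probabilistic interpretation aims to justify.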