Tree-Based Method for Classifying Websites Using Extended Hidden Markov Models

Authors:
Majid Yazdani;Milad Eftekhar;Hassan Abolhassani
Affiliations:
Web Intelligence Laboratory, Computer Engineering Department, Sharif University of Technology, Tehran, Iran;Web Intelligence Laboratory, Computer Engineering Department, Sharif University of Technology, Tehran, Iran;Web Intelligence Laboratory, Computer Engineering Department, Sharif University of Technology, Tehran, Iran
Venue:
PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Year:
2009

Abstract

Website classification is an important problem recently posed in the field of web mining. Its complexity, together with the need for accurate and fast algorithms, has led to many attempts in this field, but the problem is still far from being solved efficiently. The importance of the problem encouraged us to develop a new approach. We use the content of web pages together with the link structure between them to improve the accuracy of the results. In this work, a naïve Bayes model is used for each predefined webpage class, and an extended version of the hidden Markov model is used as the model for each website class. A few sample websites are adopted as seeds to estimate the models' parameters. To classify websites, we represent them as tree structures and modify the Viterbi algorithm to evaluate the probability that each website model generates these trees. Because a website can contain a large number of pages, we use a sampling technique that not only reduces the running time of the algorithm but also improves the accuracy of the classification process. At the end of the paper, we present experimental results that compare the performance of our algorithm with that of previous ones.
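
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the hidden Markov model is extended or how Viterbi is modified, so the code below assumes a simple tree-structured model in which each page has a hidden page class, each child's class depends only on its parent's class, and page text is scored by a naïve Bayes model with add-one smoothing. All function names (`train_naive_bayes`, `tree_viterbi`, `classify_site`) and all parameters are hypothetical, and the sampling step is omitted.

```python
import math
from collections import Counter


def train_naive_bayes(docs_by_class, alpha=1.0):
    """docs_by_class: {page_class: [token list, ...]} -> per-class log word probabilities."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values()) + alpha * len(vocab)
        # Laplace-smoothed log P(word | page class)
        model[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return model


def emission_logp(model, page_class, tokens):
    # log P(tokens | page_class); words outside the vocabulary are ignored (assumption)
    table = model[page_class]
    return sum(table[w] for w in tokens if w in table)


def tree_viterbi(tree, root, tokens, nb, log_init, log_trans):
    """Log score of the best page-class labelling of a site tree (max-product).

    tree: {page: [child pages]}, tokens: {page: [tokens]},
    log_init[c]: log P(root class = c),
    log_trans[p][c]: log P(child class = c | parent class = p).
    """
    classes = list(log_init)

    def up(node):
        # best[c]: best log score of the subtree rooted at `node` if node has class c
        best = {c: emission_logp(nb, c, tokens[node]) for c in classes}
        for child in tree.get(node, []):
            child_best = up(child)
            for c in classes:
                best[c] += max(log_trans[c][k] + child_best[k] for k in classes)
        return best

    root_best = up(root)
    return max(log_init[c] + root_best[c] for c in classes)


def classify_site(tree, root, tokens, nb, site_models):
    # site_models: {site_class: (log_init, log_trans)}; pick the highest-scoring model
    scores = {
        s: tree_viterbi(tree, root, tokens, nb, log_init, log_trans)
        for s, (log_init, log_trans) in site_models.items()
    }
    return max(scores, key=scores.get), scores
```

In this sketch, a website is assigned to the website class whose model gives the highest score to the site's page tree. The recursive upward pass is adequate for small trees; a large site would need an iterative traversal, and the sampling technique mentioned in the abstract would decide which pages enter the tree in the first place.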