The large volume of diverse Web-based information calls for automated, subject-specific Web site classification techniques. Because Web sites are inherently heterogeneous, multi-structured and often noisy, classification algorithms must remain effective in the presence of noise and heterogeneity. In this paper, we propose a novel approach to Web site classification based on the content, structure and context information of Web sites. In our approach, the site structure is represented as a two-layered tree: each page is modeled as a DOM (Document Object Model) tree, and a page tree hierarchically links all pages within the site. Two context models are formulated to characterize the topical dependencies between nodes in the two-layered tree. Using the Hidden Markov Tree (HMT) as the statistical model of both page trees and DOM trees, we present a two-phase Web site classification algorithm. Moreover, to further improve accuracy while reducing classification overhead, a two-stage denoising procedure removes noise within sites, and an entropy-based strategy dynamically prunes the page trees. Experiments demonstrate that the proposed approach offers high accuracy and efficient processing.
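As a rough illustration of the two-layered tree and of entropy-based pruning, the following minimal Python sketch may help. It is not the paper's algorithm: the abstract does not state the pruning criterion, so the sketch assumes that a page subtree is dropped when the topic distribution at its root has Shannon entropy above a threshold. The names `DomNode`, `PageNode`, `topic_probs`, `prune` and `max_entropy` are all hypothetical, and the per-page topic probabilities are supplied by hand rather than inferred by an HMT.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class DomNode:
    """Inner layer: one node of a page's DOM tree."""
    tag: str
    text: str = ""
    children: list[DomNode] = field(default_factory=list)


@dataclass
class PageNode:
    """Outer layer: one page of the site; children are pages linked beneath it."""
    url: str
    dom: DomNode
    topic_probs: dict[str, float]  # assumed per-page topic distribution (hypothetical)
    children: list[PageNode] = field(default_factory=list)


def entropy(probs: dict[str, float]) -> float:
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def prune(page: PageNode, max_entropy: float) -> PageNode:
    """Copy the page tree, dropping child subtrees whose root is too uncertain.

    Assumed criterion: a child is kept only if the entropy of its topic
    distribution is at most max_entropy. The root itself is always kept.
    """
    kept = [
        prune(child, max_entropy)
        for child in page.children
        if entropy(child.topic_probs) <= max_entropy
    ]
    return PageNode(page.url, page.dom, page.topic_probs, kept)


def count_pages(page: PageNode) -> int:
    """Number of pages in the page tree rooted at page."""
    return 1 + sum(count_pages(child) for child in page.children)


# Toy site: "/misc" is maximally uncertain (entropy 1 bit), so it and the
# page beneath it are pruned when the threshold is 0.8 bits.
home = PageNode("/", DomNode("html"), {"sports": 0.9, "news": 0.1}, [
    PageNode("/scores", DomNode("html"), {"sports": 0.95, "news": 0.05}),
    PageNode("/misc", DomNode("html"), {"sports": 0.5, "news": 0.5}, [
        PageNode("/misc/a", DomNode("html"), {"sports": 0.99, "news": 0.01}),
    ]),
])
pruned = prune(home, max_entropy=0.8)
print(count_pages(home), count_pages(pruned))
```

In this toy setup, pruning a subtree also discards its descendants regardless of how confident they are, which is one way pruning could reduce the number of pages a classifier has to visit; whether the paper makes the same trade-off is not stated in the abstract.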