Previous work in hypertext classification has produced two principal approaches for incorporating information about the graph properties of the Web into the training of a classifier. The first uses the complete text of the neighboring pages, whereas the second uses only their class labels. In this paper, we argue that both approaches are unsatisfactory: the first brings in too much irrelevant information, while the second is too coarse because it abstracts an entire page into a single class label. We argue that one should instead focus on the relevant parts of predecessor pages, namely the region around the origin of an incoming link. To this end, we investigate different ways of extracting such features and compare several techniques for using them in a text classifier.
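To make the idea of "the region in the neighborhood of the origin of an incoming link" concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the neighborhood is the anchor text plus a fixed window of words on either side of it. The names `LinkContextParser` and `link_neighborhood_features`, the `window` parameter, and the exact-match comparison of `href` values are all illustrative assumptions; the paper itself compares several extraction schemes that are not reproduced here.

```python
import re
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collects the words of a predecessor page and records where
    links to a given target URL occur (hypothetical helper)."""

    def __init__(self, target_url):
        super().__init__()
        self.target_url = target_url
        self.tokens = []         # all words of the page, in document order
        self.anchor_spans = []   # (start, end) token indices of matching anchors
        self._open_start = None  # token index where the current matching <a> began

    def handle_starttag(self, tag, attrs):
        # Assumption: a link points to the target if its href matches exactly.
        if tag == "a" and dict(attrs).get("href") == self.target_url:
            self._open_start = len(self.tokens)

    def handle_endtag(self, tag):
        if tag == "a" and self._open_start is not None:
            self.anchor_spans.append((self._open_start, len(self.tokens)))
            self._open_start = None

    def handle_data(self, data):
        # Lowercase and split into word tokens, dropping punctuation.
        self.tokens.extend(re.findall(r"\w+", data.lower()))


def link_neighborhood_features(html, target_url, window=3):
    """Return, for each link to target_url, the anchor words and up to
    `window` words immediately before and after the anchor."""
    parser = LinkContextParser(target_url)
    parser.feed(html)
    parser.close()
    features = []
    for start, end in parser.anchor_spans:
        lo = max(0, start - window)
        hi = min(len(parser.tokens), end + window)
        features.append({
            "anchor": parser.tokens[start:end],
            "before": parser.tokens[lo:start],
            "after": parser.tokens[end:hi],
        })
    return features
```

Under these assumptions, the word lists returned for each predecessor page could be fed to a text classifier in place of the page's full text or its class label. A small window keeps the features close to the link, which is the trade-off the abstract argues for; the right window size, and whether headings or paragraph boundaries should be used instead, are open choices in this sketch.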