Finding and classifying web units in websites

Authors:
Aixin Sun; Ee-Peng Lim
Affiliations:
School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, 639798, Singapore; School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, 639798, Singapore
Venue:
International Journal of Business Intelligence and Data Mining
Year:
2005

Abstract

In web classification, most researchers assume that the objects to be classified are individual web pages from one or more websites. In practice, this assumption is too restrictive, since a single web page may not carry enough information to be treated as an instance of a semantic class or concept. In this paper, we relax the assumption and allow a subgraph of web pages to represent an instance of a semantic concept. Such a subgraph is known as a web unit. To construct and classify web units, we formulate the web unit mining problem and propose an iterative web unit mining (iWUM) method. The iWUM method first finds subgraphs of web pages using knowledge about website structure and the connectivity among web pages. From these subgraphs, web units are then constructed and classified into categories in an iterative manner. Our experiments on the WebKB dataset showed that iWUM was able to construct and classify web units with high accuracy for the more structured parts of a website.
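The idea of treating a subgraph of pages as one object can be illustrated with a minimal sketch. It is not the iWUM procedure, whose details are not given in the abstract; it assumes a simplified heuristic in which two pages belong to the same candidate unit when they share a URL directory and are joined by a hyperlink. The function name `candidate_web_units`, the input format, and the example URLs are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse
import posixpath


def candidate_web_units(links):
    """Group pages into candidate web units (illustrative heuristic only).

    `links` maps each page URL to the list of URLs it links to.
    Assumption: pages form one candidate unit when they share a URL
    directory and are connected by a hyperlink in either direction.
    """
    def directory(url):
        parsed = urlparse(url)
        return parsed.netloc, posixpath.dirname(parsed.path)

    # Undirected adjacency map restricted to links within one directory.
    neighbours = defaultdict(set)
    for source, targets in links.items():
        neighbours[source]  # register the page even if it has no kept links
        for target in targets:
            if (target in links and target != source
                    and directory(source) == directory(target)):
                neighbours[source].add(target)
                neighbours[target].add(source)

    # Each connected component of that graph is one candidate unit.
    units, seen = [], set()
    for page in sorted(neighbours):
        if page in seen:
            continue
        stack, component = [page], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        units.append(sorted(component))
    return units


# Hypothetical four-page site: a home page with a publications page,
# and a course page with a syllabus.
site = {
    "http://cs.example.edu/~smith/index.html": [
        "http://cs.example.edu/~smith/pubs.html",
        "http://cs.example.edu/courses/cs101/index.html",
    ],
    "http://cs.example.edu/~smith/pubs.html": [],
    "http://cs.example.edu/courses/cs101/index.html": [
        "http://cs.example.edu/courses/cs101/syllabus.html",
    ],
    "http://cs.example.edu/courses/cs101/syllabus.html": [],
}
units = candidate_web_units(site)
```

In this sketch the cross-directory link from the home page to the course page is ignored, so the two course pages and the two personal pages end up as separate candidate units. Classifying each unit, and refining units iteratively as the abstract describes, would need a text classifier and is left out here.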