Web Documents Categorization Using Fuzzy Representation and HAC

Authors:
Jiawei Deng; Lihui Chen
Venue:
WISE '00: Proceedings of the First International Conference on Web Information Systems Engineering (WISE'00), Volume 2
Year:
2000

Abstract

Most existing techniques for characterizing Web documents are based on term-frequency analysis. In such models, each document in a collection is represented by a feature vector in a vector space. However, because Web documents written in HTML are semi-structured by their tags, traditional techniques that assign term weights solely by frequency of occurrence may not represent the contents of such documents satisfactorily. Some recent studies have shown that a fuzzy representation (FR) of WWW information, based on the significance of HTML tags, is an effective alternative for characterizing Web documents. In this paper, FR is used to generate the feature vector for each Web document, and the Hierarchical Agglomerative Clustering (HAC) algorithm is then applied to these vectors to investigate the efficiency and effectiveness of this combination for automatically categorizing Web documents with similar contents. The experiments conducted suggest several benefits of this approach.
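
The pipeline described in the abstract (tag-aware term weighting followed by HAC) can be illustrated with a minimal sketch. It is not the authors' implementation. The tag weights in TAG_WEIGHTS, the rule that a term's membership grade is the highest weight of any enclosing tag, the choice of cosine similarity with average linkage, and the names fuzzy_vector and hac are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical tag significance grades in [0, 1]; not taken from the paper.
TAG_WEIGHTS = {"title": 1.0, "h1": 0.8, "h2": 0.6, "b": 0.4}
DEFAULT_WEIGHT = 0.2  # assumed grade for text under any other tag


class TagWeightedTerms(HTMLParser):
    """Collects terms and grades each by its most significant enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []                    # currently open tags
        self.weights = defaultdict(float)  # term -> membership grade

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        grade = max((TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.stack),
                    default=DEFAULT_WEIGHT)
        for term in re.findall(r"[a-z]+", data.lower()):
            # Assumed rule: keep the highest grade seen for each term.
            self.weights[term] = max(self.weights[term], grade)


def fuzzy_vector(html):
    """Return a sparse term -> grade vector for one HTML document."""
    parser = TagWeightedTerms()
    parser.feed(html)
    parser.close()
    return dict(parser.weights)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def hac(vectors, k):
    """Average-linkage agglomerative clustering down to k clusters."""
    clusters = [[i] for i in range(len(vectors))]

    def sim(c1, c2):
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Find and merge the most similar pair of clusters.
        i, j = max(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


docs = [
    "<html><title>Neural networks</title><body><h1>Training</h1>"
    "<p>Backpropagation trains neural networks.</p></body></html>",
    "<html><title>Neural network tutorial</title><body>"
    "<p>A neural network learns weights.</p></body></html>",
    "<html><title>Pasta recipes</title><body>"
    "<p>Boil pasta and add sauce.</p></body></html>",
]
vectors = [fuzzy_vector(d) for d in docs]
clusters = hac(vectors, 2)
print(clusters)  # [[0, 1], [2]]
```

In this toy run the two neural-network pages merge first because they share the title term "neural", which carries the highest assumed grade, while the pasta page shares no terms with either and stays in its own cluster. A plain term-frequency vector would give that title term no special weight.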