Category cluster discovery from distributed WWW directories

Authors:
Mei-Ling Shyu;Choochart Haruechaiyasak;Shu-Ching Chen
Affiliations:
Department of Electrical and Computer Engineering, University of Miami, P.O. Box 248294, Coral Gables, FL;Department of Electrical and Computer Engineering, University of Miami, P.O. Box 248294, Coral Gables, FL;Distributed Multimedia Information System Laboratory, School of Computer Science, Florida International University, Miami, FL
Venue:
Information Sciences—Informatics and Computer Science: An International Journal - special issue: Knowledge discovery from distributed information sources
Year:
2003

Abstract

Because many networks, including the Internet, are inherently distributed, information and knowledge are generated and organized independently by different groups of people. Discovering and exploiting the knowledge held in these different sources usually requires a method of knowledge integration. Treating document category sets as information sources, we define an information integration problem called category merging. The purpose of category merging is to automatically construct a unified category set that represents and exploits document information from several different sources. The merging process is based on clustering: categories with similar characteristics are merged into the same cluster under certain distributed constraints. To evaluate the quality of the merged category set, we measure precision and recall under three classification methods: Naive Bayes, Vector Space Model, and K-Nearest Neighbor. In addition, we propose a performance measure called cluster entropy, which determines how well the categories from different sources are distributed over the resulting clusters. We perform the merging process on real data sets collected from three different Web directories. The results show that the merging process improves classification performance over the non-merged approach and also provides a better representation of all categories from the distributed directories.
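
The abstract names cluster entropy but does not give its formula. The sketch below is only one plausible reading, not the paper's definition: it assumes each merged cluster is described by the source directory of every category it contains, and it computes the size-weighted mean Shannon entropy of those source labels. The function name `cluster_entropy`, the directory labels "A", "B", "C", and the toy clusterings are all hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(clusters):
    """Size-weighted mean Shannon entropy (in bits) of source labels per cluster.

    `clusters` is a list of clusters; each cluster is a list holding the
    source-directory label of every category merged into it.
    NOTE: this formulation is an assumption made for illustration, not the
    definition used in the paper.
    """
    total = sum(len(members) for members in clusters)
    if total == 0:
        return 0.0
    weighted = 0.0
    for members in clusters:
        if not members:
            continue
        n = len(members)
        counts = Counter(members)
        # Entropy of the source distribution inside this cluster.
        h = -sum((k / n) * math.log2(k / n) for k in counts.values())
        # Weight by the share of all categories that fall in this cluster.
        weighted += (n / total) * h
    return weighted


# Hypothetical toy data: categories from three directories A, B, C.
mixed = [["A", "B", "C"], ["A", "B", "C"]]         # every cluster draws on all sources
separated = [["A", "A"], ["B", "B"], ["C", "C"]]   # each cluster holds a single source

print(cluster_entropy(mixed))
print(cluster_entropy(separated))
```

Under this assumed formulation, a higher value means that categories from the different directories are spread more evenly across the clusters, while a value of zero means every cluster contains categories from a single directory only.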