Improving density-based methods for hierarchical clustering of web pages

Authors:
Morteza Haghir Chehreghani;Hassan Abolhassani;Mostafa Haghir Chehreghani
Affiliations:
Faculty of CE, Sharif University of Technology, Tehran 1458889694, Iran;Faculty of CE, Sharif University of Technology, Tehran 1458889694, Iran and School of Computer Science, Institute for Studies in Theoretical Physics and Mathematics (IPM), Niavaran, Tehran 1458889 ...;Faculty of ECE, School of Engineering, University of Tehran, Tehran 1234561234, Iran
Venue:
Data & Knowledge Engineering
Year:
2008

Abstract

The rapid growth of information on the web makes improved information management techniques necessary, and clustering web data is one of the most important of these techniques. In this paper, we propose a new three-phase clustering method. The method first finds dense units in a data set using density-based algorithms, and the distances within the dense units are stored in sorted order in a structure such as a min-heap. In the extraction stage, these distances are extracted one by one, and their effect on the clustering process is examined. Finally, in the combination stage, clustering is completed using improved versions of the well-known single-linkage and average-linkage methods. All steps of the methods run in O(n log n) time. The proposed methods have low complexity, and experimental results show that they generate high-quality clusters. Further experiments show that they provide additional advantages, such as clustering by sampling.
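
The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm; it only shows the general idea of storing short distances in a min-heap, extracting them one by one, and merging clusters in single-linkage fashion. The function names, the Euclidean distance, the `eps` radius used to decide which pairs count as "dense", and the stopping parameter `k` are all assumptions introduced here for illustration. The pair enumeration below is brute force (quadratic); reaching O(n log n) would require a spatial index, which this sketch omits.

```python
import heapq
from itertools import combinations
from math import dist


def dense_pair_heap(points, eps):
    """Build a min-heap of (distance, i, j) for all pairs within eps.

    Hypothetical stand-in for the dense-unit phase: a pair is kept
    only if its Euclidean distance is at most eps.
    """
    heap = []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        if d <= eps:
            heap.append((d, i, j))
    heapq.heapify(heap)
    return heap


def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def single_linkage_from_heap(points, eps, k):
    """Pop distances in ascending order and merge until k clusters remain.

    Returns one representative index (cluster label) per point. If the
    heap runs out first, more than k clusters may remain.
    """
    parent = list(range(len(points)))
    clusters = len(points)
    heap = dense_pair_heap(points, eps)
    while heap and clusters > k:
        _, i, j = heapq.heappop(heap)
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            # The shortest remaining distance joins two different clusters.
            parent[rj] = ri
            clusters -= 1
    return [find(parent, i) for i in range(len(points))]
```

For example, calling `single_linkage_from_heap([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)], eps=2.0, k=2)` groups the first three points together and the last two together, because only pairs within the same group fall inside the assumed radius.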