Web document clustering using hyperlink structures

Authors:
Xiaofeng He; Hongyuan Zha; Chris H.Q. Ding; Horst D. Simon
Affiliations:
Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA and NERSC Division, Lawrence Berkeley National Laboratory, University of Californi ...;Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA;NERSC Division, Lawrence Berkeley National Laboratory, University of California, Berkeley, CA 94720, USA;NERSC Division, Lawrence Berkeley National Laboratory, University of California, Berkeley, CA 94720, USA
Venue:
Computational Statistics & Data Analysis
Year:
2002

Abstract

With the exponential growth of information on the World Wide Web, there is great demand for efficient methods of organizing the large amount of retrieved information. Document clustering plays an important role in information retrieval and taxonomy management for the Web. In this paper we examine three clustering methods: K-means, multi-level METIS, and the recently developed normalized-cut method, using a new approach that combines textual information, hyperlink structure and co-citation relations into a single similarity metric. We find that the normalized-cut method with the new similarity metric is particularly effective, as demonstrated on three datasets of web query results. We also explore some theoretical connections between the normalized-cut method and the K-means method.
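
The abstract does not state the formula for the combined similarity metric, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a plain weighted sum of a symmetric text-similarity matrix, a symmetrized hyperlink matrix and a rescaled co-citation count matrix, then bipartitions the documents by the sign of the second eigenvector of the normalized Laplacian, a common simplification of the normalized-cut procedure. The function names, the weights `w_text`, `w_link` and `w_cocite`, and the sign threshold are all hypothetical choices made for illustration.

```python
import numpy as np


def combined_similarity(text_sim, links, w_text=0.4, w_link=0.3, w_cocite=0.3):
    """Blend text, hyperlink and co-citation evidence into one matrix.

    Assumption: a weighted sum with arbitrary default weights. The source
    does not give the actual combination rule. text_sim must be symmetric.
    """
    text_sim = np.asarray(text_sim, dtype=float)
    links = np.asarray(links, dtype=float)   # links[i, j] = 1 if page i links to page j
    link_sym = np.maximum(links, links.T)    # a link in either direction counts as a tie
    cocite = links.T @ links                 # cocite[i, j] = number of pages linking to both i and j
    np.fill_diagonal(cocite, 0.0)
    if cocite.max() > 0:
        cocite = cocite / cocite.max()       # rescale counts to [0, 1]
    W = w_text * text_sim + w_link * link_sym + w_cocite * cocite
    np.fill_diagonal(W, 0.0)                 # ignore self-similarity
    return W


def normalized_cut_bipartition(W):
    """Split nodes into two groups using the normalized Laplacian.

    Simplification: threshold the generalized eigenvector at zero instead
    of searching for the split point with the smallest normalized cut.
    """
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)
    if np.any(d <= 0):
        raise ValueError("every document needs positive total similarity")
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)          # eigenvalues in ascending order
    y = d_inv_sqrt * vecs[:, 1]              # second-smallest eigenvector, rescaled
    return (y > 0).astype(int)
```

Calling `normalized_cut_bipartition(combined_similarity(text_sim, links))` returns a 0/1 label per document. Applying the same split recursively to each group would be one way to obtain more than two clusters, though the abstract does not say how the paper handles that case.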