Clustering web documents using hierarchical representation with multi-granularity

Authors:
Faliang Huang; Shichao Zhang; Minghua He; Xindong Wu
Affiliations:
Faculty of Software, Fujian Normal University, Fuzhou, China 350007; College of Computer Science and IT, Guangxi Normal University, Guilin, China 541004 and Faculty of Engineering and Information Technology, UTS, Broadway, Australia 2007; Computer Science, Aston University, Birmingham, United Kingdom B4 7ET; Department of Computer Science, University of Vermont, Burlington, USA 05405
Venue:
World Wide Web
Year:
2014

Abstract

Web document cluster analysis plays an important role in information retrieval by organizing large numbers of documents into a small number of meaningful clusters. Traditional web document clustering is based on the Vector Space Model (VSM), which takes into account only two levels of knowledge granularity (document and term) and ignores the paragraph granularity that bridges them. This two-level granularity may lead to unsatisfactory clustering results with "false correlation". To deal with this problem, a Hierarchical Representation Model with Multi-granularity (HRMM), which consists of a five-layer representation of data and a two-phase clustering process, is proposed based on granular computing and article structure theory. To deal with the zero-valued similarity problem resulting from the sparse term-paragraph matrix, an ontology-based strategy and a tolerance-rough-set-based strategy are introduced into HRMM. By using granular computing, structural knowledge hidden in documents can be captured more efficiently and effectively in HRMM, and thus web document clusters of higher quality can be generated. Extensive experiments show that HRMM, HRMM with the tolerance-rough-set strategy, and HRMM with the ontology strategy all significantly outperform VSM and a representative non-VSM-based algorithm, WFP, in terms of the F-Score.
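
The zero-valued similarity problem mentioned in the abstract can be illustrated with a small sketch. This is not the HRMM implementation: it assumes binary term weights, cosine similarity, and a generic tolerance-rough-set enrichment in which two terms are "tolerant" when they co-occur in at least theta paragraphs. The function names, the toy paragraphs, and the threshold are all hypothetical choices made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def tolerance_classes(paragraphs, theta):
    """Map each term to itself plus the terms co-occurring with it in >= theta paragraphs."""
    cooc = Counter()
    for terms in paragraphs:
        for s, t in combinations(sorted(set(terms)), 2):
            cooc[(s, t)] += 1
    vocab = {t for terms in paragraphs for t in terms}
    classes = {t: {t} for t in vocab}
    for (s, t), n in cooc.items():
        if n >= theta:
            classes[s].add(t)
            classes[t].add(s)
    return classes


def upper_approximation(terms, classes):
    """All vocabulary terms whose tolerance class overlaps the paragraph's term set."""
    present = set(terms)
    return {t for t, cls in classes.items() if cls & present}


def binary_vector(terms):
    """Binary term weights (an assumption; any weighting scheme could be used)."""
    return {t: 1.0 for t in terms}


# Toy paragraphs, each given as a list of terms (hypothetical data).
paragraphs = [
    ["cluster", "document"],
    ["granule", "rough"],
    ["cluster", "granule"],
    ["cluster", "granule", "web"],
]

classes = tolerance_classes(paragraphs, theta=2)

# Paragraphs 0 and 1 share no term, so their raw similarity is zero.
raw = cosine(binary_vector(paragraphs[0]), binary_vector(paragraphs[1]))

# After enrichment, both contain "cluster" and "granule", so similarity is nonzero.
enriched = cosine(
    binary_vector(upper_approximation(paragraphs[0], classes)),
    binary_vector(upper_approximation(paragraphs[1], classes)),
)
```

In this toy data, "cluster" and "granule" co-occur in two paragraphs, so each enters the other's tolerance class. The first two paragraphs share no term and have a raw similarity of 0.0, while their enriched versions share two of three terms each and reach a similarity of 2/3.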