Hierarchical Clustering Algorithms for Document Datasets

Authors:
Ying Zhao;George Karypis;Usama Fayyad
Affiliations:
Department of Computer Science and Engineering and Digital Technology Center and Army HPC Research Center, University of Minnesota, Minneapolis 55455;Department of Computer Science and Engineering and Digital Technology Center and Army HPC Research Center, University of Minnesota, Minneapolis 55455;Department of Computer Science and Engineering and Digital Technology Center and Army HPC Research Center, University of Minnesota, Minneapolis 55455
Venue:
Data Mining and Knowledge Discovery
Year:
2005

Abstract

Fast, high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, clustering algorithms that build meaningful hierarchies out of large document collections are ideal tools for interactive visualization and exploration of those collections, as they provide data views that are consistent, predictable, and available at different levels of granularity. This paper focuses on document clustering algorithms that build such hierarchical solutions and (i) presents a comprehensive study of partitional and agglomerative algorithms that use different criterion functions and merging schemes, and (ii) presents a new class of clustering algorithms, called constrained agglomerative algorithms, which combine features of both partitional and agglomerative approaches. This combination allows them to reduce the early-stage errors made by agglomerative methods and hence improve the quality of the clustering solutions. The experimental evaluation shows that, contrary to common belief, partitional algorithms always lead to better solutions than agglomerative algorithms, making them ideal for clustering large document collections because of not only their relatively low computational requirements but also their higher clustering quality. Furthermore, the constrained agglomerative methods consistently lead to better solutions than agglomerative methods alone, and in many cases they outperform partitional methods as well.
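To make the idea of a constrained agglomerative algorithm concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a two-phase scheme: a partitional step first splits the documents into k constraint clusters, agglomeration then runs only inside each constraint cluster, and the resulting subtrees are finally agglomerated into one hierarchy. The abstract does not fix the details, so the specific choices here (spherical k-means with farthest-first initialisation, average-link merging on cosine similarity) and all function names are assumptions made for illustration.

```python
import math
from itertools import combinations


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(docs, k, iters=20):
    """Partitional step (assumed): assign each document to its most similar centroid."""
    # Farthest-first initialisation, starting from document 0 (an assumption).
    centroids = [docs[0]]
    while len(centroids) < k:
        centroids.append(min(docs, key=lambda d: max(dot(d, c) for c in centroids)))
    labels = [0] * len(docs)
    for _ in range(iters):
        labels = [max(range(k), key=lambda j: dot(d, centroids[j])) for d in docs]
        for j in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == j]
            if members:
                centroids[j] = normalize([sum(col) for col in zip(*members)])
    return labels


def agglomerate(nodes, docs):
    """Average-link agglomeration of nodes into one binary tree.

    Each node is (tree, member_ids); a tree is a document id or a pair of trees.
    """
    nodes = list(nodes)

    def avg_sim(a, b):
        total = sum(dot(docs[i], docs[j]) for i in a[1] for j in b[1])
        return total / (len(a[1]) * len(b[1]))

    while len(nodes) > 1:
        # Merge the pair of nodes with the highest average pairwise similarity.
        a, b = max(combinations(nodes, 2), key=lambda pair: avg_sim(*pair))
        nodes = [n for n in nodes if n is not a and n is not b]
        nodes.append(((a[0], b[0]), a[1] + b[1]))
    return nodes[0]


def constrained_agglomerative(docs, k):
    """Build a hierarchy whose merges never cross a constraint cluster early."""
    docs = [normalize(d) for d in docs]
    labels = spherical_kmeans(docs, k)
    subtrees = []
    for j in range(k):
        leaves = [(i, [i]) for i, lab in enumerate(labels) if lab == j]
        if leaves:  # skip constraint clusters that ended up empty
            subtrees.append(agglomerate(leaves, docs))
    tree, _ = agglomerate(subtrees, docs)  # join the per-cluster subtrees
    return tree, labels
```

Calling `constrained_agglomerative(vectors, k)` on a list of term-weight vectors returns a nested-tuple tree of document indices together with the constraint-cluster labels. Because early merges are confined to a constraint cluster, a single misleading pairwise similarity between documents from different clusters cannot be locked into the bottom of the tree, which is the early-stage error the abstract refers to.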