Evaluation of hierarchical clustering algorithms for document datasets

Authors:
Ying Zhao; George Karypis
Affiliations:
University of Minnesota, Minneapolis, MN; University of Minnesota, Minneapolis, MN
Venue:
Proceedings of the eleventh international conference on Information and knowledge management
Year:
2002

Abstract

Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical clustering solutions provide a view of the data at different levels of granularity, making them ideal for people to visualize and interactively explore large document collections. In this paper we evaluate different partitional and agglomerative approaches for hierarchical clustering. Our experimental evaluation showed that partitional algorithms always lead to better clustering solutions than agglomerative algorithms, which suggests that partitional clustering algorithms are well suited for clustering large document datasets, owing not only to their relatively low computational requirements but also to their comparable or even better clustering performance. We also present a new class of clustering algorithms, called constrained agglomerative algorithms, that combine features of both partitional and agglomerative algorithms. Our experimental results showed that they consistently lead to better hierarchical solutions than agglomerative or partitional algorithms alone.
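The abstract names constrained agglomerative clustering without spelling out its steps, so the following is only a minimal sketch of one plausible reading: a partitional pass first produces a few "constraint" clusters, agglomeration then builds a subtree inside each cluster, and a final agglomeration joins the subtrees into one hierarchy. Everything concrete here is an assumption made for illustration rather than the authors' implementation: the cosine similarity, the average-link (UPGMA) merge rule, the toy spherical k-means used as the partitional step, and all function names.

```python
import math
from itertools import combinations

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))  # inputs are assumed unit length

def leaves(tree):
    """Flatten a nested-tuple tree into the document indices it contains."""
    return [tree] if isinstance(tree, int) else leaves(tree[0]) + leaves(tree[1])

def average_link(docs, t1, t2):
    """UPGMA score: mean pairwise cosine similarity between two subtrees."""
    a, b = leaves(t1), leaves(t2)
    return sum(cosine(docs[i], docs[j]) for i in a for j in b) / (len(a) * len(b))

def agglomerate(docs, trees):
    """Repeatedly merge the most similar pair until one tree remains."""
    trees = list(trees)
    while len(trees) > 1:
        i, j = max(combinations(range(len(trees)), 2),
                   key=lambda p: average_link(docs, trees[p[0]], trees[p[1]]))
        merged = (trees[i], trees[j])
        trees = [t for k, t in enumerate(trees) if k not in (i, j)] + [merged]
    return trees[0]

def partition(docs, k, iters=10):
    """Toy spherical k-means with deterministic farthest-first seeding.

    Assumes 1 <= k <= len(docs). This stands in for whatever partitional
    algorithm is actually used; it is not taken from the paper.
    """
    seeds = [0]
    while len(seeds) < k:
        seeds.append(min((i for i in range(len(docs)) if i not in seeds),
                         key=lambda i: max(cosine(docs[i], docs[s]) for s in seeds)))
    centroids = [docs[s] for s in seeds]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for i, d in enumerate(docs):
            clusters[max(range(k), key=lambda c: cosine(d, centroids[c]))].append(i)
        centroids = [normalize([sum(col) for col in zip(*(docs[i] for i in c))]) if c
                     else centroids[idx] for idx, c in enumerate(clusters)]
    return [c for c in clusters if c]

def constrained_agglomerative(vectors, k):
    """Partition first, agglomerate inside each cluster, then join the subtrees."""
    docs = [normalize(v) for v in vectors]
    constraint_clusters = partition(docs, k)
    subtrees = [agglomerate(docs, cluster) for cluster in constraint_clusters]
    return agglomerate(docs, subtrees)
```

Calling `constrained_agglomerative(vectors, k)` on a list of term-weight vectors returns a binary tree encoded as nested tuples of document indices. Because merges never cross a constraint cluster until the final stage, the partitional result fixes the upper levels of the tree, while agglomeration decides only the local structure.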