Mining fuzzy frequent itemsets for hierarchical document clustering

Authors:
Chun-Ling Chen;Frank S. C. Tseng;Tyne Liang
Affiliations:
Department of Computer Science, National Chiao Tung University, HsinChu 300, Taiwan, ROC;Dept. of Information Management, National Kaohsiung 1st University of Science and Technology, YenChao, Kaohsiung 824, Taiwan, ROC;Department of Computer Science, National Chiao Tung University, HsinChu 300, Taiwan, ROC
Venue:
Information Processing and Management: an International Journal
Year:
2010

Abstract

As the number of text documents on the Internet grows explosively, hierarchical document clustering has proven useful for grouping similar documents in a wide range of applications. However, most document clustering methods still struggle with high dimensionality, scalability, accuracy, and the generation of meaningful cluster labels. In this paper, we present an effective Fuzzy Frequent Itemset-Based Hierarchical Clustering (F^2IHC) approach, which uses a fuzzy association rule mining algorithm to improve the clustering accuracy of the Frequent Itemset-Based Hierarchical Clustering (FIHC) method. In our approach, key terms are first extracted from the document set, and each document is pre-processed into the representation required by the subsequent mining process. Then, a fuzzy association rule mining algorithm for text is employed to discover a set of highly related fuzzy frequent itemsets, whose key terms are regarded as the labels of the candidate clusters. Finally, the documents are clustered into a hierarchical cluster tree by referring to these candidate clusters. We evaluated the approach on the Classic4, Hitech, Re0, Reuters, and Wap datasets. The experimental results show that our approach not only fully retains the merits of FIHC, but also improves on its clustering accuracy.
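The mining step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the membership functions, the fuzzy regions, or the support definition, so everything below is an assumption made for illustration: two hypothetical regions ("low" and "high") over raw term counts, the common min-based fuzzy support averaged over documents, and an Apriori-style level-wise search. It is not the authors' F^2IHC implementation.

```python
from itertools import combinations

# Hypothetical membership functions over raw term counts (assumed, not from the paper).
def low(f):
    return 0.0 if f <= 0 else max(0.0, min(1.0, (4 - f) / 3))

def high(f):
    return max(0.0, min(1.0, (f - 1) / 3))

REGIONS = {"low": low, "high": high}

def fuzzify(doc):
    """Map {term: count} to {(term, region): membership degree}, keeping degrees > 0."""
    return {(t, r): fn(c) for t, c in doc.items()
            for r, fn in REGIONS.items() if fn(c) > 0}

def fuzzy_support(itemset, fuzzy_docs):
    """Average over documents of the minimum membership among the itemset's items."""
    total = sum(min(d.get(item, 0.0) for item in itemset) for d in fuzzy_docs)
    return total / len(fuzzy_docs)

def mine_fuzzy_frequent_itemsets(docs, min_sup, max_size=3):
    """Level-wise search for itemsets whose fuzzy support reaches min_sup."""
    fuzzy_docs = [fuzzify(d) for d in docs]
    items = sorted({item for d in fuzzy_docs for item in d})
    frequent = {}
    level = [frozenset([i]) for i in items]
    size = 1
    while level and size <= max_size:
        kept = {}
        for s in level:
            sup = fuzzy_support(s, fuzzy_docs)
            if sup >= min_sup:
                kept[s] = sup
        frequent.update(kept)
        size += 1
        pool = sorted({i for s in kept for i in s})
        level = [
            frozenset(c) for c in combinations(pool, size)
            if len({t for t, _ in c}) == size  # at most one region per term
            and all(frozenset(sub) in kept for sub in combinations(c, size - 1))
        ]
    return frequent

# Toy documents as term-count dictionaries (made-up data).
docs = [
    {"fuzzy": 4, "cluster": 4},
    {"fuzzy": 4, "cluster": 3},
    {"fuzzy": 1, "stream": 4},
    {"cluster": 4, "stream": 1},
]
result = mine_fuzzy_frequent_itemsets(docs, min_sup=0.4)
for itemset, support in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(support, 3))
```

In a pipeline like the one the abstract outlines, each surviving itemset would serve as the label of a candidate cluster, and documents would then be assigned to candidates and arranged into a tree; that assignment step is not sketched here.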