Text clustering using frequent itemsets

Authors:
Wen Zhang;Taketoshi Yoshida;Xijin Tang;Qing Wang
Affiliations:
Lab for Internet Software Technologies, Institute of Software, Chinese Academy of Sciences, Beijing 100190, PR China;School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Tatsunokuchi, Ishikawa 923-1292, Japan;Institute of Systems Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, PR China;Lab for Internet Software Technologies, Institute of Software, Chinese Academy of Sciences, Beijing 100190, PR China
Venue:
Knowledge-Based Systems
Year:
2010

Citing (11):
Competitive learning algorithms for vector quantization. Neural Networks
Fast and effective text mining using linear-time document clustering. KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
On the merits of building categorization systems by supervised clustering. KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Mining frequent patterns without candidate generation. SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Classifying text documents by associating terms with text categories. ADC '02 Proceedings of the 13th Australasian database conference - Volume 5
Frequent term-based text clustering. Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
BIDE: Efficient Mining of Frequent Closed Sequences. ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Data Mining: Concepts and Techniques
Text document clustering based on frequent word meaning sequences. Data & Knowledge Engineering
Improving effectiveness of mutual information for substantival multiword expression extraction. Expert Systems with Applications: An International Journal
Document clustering based on maximal frequent sequences. FinTAL'06 Proceedings of the 5th international conference on Advances in Natural Language Processing

Cited by (8):
Semi-supervised locally discriminant projection for classification and recognition. Knowledge-Based Systems
Finding association rules in semantic web data. Knowledge-Based Systems
A fuzzy k-prototype clustering algorithm for mixed numeric and categorical data. Knowledge-Based Systems
An efficient incremental method for generating equivalence groups of search results in information retrieval and queries. Knowledge-Based Systems
Decision support for improved service effectiveness using domain aware text mining. Knowledge-Based Systems
Semantically-grounded construction of centroids for datasets with textual attributes. Knowledge-Based Systems
A lattice-based approach for mining most generalization association rules. Knowledge-Based Systems
Clustering Software Components for Component Reuse and Program Restructuring. Proceedings of the Second International Conference on Innovative Computing and Cloud Computing

Abstract

Frequent itemsets originate from association rule mining. Recently, they have been applied to text mining tasks such as document categorization and clustering. In this paper, we study text clustering using frequent itemsets. The contribution of this paper is threefold. First, we review existing methods of document clustering using frequent patterns. Second, we propose a new method for document clustering called Maximum Capturing. Maximum Capturing includes two procedures: constructing document clusters and assigning cluster topics. We develop three versions of Maximum Capturing based on three similarity measures. We also propose a normalization process based on frequency sensitive competitive learning that merges cluster candidates into a predefined number of clusters. Third, experiments are carried out to evaluate the proposed method in comparison with the CFWS, CMS, FTC and FIHC methods. The experimental results show that Maximum Capturing clusters better than these methods. In particular, Maximum Capturing with documents represented by individual words and asymmetrical binary similarity as the similarity measure achieves the best performance. Moreover, the topics produced by Maximum Capturing distinguish clusters from each other and can be used as labels of document clusters.
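
To make the terms in the abstract concrete, the sketch below shows one way documents could be compared through the frequent itemsets they share. It is an illustration under assumptions, not the paper's Maximum Capturing procedure: the abstract does not define its similarity measures, so the Jaccard coefficient is used here as the common textbook form of asymmetric binary similarity, and the names frequent_itemsets, itemset_profile, jaccard, min_support and max_size are all hypothetical. The itemset miner is a brute-force enumeration suitable only for toy inputs.

```python
from itertools import combinations


def frequent_itemsets(docs, min_support, max_size=2):
    """Brute-force mining: word sets that occur in at least min_support documents."""
    doc_sets = [set(d) for d in docs]
    vocabulary = sorted(set().union(*doc_sets))
    frequent = []
    for size in range(1, max_size + 1):
        for candidate in combinations(vocabulary, size):
            support = sum(1 for d in doc_sets if d.issuperset(candidate))
            if support >= min_support:
                frequent.append(frozenset(candidate))
    return frequent


def itemset_profile(doc, itemsets):
    """Return the frequent itemsets that are fully contained in one document."""
    words = set(doc)
    return {s for s in itemsets if s <= words}


def jaccard(a, b):
    """Assumed asymmetric binary similarity: shared items over items in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy corpus (invented for illustration): each document is a list of words.
docs = [
    ["text", "clustering", "itemset"],
    ["text", "clustering", "topic"],
    ["neural", "network", "learning"],
]
itemsets = frequent_itemsets(docs, min_support=2)
profiles = [itemset_profile(d, itemsets) for d in docs]

# Pairwise similarity between documents, based on shared frequent itemsets.
for i, j in combinations(range(len(docs)), 2):
    print(i, j, jaccard(profiles[i], profiles[j]))
```

In this toy corpus the first two documents share every frequent itemset and the third shares none, so a clustering step built on such a similarity matrix would group the first two together. Absent itemsets are ignored by the measure, which is what makes it asymmetric in the binary-attribute sense.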