Most existing text clustering algorithms use the vector space model, which treats documents as bags of words. Word sequences in the documents are therefore ignored, even though the meaning of natural language depends strongly on them. In this paper, we propose two new text clustering algorithms, named Clustering based on Frequent Word Sequences (CFWS) and Clustering based on Frequent Word Meaning Sequences (CFWMS). A word is the word form as it appears in the document, and a word meaning is the concept expressed by synonymous word forms. A word (meaning) sequence is frequent if it occurs in more than a certain percentage of the documents in the text database. The frequent word (meaning) sequences can provide compact and valuable information about those text documents. For experiments, we used the Reuters-21578 text collection, the CISI documents of the Classic data set [Classic data set, ftp://ftp.cs.cornell.edu/pub/smart/], and a corpus of the Text Retrieval Conference (TREC) [High Accuracy Retrieval from Documents (HARD) Track of Text Retrieval Conference, 2004]. Our experimental results show that CFWS and CFWMS have much better clustering accuracy than the Bisecting k-means (BKM) [M. Steinbach, G. Karypis, V. Kumar, A Comparison of Document Clustering Techniques, KDD-2000 Workshop on Text Mining, 2000], modified bisecting k-means using background knowledge (BBK) [A. Hotho, S. Staab, G. Stumme, Ontologies improve text document clustering, in: Proceedings of the 3rd IEEE International Conference on Data Mining, 2003, pp. 541-544] and Frequent Itemset-based Hierarchical Clustering (FIHC) [B.C.M. Fung, K. Wang, M. Ester, Hierarchical document clustering using frequent itemsets, in: Proceedings of SIAM International Conference on Data Mining, 2003] algorithms.
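
The definition of a frequent word sequence given above can be illustrated with a small sketch. It is not the CFWS algorithm itself: it only counts contiguous word sequences of a fixed length and keeps those that occur in more than a given fraction of the documents. The function name frequent_word_sequences, the parameters min_support and length, the whitespace tokenization, and the sample documents are all assumptions made for this illustration.

```python
from collections import Counter


def frequent_word_sequences(documents, min_support, length=2):
    """Return contiguous word sequences of `length` words that occur in
    more than `min_support` (a fraction between 0 and 1) of the documents.

    Illustrative simplification only: fixed-length contiguous sequences,
    lowercase whitespace tokenization, no stop-word removal.
    """
    doc_counts = Counter()
    for text in documents:
        words = text.lower().split()
        # A set ensures each sequence is counted at most once per document.
        seen = {
            tuple(words[i:i + length])
            for i in range(len(words) - length + 1)
        }
        doc_counts.update(seen)

    # "Frequent" means present in strictly more than this many documents.
    threshold = min_support * len(documents)
    return {seq: n for seq, n in doc_counts.items() if n > threshold}


# Hypothetical sample documents, invented for this example.
docs = [
    "stock prices rise on strong earnings",
    "oil stock prices rise again",
    "central bank holds interest rates",
    "stock prices fall as interest rates climb",
]

result = frequent_word_sequences(docs, min_support=0.5)
print(result)  # {('stock', 'prices'): 3}
```

With a threshold of 50%, only the sequence "stock prices" qualifies, since it appears in three of the four documents. Sequences found in exactly half of the documents, such as "interest rates", are excluded because the definition requires more than the stated percentage.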