Text document clustering based on frequent word meaning sequences

Authors:
Yanjun Li;Soon M. Chung;John D. Holt
Affiliations:
Department of Computer and Information Sciences, Fordham University, Bronx, NY 10458, USA;Department of Computer Science and Engineering, Wright State University, Dayton, OH 45435, USA;Department of Computer Science and Engineering, Wright State University, Dayton, OH 45435, USA
Venue:
Data & Knowledge Engineering
Year:
2008

Abstract

Most existing text clustering algorithms use the vector space model, which treats documents as bags of words. As a result, word sequences in the documents are ignored, even though the meaning of natural language strongly depends on them. In this paper, we propose two new text clustering algorithms, named Clustering based on Frequent Word Sequences (CFWS) and Clustering based on Frequent Word Meaning Sequences (CFWMS). A word is the word form appearing in the document, and a word meaning is the concept expressed by synonymous word forms. A word (meaning) sequence is frequent if it occurs in more than a certain percentage of the documents in the text database. Frequent word (meaning) sequences can provide compact and valuable information about those text documents. For experiments, we used the Reuters-21578 text collection, the CISI documents of the Classic data set [Classic data set, ftp://ftp.cs.cornell.edu/pub/smart/], and a corpus of the Text Retrieval Conference (TREC) [High Accuracy Retrieval from Documents (HARD) Track of Text Retrieval Conference, 2004]. Our experimental results show that CFWS and CFWMS have much better clustering accuracy than Bisecting k-means (BKM) [M. Steinbach, G. Karypis, V. Kumar, A Comparison of Document Clustering Techniques, KDD-2000 Workshop on Text Mining, 2000], a modified bisecting k-means using background knowledge (BBK) [A. Hotho, S. Staab, G. Stumme, Ontologies improve text document clustering, in: Proceedings of the 3rd IEEE International Conference on Data Mining, 2003, pp. 541-544], and Frequent Itemset-based Hierarchical Clustering (FIHC) [B.C.M. Fung, K. Wang, M. Ester, Hierarchical document clustering using frequent itemsets, in: Proceedings of SIAM International Conference on Data Mining, 2003] algorithms.
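
The definition of a frequent word (meaning) sequence can be illustrated with a small Python sketch. This is not the CFWS or CFWMS algorithm itself, only a toy illustration of the support threshold under simplifying assumptions: sequences are taken to be contiguous runs of up to `max_len` tokens, tokenization is a plain whitespace split, and word meanings come from a hypothetical hand-written synonym table (`meaning_of`). The function name, parameters, and sample documents are all invented for this example.

```python
from collections import Counter


def frequent_sequences(docs, min_support, max_len=3, meaning_of=None):
    """Return {sequence: support} for sequences found in more than
    `min_support` (a fraction) of the documents.

    Assumption: a "sequence" here is a contiguous run of 1..max_len tokens.
    `meaning_of` is a hypothetical word -> concept table; words that are
    missing from it are kept as they are.
    """
    meaning_of = meaning_of or {}
    doc_freq = Counter()
    for text in docs:
        tokens = [meaning_of.get(w, w) for w in text.lower().split()]
        seen = set()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(tuple(tokens[i:i + n]))
        doc_freq.update(seen)  # each sequence counts once per document
    threshold = min_support * len(docs)
    return {seq: count / len(docs)
            for seq, count in doc_freq.items() if count > threshold}


# Toy corpus, invented for illustration only.
docs = [
    "oil prices rise sharply",
    "crude prices rise again",
    "oil prices fall",
    "stock market gains",
]

# Word level: "oil" and "crude" are different word forms.
word_level = frequent_sequences(docs, min_support=0.5)

# Meaning level: map the synonym "crude" onto the concept "oil".
meaning_level = frequent_sequences(docs, min_support=0.5,
                                   meaning_of={"crude": "oil"})

print(word_level)
print(meaning_level)
```

With a 50% threshold, only the single word "prices" is frequent at the word level, because "oil" and "crude" split the support between them. Once both word forms map to one concept, the two-token sequence ("oil", "prices") occurs in three of the four documents and becomes frequent, which is the kind of information a bag-of-words representation discards.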