Non-segmented Document Clustering Using Self-Organizing Map and Frequent Max Substring Technique

Authors:
Todsanai Chumwatana;Kok Wai Wong;Hong Xie
Affiliations:
School of Information Technology, Murdoch University, Murdoch 6150 (all authors)
Venue:
ICONIP '09 Proceedings of the 16th International Conference on Neural Information Processing: Part II
Year:
2009

Abstract

This paper proposes a method for clustering non-segmented documents that combines a self-organizing map (SOM) with a frequent max substring mining technique, with the aim of making information retrieval more efficient. The technique appears to be a promising alternative for clustering non-segmented text documents, and it is illustrated with an experiment on clustering Thai text documents. Frequent max substring mining is first applied to the non-segmented Thai text to discover the patterns of interest, called frequent max substrings (FM). These patterns, together with their numbers of occurrences, are then used as indexing terms to form a vector for each document. The SOM is then trained on the document vectors to generate a document cluster map, which can be used to find the documents relevant to a user's query more efficiently.
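
The three stages named in the abstract (substring mining, document vectors, SOM) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. The mining step is a naive stand-in that keeps a substring when it occurs at least min_freq times across the collection and no one-character extension of it occurs equally often; the paper's exact definition of a frequent max substring and its mining algorithm may differ. The function names, the parameters, the toy Latin-alphabet strings (standing in for unsegmented Thai text) and the SOM settings are all hypothetical.

```python
import math
import random
from collections import Counter


def count_substrings(text, max_len):
    """Count every substring of length 1..max_len, overlaps included."""
    counts = Counter()
    for i in range(len(text)):
        for j in range(i + 1, min(len(text), i + max_len) + 1):
            counts[text[i:j]] += 1
    return counts


def frequent_max_substrings(docs, min_freq=2, min_len=2, max_len=8):
    """Naive stand-in for FM mining (assumed definition, see lead-in)."""
    total = Counter()
    for doc in docs:
        total.update(count_substrings(doc, max_len))
    frequent = {s: c for s, c in total.items() if c >= min_freq}
    # A substring is dropped when a one-character extension is equally frequent.
    subsumed = set()
    for s, c in frequent.items():
        for shorter in (s[:-1], s[1:]):
            if shorter and frequent.get(shorter) == c:
                subsumed.add(shorter)
    return sorted(s for s in frequent if len(s) >= min_len and s not in subsumed)


def document_vector(doc, terms):
    """Number of (overlapping) occurrences of each index term in doc."""
    return [
        sum(1 for i in range(len(doc) - len(t) + 1) if doc.startswith(t, i))
        for t in terms
    ]


def best_matching_unit(weights, vector):
    """Grid position of the node whose weights are closest to vector."""
    return min(
        weights,
        key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], vector)),
    )


def train_som(vectors, rows=2, cols=2, epochs=50, lr0=0.5, seed=0):
    """Minimal SOM: Gaussian neighbourhood, linearly decaying rate and radius."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    sigma0 = max(rows, cols) / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)
        sigma = max(sigma0 * (1 - frac), 0.5)
        for v in vectors:
            bmu = best_matching_unit(weights, v)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2 * sigma ** 2))
                for k in range(dim):
                    w[k] += lr * h * (v[k] - w[k])
    return weights


# Toy unsegmented "documents" (hypothetical data, not from the paper).
docs = ["abcabc", "abcxyz", "xyzxyz"]
terms = frequent_max_substrings(docs)                # ['abc', 'xyz']
vectors = [document_vector(d, terms) for d in docs]  # [[2, 0], [1, 1], [0, 2]]
som = train_som(vectors)
for doc, vec in zip(docs, vectors):
    print(doc, vec, "-> map node", best_matching_unit(som, vec))
```

In a retrieval setting, a query could be converted to a vector with the same document_vector function and looked up with best_matching_unit, so that only the documents mapped to that node (and perhaps its neighbours) need to be examined. Whether the paper handles queries this way is an assumption here.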