Document representation based on maximal frequent sequence sets

Authors:
Edith Hernández-Reyes;J. Fco. Martínez-Trinidad;J. A. Carrasco-Ochoa;René A. García-Hernández
Affiliations:
National Institute for Astrophysics, Optics and Electronics, Puebla, México;National Institute for Astrophysics, Optics and Electronics, Puebla, México;National Institute for Astrophysics, Optics and Electronics, Puebla, México;National Institute for Astrophysics, Optics and Electronics, Puebla, México
Venue:
CIARP'06 Proceedings of the 11th Iberoamerican conference on Progress in Pattern Recognition, Image Analysis and Applications
Year:
2006

Abstract

In document clustering, documents are commonly represented through the vector space model as word vectors whose features correspond to the words of the documents. However, a document set contains many distinct words, so the vectors can become very large. In addition, the vector space model ignores word order, which could be useful for grouping similar documents. To mitigate these drawbacks, we propose a new document representation in which each document is represented as the set of its maximal frequent sequences. The proposed representation is applied to document clustering, and the quality of the clustering is evaluated through internal and external measures. The results are compared with those obtained with the vector space model.
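
The idea of representing a document by a set of maximal frequent sequences can be illustrated with a minimal sketch. The abstract does not define the sequences, the frequency threshold, or the similarity measure, so everything below is an assumption made for illustration: a sequence is taken to be a contiguous run of words, it counts as frequent when it occurs in at least `min_support` documents, it is maximal when no longer frequent sequence contains it, and two documents are compared with the Jaccard coefficient. All function names are hypothetical and this is not the authors' algorithm.

```python
from collections import defaultdict


def frequent_ngrams(docs, n, min_support):
    """Contiguous word n-grams that occur in at least min_support documents (assumed definition)."""
    support = defaultdict(set)
    for doc_id, words in enumerate(docs):
        for i in range(len(words) - n + 1):
            support[tuple(words[i:i + n])].add(doc_id)
    return {gram for gram, ids in support.items() if len(ids) >= min_support}


def is_contiguous_subsequence(short, long):
    """True if the tuple `short` appears as a contiguous run inside the tuple `long`."""
    k = len(short)
    return any(long[i:i + k] == short for i in range(len(long) - k + 1))


def maximal_frequent_sequences(docs, min_support):
    """Frequent n-grams that are not contained in any longer frequent n-gram."""
    frequent = set()
    n = 1
    while True:
        level = frequent_ngrams(docs, n, min_support)
        if not level:  # no frequent n-grams of this length, so none longer either
            break
        frequent |= level
        n += 1
    return {
        s for s in frequent
        if not any(len(t) > len(s) and is_contiguous_subsequence(s, t) for t in frequent)
    }


def represent(doc, mfs):
    """Represent one document as the set of maximal frequent sequences it contains."""
    return {s for s in mfs if is_contiguous_subsequence(s, tuple(doc))}


def jaccard(a, b):
    """Jaccard similarity between two sets (an assumed choice of measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy collection, invented for this example only.
docs = [
    "the vector space model ignores word order".split(),
    "the vector space model is large".split(),
    "word order helps clustering".split(),
]
mfs = maximal_frequent_sequences(docs, min_support=2)
reps = [represent(d, mfs) for d in docs]
```

On this toy collection the shorter frequent sequences such as `("vector", "space")` are absorbed into the longer `("the", "vector", "space", "model")`, so each document is described by a handful of ordered word sequences rather than one feature per distinct word. A pairwise similarity such as `jaccard(reps[0], reps[1])` could then feed any standard clustering algorithm.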