A new algorithm for fast discovery of maximal sequential patterns in a document collection

  • Authors:
  • René Arnulfo García-Hernández;José Francisco Martínez-Trinidad;Jesús Ariel Carrasco-Ochoa

  • Affiliations:
  • National Institute of Astrophysics, Optics and Electronics (INAOE), Puebla, México;National Institute of Astrophysics, Optics and Electronics (INAOE), Puebla, México;National Institute of Astrophysics, Optics and Electronics (INAOE), Puebla, México

  • Venue:
  • CICLing'06 Proceedings of the 7th international conference on Computational Linguistics and Intelligent Text Processing
  • Year:
  • 2006

Abstract

Sequential pattern mining is an important tool for many data mining tasks and has broad applications. However, only a few efforts have been made to extract this kind of pattern from textual databases. Finding such textual patterns is important for text mining because they can be extracted from text independently of the language. They are also human-readable descriptors of the text that preserve the sequential order of the words in the document. However, discovering sequential patterns in a database of documents presents special characteristics that make it intractable for most apriori-like candidate-generation-and-test approaches. Recent studies indicate that the pattern-growth methodology could speed up sequential pattern mining. In this paper we propose a pattern-growth-based algorithm (DIMASP) to discover all the maximal sequential patterns in a document database. Furthermore, DIMASP is incremental and independent of the support threshold. Finally, we compare the performance of DIMASP against the GSP, DELISP, GenPrefixSpan and cSPADE algorithms.
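
To make the notion of a maximal sequential pattern concrete, the following is a minimal brute-force sketch in Python. It is not DIMASP and does not use pattern growth; it only illustrates the kind of output such an algorithm produces. It rests on assumptions the abstract does not state: a pattern is taken to be a sequence of consecutive words, its support is the number of documents that contain it, and a frequent pattern is maximal when it is not contained in any other frequent pattern. All function names are hypothetical.

```python
from collections import defaultdict


def frequent_contiguous_sequences(docs, min_support):
    """Return every contiguous word sequence found in at least min_support documents.

    Assumption: support is counted per document, not per occurrence.
    """
    occurrences = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        # Enumerate every contiguous word window of the document.
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                occurrences[tuple(words[start:end])].add(doc_id)
    return {seq for seq, ids in occurrences.items() if len(ids) >= min_support}


def is_contiguous_subsequence(short, long):
    """True if the tuple `short` appears as a consecutive run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def maximal_sequential_patterns(docs, min_support):
    """Keep only the frequent sequences not contained in another frequent sequence."""
    frequent = frequent_contiguous_sequences(docs, min_support)
    return {
        seq
        for seq in frequent
        if not any(
            seq != other and is_contiguous_subsequence(seq, other)
            for other in frequent
        )
    }


docs = ["the quick brown fox", "a quick brown dog", "the quick cat"]
patterns = maximal_sequential_patterns(docs, min_support=2)
# patterns == {("the", "quick"), ("quick", "brown")}
```

With a threshold of two documents, the single words "the", "quick" and "brown" are frequent but not maximal, since each is contained in a longer frequent sequence. This exhaustive enumeration grows quadratically with document length and compares every pair of frequent sequences, which is the kind of cost that motivates a more efficient approach.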