A new algorithm for fast discovery of maximal sequential patterns in a document collection

Authors:
René Arnulfo García-Hernández;José Francisco Martínez-Trinidad;Jesús Ariel Carrasco-Ochoa
Affiliations:
National Institute of Astrophysics, Optics and Electronics (INAOE), Puebla, México;National Institute of Astrophysics, Optics and Electronics (INAOE), Puebla, México;National Institute of Astrophysics, Optics and Electronics (INAOE), Puebla, México
Venue:
CICLing'06 Proceedings of the 7th international conference on Computational Linguistics and Intelligent Text Processing
Year:
2006

Abstract

Sequential pattern mining is an important tool for solving many data mining tasks and has broad applications. However, only a few efforts have been made to extract this kind of pattern from textual databases. Finding such patterns matters for text mining because they can be extracted from text independently of the language. They are also human-readable descriptors of the text that preserve the sequential order of the words in the document. However, discovering sequential patterns in a database of documents presents special characteristics that make it intractable for most apriori-like candidate-generation-and-test approaches. Recent studies indicate that the pattern-growth methodology could speed up sequential pattern mining. In this paper we propose a pattern-growth based algorithm (DIMASP) to discover all the maximal sequential patterns in a document database. Furthermore, DIMASP is incremental and independent of the support threshold. Finally, we compare the performance of DIMASP against the GSP, DELISP, GenPrefixSpan and cSPADE algorithms.
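
To make the notion of a maximal sequential pattern concrete, the following is a minimal brute-force sketch in Python. It is not DIMASP and does not use pattern growth. It rests on assumptions that the abstract does not state: a pattern is taken to be a contiguous sequence of whitespace-separated words, its support is the number of documents that contain it, and a frequent pattern is maximal when it is not contained in any longer frequent pattern. The function names (`frequent_ngrams`, `is_contiguous_subsequence`, `maximal_patterns`) and the sample documents are hypothetical.

```python
from collections import defaultdict


def frequent_ngrams(docs, min_support):
    """Map each contiguous word sequence to its document support,
    keeping only sequences found in at least min_support documents.
    Assumption: patterns are contiguous; support counts documents."""
    containing_docs = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        words = doc.split()
        # Enumerate every contiguous word sequence of the document.
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                containing_docs[tuple(words[i:j])].add(doc_id)
    return {
        pattern: len(ids)
        for pattern, ids in containing_docs.items()
        if len(ids) >= min_support
    }


def is_contiguous_subsequence(short, long):
    """True if the tuple `short` occurs as a contiguous run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def maximal_patterns(docs, min_support):
    """Frequent patterns that are not contained in another frequent pattern."""
    frequent = frequent_ngrams(docs, min_support)
    return sorted(
        p for p in frequent
        if not any(p != q and is_contiguous_subsequence(p, q) for q in frequent)
    )


# Hypothetical toy collection, for illustration only.
docs = [
    "the quick brown fox",
    "a quick brown fox jumps",
    "the quick brown dog",
]
print(maximal_patterns(docs, min_support=2))
print(maximal_patterns(docs, min_support=3))
```

With a threshold of 2, the sketch keeps "quick brown fox" and "the quick brown" and discards their frequent sub-sequences such as "quick brown". With a threshold of 3, only "quick brown" remains. Enumerating every word sequence of every document is quadratic in document length, which is one way to see why exhaustive candidate enumeration scales poorly on text.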