Searching similar segments over textual event sequences

Authors:
Liang Tang;Tao Li;Shu-Ching Chen;Shunzhi Zhu
Affiliations:
Florida International University, Miami, FL, USA;Florida International University, Miami, FL, USA;Florida International University, Miami, FL, USA;Xiamen University of Technology, Xiamen, China
Venue:
Proceedings of the 22nd ACM International Conference on Information & Knowledge Management
Year:
2013


Abstract

Sequential data is prevalent in many scientific and commercial applications, such as bioinformatics, system security, and networking. Similarity search has been widely studied for symbolic and time series data, in which each data object is a symbol or a numeric value. Textual event sequences are sequences of events in which each object is a message describing an event. For example, system logs are typical textual event sequences: each event is a textual message recording internal system operations, statuses, configuration modifications, or execution errors. Similar segments of an event sequence reveal similar system behaviors in the past, which help system administrators diagnose system problems. Existing search indexes for textual data focus only on unordered data. Substring matching methods can efficiently find matching segments over a sequence, but the elements of their sequences are single values rather than texts. In this paper, we propose a method, the suffix matrix, for efficiently searching similar segments over textual event sequences. It integrates two disparate techniques: locality-sensitive hashing and suffix arrays. The method also supports k-dissimilar segment search, where a k-dissimilar segment is a segment that has at most k events dissimilar to the query sequence. By using the random sequence mask proposed in this paper, the method reaches all k-dissimilar segments with high probability without much additional search cost. We conduct experiments on real system log data, and the results show that the proposed method outperforms alternative methods built on existing techniques.
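
The definition of a k-dissimilar segment can be made concrete with a small brute-force scan. The sketch below is not the suffix matrix method proposed in the paper; it is only a naive baseline that checks every aligned window of the log against the query. The similarity measure (Jaccard similarity over whitespace tokens), the `threshold` value, and the function names `jaccard` and `k_dissimilar_segments` are assumptions made for illustration and do not come from the abstract.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the whitespace-token sets of two messages.

    Assumption: this stands in for whatever message similarity is used;
    the abstract does not specify one.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def k_dissimilar_segments(events, query, k, threshold=0.5):
    """Return start indices of segments with at most k dissimilar events.

    A segment is the window events[start:start + len(query)]; its i-th
    event is compared with the i-th query event. Two events count as
    dissimilar when their Jaccard similarity is below `threshold`
    (a hypothetical cut-off chosen for this sketch).
    """
    m = len(query)
    hits = []
    for start in range(len(events) - m + 1):
        dissimilar = 0
        for offset in range(m):
            if jaccard(events[start + offset], query[offset]) < threshold:
                dissimilar += 1
                if dissimilar > k:
                    break  # this window already exceeds the budget
        if dissimilar <= k:
            hits.append(start)
    return hits


# Made-up log messages, for illustration only.
log = [
    "user alice logged in",
    "disk usage at 91 percent",
    "service db restarted",
    "user bob logged in",
    "disk usage at 95 percent",
    "network link down",
]
query = [
    "user carol logged in",
    "disk usage at 93 percent",
    "service db restarted",
]

print(k_dissimilar_segments(log, query, k=0))  # [0]
print(k_dissimilar_segments(log, query, k=1))  # [0, 3]
```

With k = 0 only the first window matches, since all three of its events are similar to the query. Raising k to 1 also admits the window starting at index 3, whose last event ("network link down") differs from the query. The scan costs time proportional to the log length times the query length, which is the kind of cost an index-based method aims to avoid.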