Sequential data is prevalent in many scientific and commercial applications, such as bioinformatics, system security, and networking. Similarity search has been widely studied for symbolic and time series data, in which each data object is a symbol or a numeric value. Textual event sequences are sequences of events in which each object is a message describing an event. For example, system logs are typical textual event sequences: each event is a textual message recording internal system operations, statuses, configuration modifications, or execution errors. Similar segments of an event sequence reveal similar system behaviors in the past, which helps system administrators diagnose system problems. Existing search indexes for textual data focus only on unordered data. Substring matching methods can efficiently find matching segments over a sequence; however, the elements of their sequences are single values rather than texts. In this paper, we propose a method, the suffix matrix, for efficiently searching for similar segments over textual event sequences. It integrates two disparate techniques: locality-sensitive hashing and suffix arrays. The method also supports k-dissimilar segment search, where a k-dissimilar segment is a segment that has at most k events dissimilar to the query sequence. By using the random sequence mask proposed in this paper, the method reaches all k-dissimilar segments with high probability without substantially increasing the search cost. Experiments on real system log data show that the proposed method outperforms alternative methods built on existing techniques.
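
To make the definition of a k-dissimilar segment concrete, the following is a minimal brute-force sketch, not the suffix matrix method itself. It rests on several assumptions that the abstract does not state: events are compared position by position with the query, two messages count as similar when the Jaccard similarity of their whitespace-separated tokens reaches a threshold, and the threshold value 0.4 is arbitrary. The names `jaccard` and `k_dissimilar_segments` are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two messages (assumed measure)."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def k_dissimilar_segments(events, query, k, threshold=0.4):
    """Return start offsets of segments with at most k dissimilar events.

    Brute force: slide a window of len(query) over events and count the
    positions whose messages fall below the (assumed) similarity threshold.
    """
    m = len(query)
    hits = []
    for start in range(len(events) - m + 1):
        mismatches = 0
        for offset in range(m):
            if jaccard(events[start + offset], query[offset]) < threshold:
                mismatches += 1
                if mismatches > k:
                    break  # this window already exceeds the budget of k
        if mismatches <= k:
            hits.append(start)
    return hits


# Made-up log messages, for illustration only.
log = [
    "user alice logged in from 10.0.0.1",
    "disk usage at 91 percent on /var",
    "service db restarted by admin",
    "user bob logged in from 10.0.0.2",
    "disk usage at 95 percent on /var",
    "service web restarted by admin",
]
query = [
    "user carol logged in from 10.0.0.3",
    "disk usage at 93 percent on /var",
    "connection refused by remote host",
]

print(k_dissimilar_segments(log, query, k=1))  # [0, 3]
print(k_dissimilar_segments(log, query, k=0))  # []
```

This scan compares every window against the whole query, so its cost grows with the product of the sequence length and the query length. That is the cost an index over the event sequence is meant to avoid.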