Squeezing long sequence data for efficient similarity search

Authors:
Guojie Song;Bin Cui;Baihua Zheng;Kunqing Xie;Dongqing Yang
Affiliations:
Key Laboratory of Machine Perception, Peking University, Ministry of Education, Beijing, China;School of Electronic Engineering and Computer Science, Peking University, Beijing, China;School of Information Systems, Singapore Management University, Singapore;Key Laboratory of Machine Perception, Peking University, Ministry of Education, Beijing, China;School of Electronic Engineering and Computer Science, Peking University, Beijing, China
Venue:
APWeb'08 Proceedings of the 10th Asia-Pacific web conference on Progress in WWW research and development
Year:
2008

Abstract

Similarity search over long sequence datasets is becoming increasingly popular in many emerging applications. In this paper, a novel index structure, the Sequence Embedding Multiset tree (SEM-tree), is proposed to speed up the search process over long sequences. The SEM-tree is a multilevel structure in which each level represents the sequence data as multisets at a different compression level; the length of the multisets increases towards the leaf level, which contains the original sequences. The multisets, obtained using sequence embedding algorithms, have the desirable property that they do not need to keep the character order of the sequence, which yields a shorter representation, yet they preserve the majority of the distance information between sequences. Each level of the tree serves to prune the search space more efficiently, because the predictability of the multisets allows the search to finish early and greatly reduces the computational cost. A comprehensive set of experiments is conducted to evaluate the performance of the SEM-tree, and the experimental results show that the proposed method is much more efficient than existing representative methods.
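
The pruning idea described in the abstract can be illustrated with a minimal single-level sketch. It is not the SEM-tree and does not use the paper's sequence embedding algorithm, which the abstract does not specify. As a stand-in, it assumes the well-known bag (character multiset) distance, which ignores character order and never exceeds the edit distance, so it can safely discard candidates before the expensive exact comparison. The function names `bag_distance`, `edit_distance`, and `range_search` are hypothetical.

```python
from collections import Counter


def bag_distance(a: str, b: str) -> int:
    """Order-insensitive lower bound on the edit distance.

    Assumption: a plain character multiset stands in for the paper's
    embedded multisets, which are not described in the abstract.
    """
    ca, cb = Counter(a), Counter(b)
    only_a = sum((ca - cb).values())  # characters surplus in a
    only_b = sum((cb - ca).values())  # characters surplus in b
    return max(only_a, only_b)


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def range_search(query: str, database: list[str], radius: int):
    """Filter-and-refine range query.

    Returns the matching sequences and the number of exact
    edit-distance computations that were actually performed.
    """
    matches, verified = [], 0
    for s in database:
        # Filter: if the lower bound already exceeds the radius,
        # the exact distance must exceed it too.
        if bag_distance(query, s) > radius:
            continue
        # Refine: only surviving candidates pay for the exact distance.
        verified += 1
        if edit_distance(query, s) <= radius:
            matches.append(s)
    return matches, verified


if __name__ == "__main__":
    db = ["ACGTACGT", "ACGTACGA", "TTTTTTTT", "GGGGCCCC"]
    print(range_search("ACGTACGT", db, radius=1))
```

In this toy database, two of the four sequences are rejected by the multiset filter alone, so only two exact comparisons are made. A multilevel structure such as the one the abstract describes would apply progressively longer multisets at deeper levels; this sketch shows only a single filtering step.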