G-hash: towards fast kernel-based similarity search in large graph databases

Authors:
Xiaohong Wang;Aaron Smalter;Jun Huan;Gerald H. Lushington
Affiliations:
University of Kansas, Lawrence, KS;University of Kansas, Lawrence, KS;University of Kansas, Lawrence, KS;University of Kansas, Lawrence, KS
Venue:
Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Year:
2009

Citing 15
Cited 6

Kernel principal component analysis

Advances in kernel methods
Algorithmics and applications of tree and graph searching

Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
The complexity of theorem-proving procedures

STOC '71 Proceedings of the third annual ACM symposium on Theory of computing
SVMTorch: support vector machines for large-scale regression problems

The Journal of Machine Learning Research
Graph indexing: a frequent structure-based approach

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Frequent Substructure-Based Approaches for Classifying Chemical Compounds

IEEE Transactions on Knowledge and Data Engineering
Optimal assignment kernels for attributed molecular graphs

ICML '05 Proceedings of the 22nd international conference on Machine learning
Weighted decomposition kernels

ICML '05 Proceedings of the 22nd international conference on Machine learning
Shortest-Path Kernels on Graphs

ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining
Closure-Tree: An Index Structure for Graph Queries

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
SAGA: a subgraph matching tool for biological graphs

Bioinformatics
Graph indexing: tree + delta

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
A maximum common substructure-based algorithm for searching and predicting drug-like compounds

Bioinformatics
gApprox: Mining Frequent Approximate Patterns from a Massive Network

ICDM '07 Proceedings of the 2007 Seventh IEEE International Conference on Data Mining
TALE: A Tool for Approximate Large Graph Matching

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering

SAPPER: subgraph indexing and approximate matching in large graphs

Proceedings of the VLDB Endowment
BR-index: an indexing structure for subgraph matching in very large dynamic graphs

SSDBM'11 Proceedings of the 23rd international conference on Scientific and statistical database management
Finding top-k similar graphs in graph databases

Proceedings of the 15th International Conference on Extending Database Technology
Fast top-k similarity queries via matrix compression

Proceedings of the 21st ACM international conference on Information and knowledge management
SWORD: scalable workload-aware data placement for transactional workloads

Proceedings of the 16th International Conference on Extending Database Technology
Facilitating representation and retrieval of structured cases: Principles and toolkit

Information Systems

Abstract

Structured data, including sets, sequences, trees, and graphs, pose significant challenges to fundamental aspects of data management such as efficient storage, indexing, and similarity search. With the fast accumulation of graph databases, similarity search in graph databases has emerged as an important research topic. Graph similarity search has applications in a wide range of domains, including cheminformatics, bioinformatics, sensor network management, social network management, and XML documents. Most current graph indexing methods focus on subgraph query processing, i.e., determining the set of database graphs that contain the query graph, and hence do not directly support similarity search. In data mining and machine learning, various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models for supervised learning, graph kernel functions (i) have high computational complexity and (ii) are difficult to index in a graph database. Our objective is to bridge graph kernel functions and similarity search in graph databases by proposing (i) a novel kernel-based similarity measurement and (ii) an efficient indexing structure for graph data management. Our similarity measurement builds upon local features extracted from each node and its neighboring nodes in a graph. A hash table is utilized to support efficient storage and fast search of the extracted local features. Using the hash table, a graph kernel function is defined to capture the intrinsic similarity of graphs and to support fast similarity query processing. We have implemented our method, which we have named G-hash, and have demonstrated its utility on large chemical graph databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest neighbor (k-NN) classification. Most importantly, the new similarity measurement and index structure are scalable to large databases, with smaller index size, faster index construction time, and faster query processing time than state-of-the-art indexing methods such as C-tree, gIndex, and GraphGrep.
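
The general idea described in the abstract (local node features stored in a hash table, with a kernel computed from shared entries) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The feature definition used here (a node's label together with the sorted labels of its neighbors), the count-product kernel, and the names `local_features`, `HashIndex`, `add`, and `knn` are all assumptions chosen for illustration; the abstract does not specify these details.

```python
from collections import Counter, defaultdict
from math import sqrt


def local_features(labels, edges):
    """Count one hashable local feature per node (illustrative assumption).

    labels: dict mapping node id -> node label
    edges:  iterable of undirected (u, v) pairs
    """
    neighbors = defaultdict(list)
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    features = Counter()
    for node, label in labels.items():
        # Assumed feature: the node label plus the sorted neighbor labels.
        key = (label, tuple(sorted(labels[n] for n in neighbors[node])))
        features[key] += 1
    return features


class HashIndex:
    """Hash table from feature key to {graph id: feature count}."""

    def __init__(self):
        self.table = defaultdict(dict)
        self.self_kernel = {}  # graph id -> k(g, g)

    def add(self, graph_id, labels, edges):
        features = local_features(labels, edges)
        for key, count in features.items():
            self.table[key][graph_id] = count
        self.self_kernel[graph_id] = sum(c * c for c in features.values())

    def knn(self, labels, edges, k):
        """Return the k indexed graphs closest to the query graph."""
        query = local_features(labels, edges)
        k_qq = sum(c * c for c in query.values())
        # Only graphs that share a hash key with the query are touched here.
        cross = defaultdict(int)
        for key, q_count in query.items():
            for graph_id, g_count in self.table.get(key, {}).items():
                cross[graph_id] += q_count * g_count

        def distance(graph_id):
            # Kernel-induced distance: sqrt(k(q,q) + k(g,g) - 2 k(q,g)).
            d2 = k_qq + self.self_kernel[graph_id] - 2 * cross[graph_id]
            return sqrt(max(d2, 0))

        ranked = sorted(self.self_kernel, key=distance)[:k]
        return [(graph_id, distance(graph_id)) for graph_id in ranked]


index = HashIndex()
index.add("g1", {0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])
index.add("g2", {0: "N", 1: "N"}, [(0, 1)])
nearest = index.knn({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)], k=1)
```

In this sketch a query only looks up its own feature keys, so the cost of computing kernel values depends on the number of matching hash entries rather than on pairwise graph comparison, which is the kind of saving the abstract attributes to the hash table.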