A method of improving the efficiency of mining sub-structures in molecular structure databases

Authors:
Haibo Li;Yuanzhen Wang;Kevin Lü
Affiliations:
Department of Computing Science, Huazhong University of Science and Technology, Wuhan, China;Department of Computing Science, Huazhong University of Science and Technology, Wuhan, China;BBS, Brunel University, Uxbridge, UK
Venue:
BNCOD'07 Proceedings of the 24th British national conference on Databases
Year:
2007

Abstract

One problem with current substructure mining algorithms is that, as molecular structure databases grow, the time and space costs rise to a level at which ordinary PCs are not powerful enough to perform substructure mining tasks. After examining a number of well-known molecular structure databases, we found that they contain a large number of common loop substructures, and that repeatedly mining these same substructures consumes significant system resources. In this paper, we introduce a new method that (1) treats these common loop substructures as a kind of "atom" structure, and (2) maintains the links between the new "atom" structures and the rest of each molecular structure, reorganizing the original molecular structures accordingly. We thereby avoid repeating many identical operations during the mining process and produce fewer redundant results. We tested the method on four real molecular structure databases: AID2DA'99/CA, AID2DA'99/CM, AID2DA'99 and NCI'99. The results indicate that (1) the reorganization improves the speed of substructure mining, and (2) mining yields fewer patterns, with less redundant information.
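
The reorganization step can be illustrated with a minimal sketch. It is not the authors' implementation: the graph encoding, the function name `collapse_ring`, the pseudo-atom label `RING6`, and the bond labels `"a"` and `"s"` are all assumptions made for illustration, and the ring to collapse is supplied by hand rather than detected. The sketch replaces the atoms of one loop with a single pseudo-atom and redirects every bond that left the loop so that it attaches to that pseudo-atom.

```python
from typing import Dict, Set, Tuple

Atoms = Dict[int, str]              # atom id -> element label
Bonds = Set[Tuple[int, int, str]]   # (atom id, atom id, bond label)


def collapse_ring(atoms: Atoms, bonds: Bonds, ring: Set[int],
                  ring_label: str) -> Tuple[Atoms, Bonds, int]:
    """Replace the atoms in `ring` with one pseudo-atom labelled `ring_label`.

    Hypothetical helper: returns the reduced atoms, the reduced bonds,
    and the id assigned to the new pseudo-atom.
    """
    new_id = max(atoms) + 1
    # Keep every atom outside the ring and add the single pseudo-atom.
    new_atoms = {i: lab for i, lab in atoms.items() if i not in ring}
    new_atoms[new_id] = ring_label

    new_bonds: Bonds = set()
    for u, v, lab in bonds:
        if u in ring and v in ring:
            continue  # bond internal to the ring: absorbed into the pseudo-atom
        # Redirect any endpoint that was inside the ring to the pseudo-atom.
        u2 = new_id if u in ring else u
        v2 = new_id if v in ring else v
        new_bonds.add((min(u2, v2), max(u2, v2), lab))
    return new_atoms, new_bonds, new_id


# Toy molecule: a six-carbon ring (atoms 0-5) with one extra carbon (atom 6).
atoms = {0: "C", 1: "C", 2: "C", 3: "C", 4: "C", 5: "C", 6: "C"}
bonds = {(0, 1, "a"), (1, 2, "a"), (2, 3, "a"), (3, 4, "a"),
         (4, 5, "a"), (0, 5, "a"), (0, 6, "s")}

reduced_atoms, reduced_bonds, ring_id = collapse_ring(
    atoms, bonds, {0, 1, 2, 3, 4, 5}, "RING6")
print(reduced_atoms)   # two nodes remain: atom 6 and the pseudo-atom
print(reduced_bonds)   # one bond remains, linking atom 6 to the pseudo-atom
```

In this toy case a graph of seven atoms and seven bonds shrinks to two nodes and one bond, which is the kind of reduction that would spare a miner from re-enumerating the ring's internal subgraphs. The sketch discards which ring atom each external bond was attached to; a fuller treatment would presumably record that attachment point so the original structure can be recovered.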