Combinatorial pattern discovery for scientific data: some preliminary results

  • Authors:
  • Jason Tsong-Li Wang;Gung-Wei Chirn;Thomas G. Marr;Bruce Shapiro;Dennis Shasha;Kaizhong Zhang

  • Affiliations:
  • Computer and Information Science, New Jersey Institute of Technology, Newark, NJ;Computer and Information Science, New Jersey Institute of Technology, Newark, NJ;Cold Spring Harbor Laboratory, 100 Bungtown Road, Cold Spring Harbor, NY;Image Processing Section, Laboratory of Mathematical Biology, Division of Cancer Biology and Diagnosis, National Cancer Institute, National Institutes of Health, Frederick, MD;Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York, NY;Department of Computer Science, The University of Western Ontario, London, Ontario, Canada N6A 5B7

  • Venue:
  • SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data
  • Year:
  • 1994

Abstract

Suppose you are given a set of natural entities (e.g., proteins, organisms, or weather patterns) that share some important, externally observable properties. You also have a structural description of the entities (e.g., sequence, topological, or geometrical data) and a distance metric. Combinatorial pattern discovery is the activity of finding patterns in the structural data that, under this metric, might explain the common properties.

This paper presents an example of combinatorial pattern discovery: the discovery of patterns in protein databases. The structural representations we consider are strings, and the distance metric is string edit distance permitting variable-length don't cares. Our techniques incorporate string matching algorithms and novel heuristics for discovery and optimization, most of which generalize to other combinatorial structures. Experimental results from applying the techniques to both generated data and functionally related protein families obtained from the Cold Spring Harbor Laboratory show the effectiveness of the proposed techniques. When the discovered patterns are applied to protein classification, they give information that is complementary to the best protein classifier available today.
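To make the distance metric concrete, the following is a minimal sketch of string edit distance with variable-length don't cares, written as a textbook dynamic program. It is not the paper's algorithm or its discovery heuristics. The function names `vldc_distance` and `count_matches`, the use of `*` as the don't-care symbol, unit costs for insertion, deletion, and substitution, and the threshold `k` are all assumptions made for illustration.

```python
from typing import Iterable


def vldc_distance(pattern: str, text: str, wildcard: str = "*") -> int:
    """Edit distance between pattern and text, where each wildcard in the
    pattern may absorb any substring of text (including the empty one) at
    zero cost. Unit costs are assumed for insert, delete, and substitute."""
    n = len(text)
    # Row 0: the empty pattern against text[:j] costs j insertions.
    prev = list(range(n + 1))
    for i in range(1, len(pattern) + 1):
        p = pattern[i - 1]
        is_wild = p == wildcard
        # Column 0: a wildcard matches the empty string for free;
        # an ordinary pattern symbol must be deleted at cost 1.
        curr = [prev[0] + (0 if is_wild else 1)] + [0] * n
        for j in range(1, n + 1):
            if is_wild:
                # Either the wildcard matches nothing (prev[j]) or it
                # absorbs one more text symbol (curr[j - 1]).
                curr[j] = min(prev[j], curr[j - 1])
            else:
                curr[j] = min(
                    prev[j - 1] + (p != text[j - 1]),  # match or substitute
                    prev[j] + 1,                       # delete pattern symbol
                    curr[j - 1] + 1,                   # insert text symbol
                )
        prev = curr
    return prev[n]


def count_matches(pattern: str, sequences: Iterable[str], k: int) -> int:
    """Hypothetical helper: number of sequences within distance k of pattern."""
    return sum(1 for s in sequences if vldc_distance(pattern, s) <= k)
```

Under these assumptions, `vldc_distance("*AB*", "XXABYY")` is 0 because the two don't cares absorb `XX` and `YY`, while `vldc_distance("AB", "XXABYY")` is 4 because every extra symbol must be paid for. A pattern shared by a family of sequences would then be one for which `count_matches` is high at a small `k`.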