Learning layouts of biological datasets semi-automatically

Authors:
Kaushik Sinha;Xuan Zhang;Ruoming Jin;Gagan Agrawal
Affiliations:
Department of Computer Science and Engineering, Ohio State University, Columbus, OH;Department of Computer Science and Engineering, Ohio State University, Columbus, OH;Department of Computer Science and Engineering, Ohio State University, Columbus, OH;Department of Computer Science and Engineering, Ohio State University, Columbus, OH
Venue:
DILS'05 Proceedings of the Second International Conference on Data Integration in the Life Sciences
Year:
2005

Abstract

A key challenge associated with the existing approaches for data integration and workflow creation for bioinformatics is the effort required to integrate a new data source. As new data sources emerge, and as the formats and contents of existing data sources evolve, wrapper programs need to be written or modified. This can be extremely time-consuming, tedious, and error-prone. This paper describes our semi-automatic approach for learning the layout of a flat-file bioinformatics dataset. Our approach involves three key steps. The first step is to use a number of heuristics to infer the delimiters used in the dataset. Specifically, we have developed a metric that uses information on the frequency and starting position of sequences. Based on this metric, we are able to find a superset of the delimiters, and we then seek user input to eliminate the incorrect ones. Our second step involves generating a layout descriptor based on the relative order in which the delimiters occur. Our final step is to generate a parser based on the layout descriptor. Our heuristics for finding the delimiters have been evaluated using three popular flat-file biological datasets.
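
The three steps can be illustrated with a minimal sketch. The abstract does not define the metric, the descriptor format, or the generated parser, so everything below is an assumption made for illustration: the score (line-start frequency multiplied by line-start consistency), the threshold min_score, the function names candidate_delimiters, layout_descriptor and make_parser, the convention that the last delimiter in the descriptor ends a record, and the toy SAMPLE data are all hypothetical and are not the authors' method.

```python
from collections import Counter

# Toy two-record flat file, invented for illustration only.
SAMPLE = [
    "ID   P12345",
    "AC   P12345;",
    "DE   Example protein",
    "SQ   SEQUENCE 10 AA",
    "     MKTAYIAKQR",
    "//",
    "ID   Q67890",
    "AC   Q67890;",
    "DE   Another protein",
    "SQ   SEQUENCE 8 AA",
    "     MKVLAAGI",
    "//",
]


def candidate_delimiters(lines, min_score=0.05):
    """Step 1 (hypothetical metric): score each token that starts a line by
    frequency (share of lines it starts) times position consistency
    (share of its occurrences that are at a line start)."""
    starts, anywhere = Counter(), Counter()
    non_empty = 0
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        non_empty += 1
        starts[tokens[0]] += 1
        for tok in set(tokens):
            anywhere[tok] += 1
    scores = {}
    for tok, count in starts.items():
        score = (count / non_empty) * (count / anywhere[tok])
        if score >= min_score:
            scores[tok] = score
    return scores  # deliberately a superset; a user prunes it afterwards


def layout_descriptor(lines, delimiters):
    """Step 2: list the accepted delimiters in order of first appearance."""
    order = []
    for line in lines:
        tokens = line.split()
        if tokens and tokens[0] in delimiters and tokens[0] not in order:
            order.append(tokens[0])
    return order


def make_parser(descriptor):
    """Step 3: build a parser from the descriptor. Assumes the last
    delimiter in the descriptor terminates a record."""
    *fields, terminator = descriptor

    def parse(lines):
        records, current, last = [], {}, None
        for line in lines:
            tokens = line.split(maxsplit=1)
            if not tokens:
                continue
            head = tokens[0]
            if head == terminator:
                records.append(current)
                current, last = {}, None
            elif head in fields:
                last = head
                value = tokens[1].strip() if len(tokens) > 1 else ""
                current.setdefault(head, []).append(value)
            elif last is not None:
                # A line without a delimiter continues the previous field.
                current[last].append(line.strip())
        return records

    return parse


if __name__ == "__main__":
    candidates = candidate_delimiters(SAMPLE)
    # Simulated user input: reject the candidates that are really data.
    rejected = {"MKTAYIAKQR", "MKVLAAGI"}
    accepted = set(candidates) - rejected
    descriptor = layout_descriptor(SAMPLE, accepted)
    for record in make_parser(descriptor)(SAMPLE):
        print(record)
```

In this sketch the low threshold deliberately lets the two sequence lines through as false candidates, which mirrors the superset-then-prune workflow described in the abstract: the heuristic over-generates, and the simulated user input removes the entries that are data rather than delimiters.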