A key challenge with existing approaches to data integration and workflow creation in bioinformatics is the effort required to integrate a new data source. As new data sources emerge, and as the formats and contents of existing data sources evolve, wrapper programs must be written or modified. This can be extremely time-consuming, tedious, and error-prone. This paper describes our semi-automatic approach for learning the layout of a flat-file bioinformatics dataset. Our approach involves three key steps. The first step uses a number of heuristics to infer the delimiters used in the dataset. Specifically, we have developed a metric that uses information on the frequency and starting position of sequences. Based on this metric, we find a superset of the delimiters and then seek user input to eliminate the incorrect ones. The second step generates a layout descriptor based on the relative order in which the delimiters occur. The final step generates a parser based on the layout descriptor. Our heuristics for finding the delimiters have been evaluated using three popular flat-file biological datasets.
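The three steps can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not define the metric, so the scoring here (how often a token starts a line, and what fraction of the lines containing it also start with it) is an assumed stand-in. The sample records, the function names, the thresholds, and the rule that the last delimiter in the layout terminates a record are all hypothetical choices made for illustration.

```python
from collections import Counter

# Made-up records in a generic tagged flat-file style (not real data).
SAMPLE = [
    "ID   P12345",
    "AC   Q9XYZ1",
    "DE   Some protein ID",
    "SQ   MKV",
    "//",
    "ID   P67890",
    "AC   Q8ABC2",
    "DE   Another protein",
    "SQ   GHT",
    "//",
]


def candidate_delimiters(lines, min_count=2, min_position=0.6):
    """Step 1 (assumed metric): keep tokens that often start a line."""
    containing = Counter()  # number of lines containing each token
    starting = Counter()    # number of lines starting with each token
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        containing.update(set(tokens))
        starting[tokens[0]] += 1
    keep = [
        tok for tok, starts in starting.items()
        if starts >= min_count and starts / containing[tok] >= min_position
    ]
    # Most frequent line starters first; ties keep first-seen order.
    return sorted(keep, key=lambda tok: -starting[tok])


def layout_descriptor(lines, delimiters):
    """Step 2: record the relative order in which delimiters first occur."""
    order = []
    for line in lines:
        tokens = line.split()
        if tokens and tokens[0] in delimiters and tokens[0] not in order:
            order.append(tokens[0])
    return order


def parse(lines, layout):
    """Step 3: parse records, assuming the last delimiter ends a record."""
    terminator = layout[-1]
    records, current = [], {}
    for line in lines:
        parts = line.split(None, 1)
        if not parts or parts[0] not in layout:
            continue
        if parts[0] == terminator:
            records.append(current)
            current = {}
        else:
            current[parts[0]] = parts[1] if len(parts) > 1 else ""
    return records


candidates = candidate_delimiters(SAMPLE)
# In the semi-automatic setting, a user would prune `candidates` here.
layout = layout_descriptor(SAMPLE, candidates)
records = parse(SAMPLE, layout)
```

In this sketch the candidate list stands in for the superset of delimiters, and the point between `candidate_delimiters` and `layout_descriptor` is where user input would remove incorrect entries before the layout is fixed.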