A relational model of data for large shared data banks
Communications of the ACM
PH2: an hadoop-based framework for mining structural properties from the PDB database
SAICSIT '10 Proceedings of the 2010 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists
The Protein Data Bank (PDB) was established in 1971 as a repository for macromolecular crystal structure data. Recent development of high-throughput structural genomics technologies has produced massive quantities of data, and the amount of macromolecular structure data is increasing exponentially. The original format for these files was designed to be human-readable rather than machine-readable, and limited attention was paid to standard vocabularies and data formats. As a result, it can be difficult to access these data efficiently for calculations. This paper discusses the creation of PDB-SQL, a model database originally designed to store the alpha carbon coordinates, along with other types of information, of all protein structures in the PDB. We describe the architecture of this database and present timing data for populating it with all structures currently in the PDB. Comparisons of storage requirements and of the time required to perform computational tasks are presented. Finally, we describe future development that would allow all macromolecular structure data to be stored in this database.
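To illustrate the general idea of moving alpha carbon coordinates out of the fixed-column PDB text format and into a relational table, here is a minimal sketch in Python using SQLite. It is not the PDB-SQL schema or loader, which the abstract does not specify: the table name `ca_atom`, its columns, and the functions `create_schema`, `parse_ca_atoms`, and `load_structure` are all assumptions made for illustration. Only the column positions of the ATOM record come from the published PDB file format.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical table: one row per alpha carbon. Not the PDB-SQL schema.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS ca_atom (
               pdb_id   TEXT    NOT NULL,
               chain    TEXT    NOT NULL,
               res_seq  INTEGER NOT NULL,
               icode    TEXT    NOT NULL,
               res_name TEXT    NOT NULL,
               x REAL, y REAL, z REAL,
               PRIMARY KEY (pdb_id, chain, res_seq, icode)
           )"""
    )


def parse_ca_atoms(pdb_id, lines):
    """Yield one tuple per alpha carbon found in ATOM records.

    Slices follow the fixed columns of the PDB ATOM record
    (atom name in columns 13-16, coordinates in columns 31-54).
    """
    for line in lines:
        if line.startswith("ENDMDL"):
            break  # assumption: keep only the first model of multi-model files
        if not line.startswith("ATOM  "):
            continue  # skips HETATM, so calcium ions named CA are ignored
        if line[12:16].strip() != "CA":
            continue
        if line[16] not in (" ", "A"):
            continue  # assumption: keep only the first alternate location
        yield (
            pdb_id,
            line[21],                 # chain identifier
            int(line[22:26]),         # residue sequence number
            line[26].strip(),         # insertion code, often blank
            line[17:20].strip(),      # residue name
            float(line[30:38]),       # x
            float(line[38:46]),       # y
            float(line[46:54]),       # z
        )


def load_structure(conn, pdb_id, lines):
    """Insert all alpha carbons of one structure; return the row count."""
    rows = list(parse_ca_atoms(pdb_id, lines))
    conn.executemany("INSERT INTO ca_atom VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Once rows are loaded this way, a calculation can select only the coordinates it needs, for example `SELECT x, y, z FROM ca_atom WHERE pdb_id = ? AND chain = ? ORDER BY res_seq`, instead of re-parsing the text file each time.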