PH2: an Hadoop-based framework for mining structural properties from the PDB database

Authors:
Scott Hazelhurst
Affiliations:
University of the Witwatersrand, Johannesburg, Wits, South Africa
Venue:
SAICSIT '10 Proceedings of the 2010 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists
Year:
2010

Abstract

PH2 is a Hadoop- and SQL-based tool for quickly extracting information from the Protein Data Bank (PDB). The PDB is stored in replicated form as a set of Hadoop sequence files on the Hadoop Distributed File System. A user poses queries about 3D structures (and other properties) in SQL, and PH2 runs these queries in a highly parallel manner using the Hadoop framework. The PDB is an important source of information about structural and other properties of proteins, and it currently contains about 65,000 protein structures. Determining which proteins have particular shapes is an important bioinformatics application. PH2 parses each PDB file, creates an SQL database for it, and then performs the appropriate queries. Experiments performed on a small local cluster and a large shared cluster show that the application is highly scalable. On the large cluster, a complex real query takes less than 4 minutes to search the whole of the PDB.
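The per-file step described in the abstract (parse one PDB entry, build an SQL database for it, run the user's query) can be sketched as follows. This is a minimal illustration and not PH2's actual code: it uses Python and an in-memory SQLite database for brevity, it reads only the fixed-column ATOM records of the PDB file format, and the table schema and the names load_structure and map_structure are hypothetical. In a Hadoop job, a function like map_structure would play the role of the map task applied to each record read from a sequence file.

```python
import sqlite3


def load_structure(pdb_text):
    """Parse ATOM records from one PDB entry into an in-memory SQLite database.

    The 'atom' table and its columns are a hypothetical schema chosen for
    this sketch; they are not taken from PH2.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE atom ("
        "serial INTEGER, name TEXT, res_name TEXT, chain TEXT, "
        "res_seq INTEGER, x REAL, y REAL, z REAL)"
    )
    rows = []
    for line in pdb_text.splitlines():
        # ATOM records are fixed-width; coordinates end at column 54.
        if line.startswith("ATOM  ") and len(line) >= 54:
            rows.append((
                int(line[6:11]),       # atom serial number
                line[12:16].strip(),   # atom name, e.g. CA
                line[17:20].strip(),   # residue name, e.g. GLY
                line[21],              # chain identifier
                int(line[22:26]),      # residue sequence number
                float(line[30:38]),    # x coordinate
                float(line[38:46]),    # y coordinate
                float(line[46:54]),    # z coordinate
            ))
    conn.executemany("INSERT INTO atom VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


def map_structure(pdb_id, pdb_text, query):
    """Run one SQL query against one PDB entry and tag each row with its ID.

    This stands in for a map task: the input is a (key, value) pair of
    entry ID and file contents, and the output is a list of (ID, row) pairs.
    """
    conn = load_structure(pdb_text)
    try:
        return [(pdb_id, row) for row in conn.execute(query)]
    finally:
        conn.close()
```

Because each entry gets its own small database, the entries can be processed independently, which is what lets the work be spread across many map tasks. A query such as SELECT chain, COUNT(*) FROM atom WHERE name = 'CA' GROUP BY chain would, under this assumed schema, return the number of alpha-carbon atoms per chain for each entry.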