PH2 is a Hadoop- and SQL-based tool for quickly extracting information from the Protein Data Bank (PDB). The PDB is stored as a set of replicated Hadoop sequence files on the Hadoop Distributed File System. PH2 lets a user write queries about 3D structures (and other properties) in SQL and runs them in a highly parallel manner using the Hadoop framework. The PDB is an important source of information about structural and other properties of proteins, and it currently contains about 65,000 protein structures. Determining which proteins have particular shapes is an important bioinformatics application. PH2 parses each PDB file, creates a SQL database for it, and then runs the appropriate queries against that database. Experiments on a small local cluster and a large shared cluster show that the application is highly scalable. On the large cluster, a complex real-world query takes less than 4 minutes to search the whole of the PDB.
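The per-file step described above (parse a PDB file, load it into a SQL database, run the query) can be illustrated with a minimal Python sketch. It is not PH2's actual implementation. The table schema, the function names `parse_atoms`, `query_structure` and `search`, and the use of an in-memory SQLite database are all assumptions made for illustration, and a plain loop stands in for the Hadoop map phase. Only the fixed-width column positions of the ATOM record come from the PDB file format itself.

```python
import sqlite3

def parse_atoms(pdb_text):
    """Yield one tuple per ATOM record, read from the fixed-width PDB columns."""
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue  # skip HETATM, REMARK, and all other record types
        yield (
            int(line[6:11]),       # atom serial number
            line[12:16].strip(),   # atom name, e.g. "CA"
            line[17:20].strip(),   # residue name, e.g. "ALA"
            line[21],              # chain identifier
            int(line[22:26]),      # residue sequence number
            float(line[30:38]),    # x coordinate (angstroms)
            float(line[38:46]),    # y coordinate
            float(line[46:54]),    # z coordinate
        )

def query_structure(pdb_text, sql):
    """Load one structure into a throwaway SQLite database and run `sql` on it.

    The `atom` table layout is a hypothetical schema, not the one PH2 uses.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE atom (serial INTEGER, name TEXT, residue TEXT, "
            "chain TEXT, res_seq INTEGER, x REAL, y REAL, z REAL)"
        )
        conn.executemany(
            "INSERT INTO atom VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            parse_atoms(pdb_text),
        )
        return conn.execute(sql).fetchall()
    finally:
        conn.close()

def search(structures, sql):
    """Stand-in for the map phase: return ids of structures where `sql` yields rows.

    `structures` is an iterable of (pdb_id, pdb_text) pairs; in a Hadoop
    setting each pair would be one record of a sequence file.
    """
    return [pdb_id for pdb_id, text in structures if query_structure(text, sql)]
```

Because each structure gets its own independent database, the work for different PDB entries shares no state, which is what makes this kind of search easy to spread across many workers. For example, `search(structures, "SELECT 1 FROM atom WHERE residue = 'GLY' LIMIT 1")` would return the ids of all structures containing at least one glycine atom.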