A grid infrastructure for text mining of full text articles and creation of a knowledge base of gene relations

Authors:
Jeyakumar Natarajan;Niranjan Mulay;Catherine DeSesa;Catherine J. Hack;Werner Dubitzky;Eric G. Bremer
Affiliations:
Bioinformatics Research Group, University of Ulster, UK;United Devices Inc, Austin, TX;Brain Tumor Research Program, Children's Memorial Hospital, Feinberg School of Medicine, Northwestern University, Chicago, IL;Bioinformatics Research Group, University of Ulster, UK;Bioinformatics Research Group, University of Ulster, UK;Brain Tumor Research Program, Children's Memorial Hospital, Feinberg School of Medicine, Northwestern University, Chicago, IL
Venue:
ISBMDA'05 Proceedings of the 6th International Conference on Biological and Medical Data Analysis
Year:
2005

Abstract

We demonstrate the application of a grid infrastructure for conducting text mining over distributed data and computational resources. The approach is based on using LexiQuest Mine, a text-mining workbench, in a grid computing environment. We describe our architecture and approach and provide an illustrative example of mining full-text journal articles to create a knowledge base of gene relations. The number of patterns found increased from 0.74 per full-text article for a corpus of 1000 articles to 0.83 per article when the corpus contained 5000 articles. However, mining a corpus of 5000 full-text articles took 26 hours on a single computer, whilst the process was completed in less than 2.5 hours on a grid comprising 20 computers. Thus, whilst increasing the size of the corpus improved the efficiency of the text-mining process, a grid infrastructure was required to complete the task in a timely manner.
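
The figures quoted in the abstract can be turned into a rough speedup estimate. The sketch below is illustrative only and is not taken from the paper: the function and variable names are hypothetical, the grid time is treated as an upper bound because the abstract says "less than 2.5 hours", and the total pattern counts are simply the per-article rates multiplied by the corpus sizes.

```python
# Illustrative arithmetic on the numbers quoted in the abstract.
# All names here are hypothetical; only the numeric inputs come from the text.

def speedup(serial_hours: float, parallel_hours: float) -> float:
    """Ratio of single-machine run time to grid run time."""
    return serial_hours / parallel_hours


def parallel_efficiency(serial_hours: float, parallel_hours: float, nodes: int) -> float:
    """Speedup divided by the number of nodes (1.0 would be ideal scaling)."""
    return speedup(serial_hours, parallel_hours) / nodes


SERIAL_HOURS = 26.0            # 5000 articles on a single computer
GRID_HOURS_UPPER_BOUND = 2.5   # "less than 2.5 hours", so this is an upper bound
GRID_NODES = 20                # computers in the grid

# Because the grid time is an upper bound, these are lower bounds.
min_speedup = speedup(SERIAL_HOURS, GRID_HOURS_UPPER_BOUND)
min_efficiency = parallel_efficiency(SERIAL_HOURS, GRID_HOURS_UPPER_BOUND, GRID_NODES)

# Approximate total pattern counts implied by the per-article rates.
patterns_1000 = 0.74 * 1000
patterns_5000 = 0.83 * 5000

if __name__ == "__main__":
    print(f"speedup of at least {min_speedup:.1f}x on {GRID_NODES} nodes")
    print(f"parallel efficiency of at least {min_efficiency:.0%}")
    print(f"about {patterns_1000:.0f} patterns from 1000 articles")
    print(f"about {patterns_5000:.0f} patterns from 5000 articles")
```

Under these assumptions the grid run is at least about ten times faster than the single machine, which is roughly half of the ideal 20-fold speedup for 20 computers. The abstract does not say where the remaining time goes.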