A grid infrastructure for text mining of full text articles and creation of a knowledge base of gene relations

  • Authors:
  • Jeyakumar Natarajan;Niranjan Mulay;Catherine DeSesa;Catherine J. Hack;Werner Dubitzky;Eric G. Bremer

  • Affiliations:
  • Bioinformatics Research Group, University of Ulster, UK;United Devices Inc, Austin, TX;Brain Tumor Research Program, Children's Memorial Hospital, Feinberg School of Medicine, Northwestern University, Chicago, IL;Bioinformatics Research Group, University of Ulster, UK;Bioinformatics Research Group, University of Ulster, UK;Brain Tumor Research Program, Children's Memorial Hospital, Feinberg School of Medicine, Northwestern University, Chicago, IL

  • Venue:
  • ISBMDA'05 Proceedings of the 6th International conference on Biological and Medical Data Analysis
  • Year:
  • 2005

Abstract

We demonstrate the application of a grid infrastructure for conducting text mining over distributed data and computational resources. The approach is based on using LexiQuest Mine, a text mining workbench, in a grid computing environment. We describe our architecture and approach and provide an illustrative example of mining full-text journal articles to create a knowledge base of gene relations. The number of patterns found per full-text article increased from 0.74 for a corpus of 1000 articles to 0.83 for a corpus of 5000 articles. However, mining the corpus of 5000 full-text articles took 26 hours on a single computer, whereas the process completed in less than 2.5 hours on a grid comprising 20 computers. Thus, whilst increasing the size of the corpus improved the efficiency of the text-mining process, a grid infrastructure was required to complete the task in a timely manner.
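The figures quoted in the abstract can be turned into a rough speedup estimate. The sketch below is illustrative only: the function and constant names are hypothetical, and because the abstract gives the grid runtime as "less than 2.5 hours", the resulting speedup and parallel efficiency are lower bounds rather than measured values. The total pattern counts are likewise derived from the rounded per-article rates, not reported in the source.

```python
# Back-of-the-envelope arithmetic on the numbers quoted in the abstract.
# All names here are hypothetical; only the numeric constants come from the text.

SERIAL_HOURS = 26.0        # 5000 articles on a single computer
GRID_HOURS_UPPER = 2.5     # upper bound on grid runtime ("less than 2.5 hours")
GRID_NODES = 20            # computers in the grid


def speedup(serial_hours: float, parallel_hours: float) -> float:
    """Ratio of single-machine runtime to parallel runtime."""
    return serial_hours / parallel_hours


def parallel_efficiency(serial_hours: float, parallel_hours: float, nodes: int) -> float:
    """Speedup divided by the number of nodes (1.0 would be ideal linear scaling)."""
    return speedup(serial_hours, parallel_hours) / nodes


def approx_total_patterns(patterns_per_article: float, articles: int) -> float:
    """Approximate pattern count implied by a rounded per-article rate."""
    return patterns_per_article * articles


# Lower bounds, since the true grid runtime was below 2.5 hours.
min_speedup = speedup(SERIAL_HOURS, GRID_HOURS_UPPER)
min_efficiency = parallel_efficiency(SERIAL_HOURS, GRID_HOURS_UPPER, GRID_NODES)

# Approximate totals implied by the reported per-article rates.
patterns_1000 = approx_total_patterns(0.74, 1000)
patterns_5000 = approx_total_patterns(0.83, 5000)

if __name__ == "__main__":
    print(f"speedup of at least {min_speedup:.1f}x on {GRID_NODES} nodes")
    print(f"parallel efficiency of at least {min_efficiency:.0%}")
    print(f"roughly {patterns_1000:.0f} patterns from 1000 articles")
    print(f"roughly {patterns_5000:.0f} patterns from 5000 articles")
```

Under these assumptions, the grid run is at least about ten times faster than the single machine, which corresponds to a parallel efficiency of just over one half on 20 nodes.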