Grid-based Indexing of a Newswire Corpus

Authors: Baden Hughes; Srikumar Venugopal; Rajkumar Buyya
Affiliations: The University of Melbourne, Australia; The University of Melbourne, Australia; The University of Melbourne, Australia
Venue: GRID '04 Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing
Year: 2004



Abstract

In this paper we report our experience using computational grids for natural language processing, specifically for information extraction, to create query indices for information retrieval tasks. Given the prevalence of large corpora in natural language processing, computational grids offer significant utility to researchers who are reaching the limits of their available computational capacity. We leverage the affinity between the segmented data sources prevalent in natural language processing and the parallelisation model of grid computing. The experiment reported here is a large-scale newswire corpus indexing task, with the goal of efficiently creating a queryable index of the entire corpus. By parallelising the indexing task and executing it on an Australian computational grid, we observe a 2.26x speedup over the same experiment on a single computational node. In addition to reporting the raw performance impact, we reflect on a number of interesting points discovered during the execution of the experiments and propose a number of new requirements for grid middleware.
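
The segment-wise parallelisation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy corpus, and the use of a local thread pool as a stand-in for grid nodes are all assumptions made for illustration. It shows each corpus segment being indexed independently, the partial indices being merged into one queryable index, and speedup computed in the conventional way as serial time divided by parallel time.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def index_segment(segment):
    """Build a partial inverted index (term -> set of doc ids) for one segment."""
    partial = defaultdict(set)
    for doc_id, text in segment:
        for term in text.lower().split():
            partial[term].add(doc_id)
    return partial


def merge_indices(partials):
    """Merge per-segment indices into a single queryable index."""
    merged = defaultdict(set)
    for partial in partials:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    return merged


def build_index(segments, workers=4):
    """Index each segment independently, then merge the results.

    The thread pool is only a local stand-in for grid nodes (an assumption);
    a real grid run would dispatch segments through grid middleware.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(index_segment, segments))
    return merge_indices(partials)


def speedup(serial_seconds, parallel_seconds):
    """Conventional speedup: serial run time divided by parallel run time."""
    return serial_seconds / parallel_seconds


# Hypothetical toy corpus, split into two segments of (doc_id, text) pairs.
segments = [
    [("d1", "grid computing for language processing"),
     ("d2", "newswire corpus indexing")],
    [("d3", "indexing on a computational grid")],
]
index = build_index(segments)
```

Under this definition, a speedup of 2.26x means the parallel run took roughly 44% of the single-node time (1 / 2.26).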