NILE: wide-area computing for high energy physics

  • Authors:
  • Keith Marzullo (University of California at San Diego, La Jolla, CA)
  • Michael Ogg (University of Texas at Austin, Austin, TX)
  • Aleta Ricciardi (University of Texas at Austin, Austin, TX)
  • Alessandro Amoroso (Università di Bologna, Bologna, Italy)
  • F. Andrew Calkins (University of California at San Diego, La Jolla, CA)
  • Eric Rothfus (University of Texas at Austin, Austin, TX)

  • Venue:
  • EW 7: Proceedings of the 7th ACM SIGOPS European Workshop: Systems Support for Worldwide Applications
  • Year:
  • 1996

Abstract

The CLEO project [2], centered at Cornell University, is a large-scale high energy physics project. The goals of the project arise from an esoteric question (why is there apparently so little antimatter in the universe?), and the computational problems that arise in trying to answer this question are quite challenging.

To answer this question, the CESR storage ring at Cornell is used to generate a beam of electrons directed at an equally strong beam of positrons. These two beams meet inside a detector that is embedded in a magnetic field and is equipped with sensors. The collisions of electrons and positrons generate several secondary subatomic particles. Each collision is called an event and is sensed by detecting charged particles (via the ionization they produce in a drift chamber) and neutral particles (in the case of photons, via their deposition of energy in a crystal calorimeter), as well as by other specialized detector elements. Most events are ignored, but some are recorded in what is called raw data (typically 8 Kbytes per event). Offline, a second program called pass2 computes, for each event, the physical properties of the particles, such as their momenta, masses, and charges. This compute-bound program produces a new set of records describing the events (now typically 20 Kbytes per event). Finally, a third program reads these events and produces a lossily-compressed version of only certain frequently-accessed fields, written in what is called roar format (typically 2 Kbytes per event).

The physicists analyze this data with programs that are, for the most part, embarrassingly parallel and I/O limited. Such programs typically compute a result based on a projection of a selection of a large number of events, where the result is insensitive to the order in which the events are processed. For example, a program may construct histograms, compute statistics, or cull the raw data for physical inspection. The projection is either the complete pass2 record or (much more often) the smaller roar record, and the selection is done in an ad-hoc manner by the program itself.

Other programs are run as well. For example, a Monte Carlo simulation of the experiment (called montecarlo) is run in order to correct the data for detector acceptance and inefficiencies, as well as to test aspects of the model used to interpret the data. This program is compute-bound. Another important example is called recompress. Roughly every two years, improvements in detector calibration and reconstruction algorithms make it worthwhile to recompute more accurate pass2 data (and hence, more accurate roar data) from all of the raw data. This program is compute-bound (it currently requires 24 200-MIP workstations running flat out for three months) and so must be carefully worked into the schedule so that it does not seriously impact ongoing operations.

Making this more concrete, the current experiment generates approximately 1 terabyte of event data a year. Only recent roar data can be kept on disk; all other data must reside on tape. The data processing demands consume approximately 12,000 SPECint92 cycles a year. Improvements in the performance of CESR and the sensitivity of the detector will cause both of these values to go up by a factor of ten in the next few years, which will correspondingly increase the storage and computational needs by a factor of ten.

The CLEO project prides itself on being able to do big science on a tight budget, and so the programming environment that the CLEO project provides for researchers is innovative but somewhat primitive.
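
To make the selection/projection pattern above concrete, here is a minimal sketch in Python (not the language of the actual CLEO analysis codes; the event fields, the cut, and the bin width are invented purely for illustration):

    # Hypothetical sketch of the analysis pattern: select events, project
    # a field from the (roar-like) record, and accumulate a histogram.
    def select(event):
        # Selection: an ad-hoc cut applied by the program itself,
        # e.g. keep only events with a charged track.
        return event["charge"] != 0

    def project(event):
        # Projection: read only the frequently-accessed field needed,
        # typically from the small roar record rather than pass2.
        return event["mass"]

    def histogram(events, bin_width=0.01):
        # The accumulation is commutative, so the result is insensitive
        # to event order: the property that makes these jobs
        # embarrassingly parallel.
        bins = {}
        for event in events:
            if select(event):
                b = int(project(event) / bin_width)
                bins[b] = bins.get(b, 0) + 1
        return bins

Because the accumulation commutes, partial histograms computed over disjoint subsets of the events can simply be summed, which is what makes the parallelization sketched at the end of this section possible.
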
Jobs that access the entire data set can take days to complete. To circumvent limited access to tape, the network, or compute resources close to the central disk, physicists often do preliminary selections and projections (called skims) to create private disk data sets of events for further local analysis. Limited resources usually exact a high human price for resource and job management and, ironically, can sometimes lead to inefficiencies. Given the increase in data storage, data retrieval, and computational needs, it has become clear that the CLEO physicists require a better distributed environment in which to do their work.

Hence, an NSF-funded National Challenge project was started, with participants from high energy physics, distributed computing, and data storage, in order to provide a better environment for the CLEO experiment. The goals of this project, called NILE [7], are:

  • to build a scalable environment for storing and processing High Energy Physics data from the CLEO experiment. The environment must scale to allow 100 terabytes or more of data to be addressable, and to be able to use several hundreds of geographically dispersed processors;
  • to radically decrease the processing time of computations through parallelism (sketched below);
  • to be practicable. NILE, albeit in a limited form, should be deployed very soon, and evolve to its full form by the end of the project in June 1999.

Finally, the CLEO necessity of building on a budget carries over to NILE. There are some more expensive resources, such as ATM switches and tape silos, that it will be necessary to use. However, we are using commodity equipment and free or inexpensive software wherever possible. For example, one of our principal development platforms is Pentium-based PCs, interconnected with 100 Mbps Ethernet, running Linux and the GNU suite of tools.
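
The parallelism goal exploits the order-insensitivity noted earlier: the event stream can be split into disjoint chunks, each chunk processed independently, and the partial results merged. NILE aims to do this across hundreds of dispersed processors; what follows is only a minimal single-machine sketch in Python, with synthetic events and hypothetical names throughout, not NILE's actual distribution or scheduling machinery:

    # Hypothetical single-machine sketch of the embarrassingly parallel
    # pattern NILE aims to scale out. All names and parameters invented.
    from collections import Counter
    from multiprocessing import Pool
    import random

    def histogram(events, bin_width=0.01):
        # Same commutative selection/projection/fill loop as before.
        bins = Counter()
        for event in events:
            if event["charge"] != 0:                       # selection
                bins[int(event["mass"] / bin_width)] += 1  # projection + fill
        return bins

    def make_synthetic_events(n, seed):
        # Stand-in for reading a chunk of roar records from disk or tape.
        rng = random.Random(seed)
        return [{"charge": rng.choice([-1, 0, 1]),
                 "mass": rng.uniform(0.0, 2.0)} for _ in range(n)]

    if __name__ == "__main__":
        # One disjoint chunk of the event stream per worker.
        chunks = [make_synthetic_events(10000, seed) for seed in range(8)]
        with Pool(processes=8) as pool:
            partials = pool.map(histogram, chunks)
        # Order-insensitivity means the partial histograms simply sum.
        total = sum(partials, Counter())
        print(sum(total.values()), "selected events binned")

Scaling this pattern from one machine to geographically dispersed processors, and from synthetic events in memory to terabytes on tape, is precisely where the scheduling, data-location, and fault-tolerance problems that NILE addresses arise.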