Data management in distributed stream processing systems

  • Authors:
  • Beth Plale; Nithya Nirmal Vijayakumar

  • Affiliations:
  • Indiana University; Indiana University

  • Venue:
  • Data management in distributed stream processing systems
  • Year:
  • 2007

Abstract

Dynamic data-driven applications need to ingest and react to large amounts of information about their environment. In response, the scientific community is adopting on-the-fly data stream processing to avoid the long wait times incurred when data are temporarily stored on disk or in a database while passing through a reduction/analysis pipeline. Stream processing systems must be highly efficient and scalable in processing data that vary in size, metadata, information content and importance. The dynamic nature of data streams introduces significant and interesting challenges for stream provenance, asynchronous stream joins and missing stream data. This dissertation addresses these challenges. The proposed solutions are implemented and verified in the Calder stream processing system, a continuous query grid service that enables application web services to submit long-running, continuously executing queries on data streams. Stream provenance is addressed by an information model and a collection model, which together enable recording of system activities with minimal increase in real-time processing latency. This approach is validated by experimentally quantifying the perturbation overhead of provenance collection and the scalability of the prototype provenance service implemented in Calder. The challenge of memory conservation in asynchronous stream joins is addressed using a rate-sizing algorithm that sets the join-window size in sliding window joins based on stream rates. A performance study of the rate-sizing algorithm using a realistic workload argues for the use of time-based join window sizes, and for dynamic adaptation of join window sizes in response to stream rates. The problem of temporary gaps in stream data is addressed by a data estimation approach using Kalman filters. Experimental results show that the Kalman filter approach enables real-time, one-pass prediction of incoming events with good accuracy.
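
To make the gap-estimation idea concrete, the following is a minimal sketch of one-pass prediction with a scalar Kalman filter. It is not the dissertation's model: the random-walk state model, the noise variances `q` and `r`, and the names `ScalarKalman` and `fill_gaps` are all assumptions chosen for illustration, and a missing event is represented here by `None`.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class ScalarKalman:
    """One-dimensional Kalman filter with a random-walk state model (assumed)."""
    x: float          # current state estimate
    p: float = 1.0    # variance of the estimate
    q: float = 1e-3   # process noise variance (hypothetical value)
    r: float = 0.1    # measurement noise variance (hypothetical value)

    def predict(self) -> float:
        # Random walk: the state is expected to stay put, but uncertainty grows.
        self.p += self.q
        return self.x

    def update(self, z: float) -> None:
        # Blend the prediction with the measurement, weighted by the Kalman gain.
        k = self.p / (self.p + self.r)
        self.x += k * (z - self.x)
        self.p *= (1.0 - k)


def fill_gaps(events: Iterable[Optional[float]], kf: ScalarKalman) -> List[float]:
    """Single pass over the stream; None marks a missing event."""
    out: List[float] = []
    for z in events:
        estimate = kf.predict()
        if z is None:
            out.append(estimate)   # gap: emit the filter's prediction
        else:
            kf.update(z)           # observed: refine the filter state
            out.append(z)          # and pass the real value through
    return out
```

Calling `fill_gaps([1.0, None], ScalarKalman(x=0.0))` would emit the observed value followed by an estimate lying between the prior and the observation. Each event is touched once and the filter keeps only two numbers of state, which is what makes this style of estimator a plausible fit for real-time streams.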