Hancock: A language for analyzing transactional data streams

  • Authors:
  • Corinna Cortes;Kathleen Fisher;Daryl Pregibon;Anne Rogers;Frederick Smith

  • Affiliations:
  • AT&T Labs, New York, NY;AT&T Labs, NJ;AT&T Labs, New York, NY;AT&T Labs, Chicago, IL;AT&T Labs, Natick, MA

  • Venue:
  • ACM Transactions on Programming Languages and Systems (TOPLAS)
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

Massive transaction streams present a number of opportunities for data mining techniques. The transactions in such streams might represent calls on a telephone network, commercial credit card purchases, stock market trades, or HTTP requests to a web server. While historically such data have been collected for billing or security purposes, they are now being used to discover how the transactors, for example, credit-card numbers or IP addresses, use the associated services.Over the past 5 years, we have computed evolving profiles (called signatures) of transactors in several very large data streams. The signature for each transactor captures the salient features of his or her behavior through time. Programs for processing signatures must be highly optimized because of the size of the data stream (several gigabytes per day) and the number of signatures to maintain (hundreds of millions). Originally, we wrote such programs directly in C, but because these programs often sacrificed readability for performance, they were difficult to verify and maintain.Hancock is a domain-specific language we created to express computationally efficient signature programs cleanly. In this paper, we describe the obstacles to computing signatures from massive streams and explain how Hancock addresses these problems. For expository purposes, we present Hancock using a running example from the telecommunications industry; however, the language itself is general and applies equally well to other data sources.