Resource-adaptive real-time new event detection

  • Authors:
  • Gang Luo;Chunqiang Tang;Philip S. Yu

  • Affiliations:
  • IBM T.J. Watson Research Center, Hawthorne, NY;IBM T.J. Watson Research Center, Hawthorne, NY;IBM T.J. Watson Research Center, Hawthorne, NY

  • Venue:
  • Proceedings of the 2007 ACM SIGMOD international conference on Management of data
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

In a document streaming environment, online detection of the first documents that mention previously unseen events is an open challenge. For this online new event detection (ONED) task, existing studies usually assume that enough resources are always available and focus entirely on detection accuracy without considering efficiency. Moreover, none of the existing work addresses the issue of providing an effective and friendly user interface. As a result, there is a significant gap between the existing systems and a system that can be used in practice. In this paper, we propose an ONED framework with the following prominent features. First, a combination of indexing and compression methods is used to improve the document processing rate by orders of magnitude without sacrificing much detection accuracy. Second, when resources are tight, a resource-adaptive computation method is used to maximize the benefit that can be gained from the limited resources. Third, when the new event arrival rate is beyond the processing capability of the consumer of the ONED system, new events are further filtered and prioritized before they are presented to the consumer. Fourth, implicit citation relationships are created among all the documents and used to compute the importance of document sources. This importance information can guide the selection of document sources. We implemented a prototype of our framework on top of IBM's Stream Processing Core middleware. We also evaluated the effectiveness of our techniques on the standard TDT5 benchmark. To the best of our knowledge, this is the first implementation of a real application in a large-scale stream processing system.