Lineage for Markovian stream event queries
Proceedings of the 10th ACM International Workshop on Data Engineering for Wireless and Mobile Access
Approximation trade-offs in a Markovian stream warehouse: An empirical study
Information Systems
A huge amount of the world's data is both sequential and low-level. Many applications consume higher-level information, such as words and sentences, that is inferred from low-level sequences, such as raw audio signals, using a model (e.g., a hidden Markov model). This inference process is typically statistical, so the resulting high-level streams are imprecise. Common queries on this data include sequence-finding event queries (e.g., "Find all times when the phrase 'Barack Obama...veto' occurs in the NPR news podcast from July 9"), aggregates of these event queries (e.g., "How many times do 2008 NPR podcasts use the phrase 'Barack Obama...veto'?"), and queries on the lineage of event queries (e.g., "What words appeared between the words 'Obama' and 'veto' in the previous query?"). These queries are difficult to support efficiently because of the large volume and rich semantics of imprecise data, but they are critical for allowing applications to make effective use of the information contained in these imprecise streams. In this thesis, we introduce Lahar, the first database system for a common type of imprecise, sequential model called a Markovian stream. Lahar includes algorithms for efficiently processing event queries, aggregated event queries, and event query lineage. Lahar improves performance and scalability using several techniques, including a set of novel Markovian stream indices and novel methods for approximating Markovian streams. Through experiments on two real-world datasets (one collected from an office-building RFID deployment and the other from audio podcasts), we demonstrate that Lahar is an efficient Markovian stream warehousing system.
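To make the notion of an event query over a Markovian stream more concrete, the following is a minimal sketch, not Lahar's actual algorithm or API. It assumes a Markovian stream can be represented as an initial distribution over states plus one conditional probability table per timestep transition, and it scores only a contiguous sequence of states (the gapped pattern 'Barack Obama...veto' from the abstract would need more machinery). All names (`marginals`, `sequence_probability`, the toy states) are hypothetical.

```python
from typing import Dict, List

Dist = Dict[str, float]            # state -> probability
CPT = Dict[str, Dict[str, float]]  # state at t -> {state at t+1 -> probability}


def marginals(initial: Dist, cpts: List[CPT]) -> List[Dist]:
    """Forward-propagate the initial distribution through each transition."""
    result = [initial]
    for cpt in cpts:
        prev = result[-1]
        nxt: Dist = {}
        for state, p in prev.items():
            for next_state, q in cpt.get(state, {}).items():
                nxt[next_state] = nxt.get(next_state, 0.0) + p * q
        result.append(nxt)
    return result


def sequence_probability(initial: Dist, cpts: List[CPT],
                         start: int, pattern: List[str]) -> float:
    """Probability that the stream passes through `pattern` at
    consecutive timesteps beginning at timestep `start`."""
    last = start + len(pattern) - 1
    if not pattern or start < 0 or last > len(cpts):
        return 0.0
    # Marginal probability of the first state, then chain the transitions.
    prob = marginals(initial, cpts)[start].get(pattern[0], 0.0)
    for i in range(len(pattern) - 1):
        step = cpts[start + i].get(pattern[i], {})
        prob *= step.get(pattern[i + 1], 0.0)
    return prob


# Toy two-timestep stream (assumed values, for illustration only).
initial = {"obama": 0.6, "alabama": 0.4}
cpts = [{"obama": {"veto": 0.5, "vote": 0.5}, "alabama": {"vote": 1.0}}]
print(sequence_probability(initial, cpts, 0, ["obama", "veto"]))
```

The transition tables matter here: multiplying the per-timestep marginals of "obama" and "veto" would ignore the correlation between adjacent timesteps, which is exactly what a Markovian stream is meant to preserve.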