Incrementally maintaining classification using an RDBMS

  • Authors:
  • M. Levent Koc;Christopher Ré

  • Affiliations:
  • University of Wisconsin-Madison;University of Wisconsin-Madison

  • Venue:
  • Proceedings of the VLDB Endowment
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

The proliferation of imprecise data has motivated both researchers and the database industry to push statistical techniques into relational database management systems (RDBMSes). We study strategies to maintain model-based views for a popular statistical technique, classification, inside an RDBMS in the presence of updates (to the set of training examples). We make three technical contributions: (1) A strategy that incrementally maintains classification inside an RDBMS. (2) An analysis of the above algorithm that shows that our algorithm is optimal among all deterministic algorithms (and asymptotically within a factor of 2 of a non-deterministic optimal strategy). (3) A novel hybrid-architecture based on the technical ideas that underlie the above algorithm which allows us to store only a fraction of the entities in memory. We apply our techniques to text processing, and we demonstrate that our algorithms provide an order of magnitude improvement over non-incremental approaches to classification on a variety of data sets, such as the Citeseer and DBLife.