Full-text indexing for optimizing selection operations in large-scale data analytics

  • Authors: Jimmy Lin; Dmitriy Ryaboy; Kevin Weil

  • Affiliations: Twitter, San Francisco, CA, USA (all authors)

  • Venue: Proceedings of the second international workshop on MapReduce and its applications
  • Year: 2011

Abstract

MapReduce, especially the Hadoop open-source implementation, has recently emerged as a popular framework for large-scale data analytics. Given the explosion of unstructured data generated by social media and other web-based applications, we take the position that any modern analytics platform must support operations on free-text fields as first-class citizens. Toward this end, this paper addresses one inefficient aspect of Hadoop-based processing: the need to perform a full scan of the entire dataset, even in cases where it is clearly not necessary to do so. We show that it is possible to leverage a full-text index to optimize selection operations on text fields within records. The idea is simple and intuitive: the full-text index informs the Hadoop execution engine which compressed data blocks contain query terms of interest, and only those data blocks are decompressed and scanned. Experiments with a proof of concept show moderate improvements in end-to-end query running times and substantial savings in terms of cumulative processing time at the worker nodes. We present an analytical model and discuss a number of interesting challenges: some operational, others research in nature.
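
The block-skipping idea described in the abstract can be illustrated with a small, single-process sketch. This is not the authors' implementation and does not use Hadoop: `zlib` stands in for Hadoop's block compression, a Python dictionary stands in for the full-text index, and the names `tokenize`, `build_blocks`, and `select` are hypothetical. The sketch also assumes whitespace tokenization and records that contain no newline characters.

```python
import zlib
from collections import defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, far simpler than a real analyzer.
    return text.lower().split()


def build_blocks(records, block_size):
    """Group records into compressed blocks and map each term to the blocks containing it."""
    blocks = []
    index = defaultdict(set)  # term -> set of block ids (toy stand-in for a full-text index)
    for start in range(0, len(records), block_size):
        chunk = records[start:start + block_size]
        block_id = len(blocks)
        for record in chunk:
            for term in tokenize(record):
                index[term].add(block_id)
        # One record per line; zlib stands in for the block compression codec.
        blocks.append(zlib.compress("\n".join(chunk).encode("utf-8")))
    return blocks, index


def select(blocks, index, term):
    """Return (matching records, number of blocks decompressed) for a single-term selection."""
    term = term.lower()
    candidates = sorted(index.get(term, ()))  # blocks the index says may contain the term
    matches = []
    for block_id in candidates:
        text = zlib.decompress(blocks[block_id]).decode("utf-8")
        for record in text.split("\n"):
            if term in tokenize(record):
                matches.append(record)
    return matches, len(candidates)


records = [
    "hadoop makes full scans easy",
    "pig scripts compile to mapreduce",
    "coffee in san francisco",
    "weather is foggy today",
    "lucene builds a full-text index",
    "selection on text fields",
]
blocks, index = build_blocks(records, block_size=2)
matches, scanned = select(blocks, index, "hadoop")
print(matches, f"decompressed {scanned} of {len(blocks)} blocks")
```

In this toy run only the one block that the index lists for the query term is decompressed; the other blocks are never touched, which is the source of the savings in worker processing time that the abstract reports. How much is saved in practice depends on how many blocks contain the term, so a very common term would still force a near-full scan.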