Full-text indexing for optimizing selection operations in large-scale data analytics

Authors:
Jimmy Lin;Dmitriy Ryaboy;Kevin Weil
Affiliations:
Twitter, San Francisco, CA, USA;Twitter, San Francisco, CA, USA;Twitter, San Francisco, CA, USA
Venue:
Proceedings of the second international workshop on MapReduce and its applications
Year:
2011

Citing 16
Cited 13

Integrating SQL Databases with Content-Specific Search Engines

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
The Google file system

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
DB&IR: both sides now

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Map-reduce-merge: simplified relational data processing on large clusters

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Bigtable: a distributed storage system for structured data

OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Challenges in building large-scale information retrieval systems: invited talk

Proceedings of the Second ACM International Conference on Web Search and Data Mining
A comparison of approaches to large-scale data analysis

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
MapReduce and parallel DBMSs: friends or foes?

Communications of the ACM - Amir Pnueli: Ahead of His Time
MapReduce: a flexible data processing tool

Communications of the ACM - Amir Pnueli: Ahead of His Time
Building a high-level dataflow system on top of Map-Reduce: the Pig experience

Proceedings of the VLDB Endowment
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads

Proceedings of the VLDB Endowment
Manimal: relational optimization for data-intensive programs

Procceedings of the 13th International Workshop on the Web and Databases
Dremel: interactive analysis of web-scale datasets

Proceedings of the VLDB Endowment
Large-scale incremental processing using distributed transactions and notifications

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation

Large-scale machine learning at twitter

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Improving the diagnosis of mild hypertrophic cardiomyopathy with MapReduce

Proceedings of third international workshop on MapReduce and its Applications Date
Distributed approximate spectral clustering for large-scale datasets

Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
Only aggressive elephants are fast elephants

Proceedings of the VLDB Endowment
The unified logging infrastructure for data analytics at Twitter

Proceedings of the VLDB Endowment
Efficient big data processing in Hadoop MapReduce

Proceedings of the VLDB Endowment
Eagle-eyed elephant: split-oriented indexing in Hadoop

Proceedings of the 16th International Conference on Extending Database Technology
CARTILAGE: adding flexibility to the Hadoop skeleton

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Fast data in the era of big data: Twitter's real-time related query suggestion architecture

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Scaling big data mining infrastructure: the twitter experience

ACM SIGKDD Explorations Newsletter
HadoopProv: towards provenance as a first class citizen in MapReduce

TaPP'13 Proceedings of the 5th USENIX conference on Theory and Practice of Provenance
HadoopProv: towards provenance as a first class citizen in MapReduce

Proceedings of the 5th USENIX Workshop on the Theory and Practice of Provenance
Data-Intensive Cloud Computing: Requirements, Expectations, Challenges, and Solutions

Journal of Grid Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

MapReduce, especially the Hadoop open-source implementation, has recently emerged as a popular framework for large-scale data analytics. Given the explosion of unstructured data begotten by social media and other web-based applications, we take the position that any modern analytics platform must support operations on free-text fields as first-class citizens. Toward this end, this paper addresses one inefficient aspect of Hadoop-based processing: the need to perform a full scan of the entire dataset, even in cases where it is clearly not necessary to do so. We show that it is possible to leverage a full-text index to optimize selection operations on text fields within records. The idea is simple and intuitive: the full-text index informs the Hadoop execution engine which compressed data blocks contain query terms of interest, and only those data blocks are decompressed and scanned. Experiments with a proof of concept show moderate improvements in end-to-end query running times and substantial savings in terms of cumulative processing time at the worker nodes. We present an analytical model and discuss a number of interesting challenges: some operational, others research in nature.