Traverse: Simplified Indexing on Large Map-Reduce-Merge Clusters

Authors:
Hung-Chih Yang; D. Stott Parker
Affiliations:
UCLA Computer Science Department; UCLA Computer Science Department
Venue:
DASFAA '09 Proceedings of the 14th International Conference on Database Systems for Advanced Applications
Year:
2009

Abstract

The search engines that index the World Wide Web today use access methods based primarily on scanning, sorting, hashing, and partitioning (SSHP) techniques. The MapReduce framework is a distinguished example. Unlike a DBMS, this search engine infrastructure provides few general tools for indexing user datasets. In particular, it does not include order-preserving tree indexes, even though they might have been built using such indexing components. Thus, data processing on these infrastructures is linearly scalable at best, while index-based techniques can be logarithmically scalable. DBMSs have long used indexes to improve performance, especially on low-selectivity queries and joins. Therefore, it is natural to incorporate indexing into search engine infrastructure. Recently, we proposed an extension of MapReduce called Map-Reduce-Merge to efficiently join heterogeneous datasets and execute relational algebra operations. Its vision was to extend search engine infrastructure so as to permit generic relational operations, expanding the scope of analysis of search engine content. In this paper we advocate incorporating yet another database primitive, indexing, into search engine data processing. We explore ways to build tree indexes using Hadoop MapReduce. We also incorporate a new primitive, Traverse, into the Map-Reduce-Merge framework. It can efficiently traverse index files, select data partitions, and limit the number of input partitions for a follow-up step of map, reduce, or merge.
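
The idea of using an index to limit input partitions can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a single-level sparse index that records the smallest key in each sorted data partition, and the names `min_keys`, `partitions`, and `traverse` are hypothetical. A binary search over the index picks out only the partitions that can contain keys in a requested range, so a follow-up map, reduce, or merge step need not scan the rest.

```python
from bisect import bisect_right

# Hypothetical sparse index: the smallest key stored in each sorted partition.
# Partition i is assumed to hold keys in [min_keys[i], min_keys[i + 1]).
min_keys = [0, 100, 200, 300, 400]
partitions = ["part-00000", "part-00001", "part-00002", "part-00003", "part-00004"]


def traverse(min_keys, partitions, low, high):
    """Return the partitions that may contain keys in the closed range [low, high]."""
    if low > high:
        return []
    # Index of the last partition whose minimum key is <= low (clamped to 0).
    first = max(bisect_right(min_keys, low) - 1, 0)
    # One past the last partition whose minimum key is <= high.
    last = bisect_right(min_keys, high)
    return partitions[first:last]


# Only the selected partitions would be handed to the follow-up step.
selected = traverse(min_keys, partitions, 150, 310)
print(selected)
```

Each lookup costs a logarithmic number of comparisons in the number of partitions, which is the contrast with a linear scan that the abstract draws. A multi-level tree index would repeat the same search at each level.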