Distributed indexing of web scale datasets for the cloud

Authors:
Ioannis Konstantinou;Evangelos Angelou;Dimitrios Tsoumakos;Nectarios Koziris
Affiliations:
National Technical University of Athens;National Technical University of Athens;National Technical University of Athens;National Technical University of Athens
Venue:
Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud
Year:
2010

Abstract

In this paper, we present a distributed architecture for indexing and serving large and diverse datasets. It incorporates and extends the functionality of Hadoop, the open-source MapReduce framework, and of HBase, a distributed, sparse NoSQL database, to create a fully parallel indexing system. Experiments with structured, semi-structured and unstructured data of various sizes demonstrate the flexibility, speed and robustness of our implementation and contrast it with similarly oriented projects. Our 11-node cluster prototype completed full-text indexing of 150 GB of raw content in less than 3 hours, while the system's response time under a sustained query load of more than 1000 queries/sec remained on the order of milliseconds.
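
As a rough illustration of the general idea behind MapReduce-style index construction, the sketch below builds a small inverted index in plain Python. It is not the authors' implementation: it runs in a single process, uses neither Hadoop nor HBase, and every name in it (map_phase, reduce_phase, build_index, search, DOCS) is hypothetical. The in-memory dictionary only stands in for the role that a table keyed by term could play in a distributed store.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step (illustrative): emit one (term, doc_id) pair per token."""
    for term in text.lower().split():
        yield term, doc_id


def reduce_phase(pairs):
    """Reduce step (illustrative): group document ids by term."""
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    # Sorted posting lists make the output deterministic.
    return {term: sorted(ids) for term, ids in index.items()}


def build_index(docs):
    """Run the map step over every document, then reduce into one index."""
    pairs = (
        pair
        for doc_id, text in docs.items()
        for pair in map_phase(doc_id, text)
    )
    return reduce_phase(pairs)


def search(index, query):
    """Return the ids of documents containing every term of the query."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical toy corpus, used only to exercise the sketch.
DOCS = {
    "d1": "distributed indexing on Hadoop",
    "d2": "HBase serves the index",
    "d3": "Hadoop and HBase",
}
INDEX = build_index(DOCS)
```

Calling search(INDEX, "hadoop hbase") intersects the posting lists of both terms. In a real deployment the map and reduce steps would run in parallel across cluster nodes, and lookups would go to the distributed store rather than a local dictionary.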