A performance comparison of parallel DBMSs and MapReduce on large-scale text analytics

Authors:
Fei Chen;Meichun Hsu
Affiliations:
HP Labs;HP Labs
Venue:
Proceedings of the 16th International Conference on Extending Database Technology
Year:
2013

Citing 24
Cited 0

Approximate String Joins in a Database (Almost) for Free

Proceedings of the 27th International Conference on Very Large Data Bases
Table extraction using conditional random fields

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
UIMA: an architectural approach to unstructured information processing in the corporate research environment

Natural Language Engineering
Reference reconciliation in complex information spaces

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
The TEXTURE benchmark: measuring performance of text queries on a relational DBMS

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons

CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4
Incorporating non-local information into information extraction systems by Gibbs sampling

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Declarative information extraction using datalog with embedded extraction predicates

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Information extraction from Wikipedia: moving down the long tail

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Information Extraction

Foundations and Trends in Databases
The YAGO-NAGA approach to knowledge discovery

ACM SIGMOD Record
An Algebraic Approach to Rule-Based Information Extraction

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
A comparison of approaches to large-scale data analysis

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Optimizing complex extraction programs over evolving text data

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Efficient parallel set-similarity joins using MapReduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Data-Intensive Text Processing with MapReduce

Data-Intensive Text Processing with MapReduce
Querying probabilistic information extraction

Proceedings of the VLDB Endowment
Massively parallel data analysis with PACTs on Nephele

Proceedings of the VLDB Endowment
Tuffy: scaling up statistical inference in Markov logic networks using an RDBMS

Proceedings of the VLDB Endowment
Hybrid in-database inference for declarative information extraction

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Towards a unified architecture for in-RDBMS analytics

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Optimizing analytic data flows for multiple execution engines

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
The MADlib analytics library: or MAD skills, the SQL

Proceedings of the VLDB Endowment
Can the elephants handle the NoSQL onslaught?

Proceedings of the VLDB Endowment

Quantified Score

Hi-index	0.00

Visualization

Abstract

Text analytics has become increasingly important with the rapid growth of text data. Particularly, information extraction (IE), which extracts structured data from text, has received significant attention. Unfortunately, IE is often computationally intensive. To address this issue, MapReduce has been used for large scale IE. Recently, there are emerging efforts from both academia and industry on pushing IE inside DBMSs. This leads to an interesting and important question: Given that both MapReduce and parallel DBMSs are for large scale analytics, which platform is a better choice for large scale IE? In this paper, we propose a benchmark to systematically study the performance of both platforms for large scale IE tasks. The benchmark includes both statistical learning based and rule based IE programs, which have been extensively used in real-world IE tasks. We show how to express these programs on both platforms and conduct experiments on real-world datasets. Our results show that parallel DBMSs is a viable alternative for large scale IE.