GPText: Greenplum parallel statistical text analysis framework

  • Authors:
  • Kun Li; Christan Grant; Daisy Zhe Wang; Sunny Khatri; George Chitouras

  • Affiliations:
  • University of Florida, Gainesville, FL; University of Florida, Gainesville, FL; University of Florida, Gainesville, FL; Greenplum, San Mateo, CA; Greenplum, San Mateo, CA

  • Venue:
  • Proceedings of the Second Workshop on Data Analytics in the Cloud
  • Year:
  • 2013

Abstract

Many companies keep large amounts of text data inside relational databases. Several challenges arise when using state-of-the-art systems to analyze such datasets. First, an expensive data transfer cost must be paid up front to move data between databases and analytics systems. Second, many popular text analytics packages do not scale up to production-sized datasets. In this paper, we introduce GPText, a Greenplum parallel statistical text analysis framework that addresses these problems by supporting statistical inference and learning algorithms natively in a massively parallel processing database system. GPText seamlessly integrates the Solr search engine and applies statistical algorithms such as k-means and LDA using MADLib, an open source library for scalable in-database analytics that can be installed on PostgreSQL and Greenplum. In addition, we developed and contributed a linear-chain conditional random field (CRF) module to MADLib to enable information extraction tasks such as part-of-speech tagging and named entity recognition. We show the performance and scalability of the parallel CRF implementation. Finally, we describe an eDiscovery application built on the GPText framework.