Beyond myopic inference in big data pipelines

Authors:
Karthik Raman;Adith Swaminathan;Johannes Gehrke;Thorsten Joachims
Affiliations:
Cornell University, Ithaca, New York, USA;Cornell University, Ithaca, New York, USA;Cornell University, Ithaca, New York, USA;Cornell University, Ithaca, New York, USA
Venue:
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2013

Abstract

Big Data Pipelines decompose complex analyses of large data sets into a series of simpler tasks, with independently tuned components for each task. This modular setup allows components to be reused across several different pipelines. However, the interaction of independently tuned pipeline components yields poor end-to-end performance, because errors introduced by one component cascade through the whole pipeline and reduce overall accuracy. We propose a novel model for reasoning across components of Big Data Pipelines in a probabilistically well-founded manner. Our key idea is to view the interaction of components as dependencies in an underlying graphical model. Different message passing schemes on this graphical model yield different inference algorithms that trade off end-to-end performance against computational cost. We instantiate our framework with an efficient beam search algorithm, and demonstrate its efficiency on two Big Data Pipelines: parsing and relation extraction.
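To make the contrast between myopic and non-myopic inference concrete, the following is a minimal sketch of beam search across pipeline stages. It is an illustration under stated assumptions, not the authors' implementation: the function `beam_pipeline`, the toy stages `stage1` and `stage2`, and all probabilities are hypothetical. Each stage is assumed to return a list of candidate outputs with conditional probabilities, and a candidate's end-to-end score is assumed to be the product of the probabilities along its path.

```python
import heapq
from typing import Callable, List, Tuple

# Assumption: a component maps one input to a list of
# (output, conditional probability) candidates.
Component = Callable[[object], List[Tuple[object, float]]]


def beam_pipeline(x: object, components: List[Component],
                  beam_width: int) -> List[Tuple[float, object]]:
    """Run a pipeline, keeping the beam_width best partial paths per stage.

    Returns (joint probability, final output) pairs, best first.
    With beam_width=1 this reduces to the usual greedy (myopic) pipeline.
    """
    beam = [(1.0, x)]
    for component in components:
        candidates = []
        for prob, value in beam:
            # Expand every surviving hypothesis through the next stage.
            for out, p in component(value):
                candidates.append((prob * p, out))
        # Prune to the highest-scoring hypotheses by joint probability.
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beam


# Hypothetical two-stage pipeline with made-up probabilities.
def stage1(_x):
    return [("A", 0.6), ("B", 0.4)]


def stage2(label):
    if label == "A":
        return [("a1", 0.55), ("a2", 0.45)]
    return [("b1", 0.9), ("b2", 0.1)]


greedy = beam_pipeline("x", [stage1, stage2], beam_width=1)
wider = beam_pipeline("x", [stage1, stage2], beam_width=2)
```

In this toy setting the greedy run commits to `"A"` after the first stage and ends at `"a1"` with joint probability of about 0.33, while a beam of width 2 keeps `"B"` alive and recovers `"b1"` at about 0.36. Widening the beam costs more component calls per stage, which is the kind of accuracy-versus-computation trade-off the abstract refers to.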