Probabilistic management of OCR data using an RDBMS

  • Authors: Arun Kumar, Christopher Ré
  • Affiliations: University of Wisconsin-Madison (both authors)
  • Venue: Proceedings of the VLDB Endowment
  • Year: 2011

Abstract

The digitization of scanned forms and documents is changing the data sources that enterprises manage. To integrate these new data sources with enterprise data, the current state-of-the-art approach is to convert the images to ASCII text using optical character recognition (OCR) software and then store the resulting ASCII text in a relational database. OCR is a challenging problem, and its output often contains errors; in turn, queries on the output of OCR may fail to retrieve relevant answers. State-of-the-art OCR programs, e.g., the OCR powering Google Books, use a probabilistic model that captures many alternatives during the OCR process. Only when the results of OCR are stored in the database do these approaches discard the uncertainty. In this work, we propose to retain the probabilistic models produced by the OCR process in a relational database management system. A key technical challenge is that the probabilistic data produced by OCR software is very large (a single book blows up from 400 kB as ASCII to 2 GB). As a result, a baseline solution that integrates these models with an RDBMS is over 1000x slower than standard text processing for single-table select-project queries. However, many applications may have quality-performance needs that lie between these two extremes of ASCII and the complete model output by the OCR software. Thus, we propose a novel approximation scheme called Staccato that allows a user to trade recall for query performance. Additionally, we provide a formal analysis of our scheme's properties and describe how we integrate our scheme with standard RDBMS text indexing.
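To make the core idea concrete, below is a minimal sketch, not taken from the paper, of what retaining OCR uncertainty in a relational store might look like. It assumes a deliberately simplified token-level encoding (one row per candidate token per position); the paper's actual representation is a far richer probabilistic model, and Staccato's approximation of it is more involved. The table and column names (ocr_tokens, doc_id, pos, token, prob) are hypothetical, chosen only for illustration.

```python
import sqlite3

# Hypothetical relational encoding of OCR alternatives: each token
# position in a document keeps every candidate string together with
# the probability the OCR model assigned to it.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE ocr_tokens (
        doc_id  INTEGER,
        pos     INTEGER,
        token   TEXT,
        prob    REAL   -- probability assigned by the OCR model
    )
""")

# Toy OCR output: position 0 of document 1 is read as either
# "fortune" (0.7) or "forty-one" (0.3). A plain-ASCII pipeline would
# keep only the top alternative and discard the rest.
cur.executemany(
    "INSERT INTO ocr_tokens VALUES (?, ?, ?, ?)",
    [
        (1, 0, "fortune",   0.7),
        (1, 0, "forty-one", 0.3),
        (1, 1, "teller",    0.9),
        (1, 1, "taller",    0.1),
    ],
)
conn.commit()

# A probabilistic keyword lookup: find documents where "forty-one"
# occurs at some position with probability above a threshold.
# Against ASCII-only storage this query would return nothing.
cur.execute(
    "SELECT DISTINCT doc_id FROM ocr_tokens WHERE token = ? AND prob >= ?",
    ("forty-one", 0.2),
)
print(cur.fetchall())  # -> [(1,)]
```

The sketch also hints at the size blow-up the abstract describes: storing every alternative multiplies the row count per token position, which is why an approximation scheme that prunes low-probability alternatives can trade recall for query performance.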