Extracting relevant named entities for automated expense reimbursement

Authors:
Guangyu Zhu;Timothy J. Bethea;Vikas Krishna
Affiliations:
University of Maryland, College Park/ College Park, MD;IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA
Venue:
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2007

Citing 22
Cited 2

Representations of quasi-Newton matrices and their use in limited memory methods

Mathematical Programming: Series A and B
Segmentation of page images using the area Voronoi diagram

Computer Vision and Image Understanding - Special issue on document image understanding and retrieval
Empirical Performance Evaluation Methodology and Its Application to Page Segmentation Algorithms

IEEE Transactions on Pattern Analysis and Machine Intelligence
Optical Character Recognition: An Illustrated Guide to the Frontier

Optical Character Recognition: An Illustrated Guide to the Frontier
Automating the Construction of Internet Portals with Machine Learning

Information Retrieval
Digital Libraries and Autonomous Citation Indexing

Computer
The Document Spectrum for Page Layout Analysis

IEEE Transactions on Pattern Analysis and Machine Intelligence
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Maximum Entropy Markov Models for Information Extraction and Segmentation

ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Information retrieval and OCR: from converting content to grasping meaning

ACM SIGIR Forum
Table extraction using conditional random fields

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Pattern Classification (2nd Edition)

Pattern Classification (2nd Edition)
Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning to recognize tables in free text

ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Shallow parsing with conditional random fields

NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Unsupervised named-entity extraction from the web: an experimental study

Artificial Intelligence
Conditional Random Fields for Contextual Human Motion Recognition

ICCV '05 Proceedings of the Tenth IEEE International Conference on Computer Vision - Volume 2
Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms

EMNLP '02 Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10
Named Entity Extraction using AdaBoost

COLING-02 proceedings of the 6th conference on Natural language learning - Volume 20
Named entity recognition through classifier combination

CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4
Is there a grand challenge or X-prize for data mining?

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Multiscale conditional random fields for image labeling

CVPR'04 Proceedings of the 2004 IEEE computer society conference on Computer vision and pattern recognition

Information Extraction

Foundations and Trends in Databases
Language identification for handwritten document images using a shape codebook

Pattern Recognition

Quantified Score

Hi-index	0.00

Visualization

Abstract

Expense reimbursement is a time-consuming and labor-intensive process across organizations. In this paper, we present a prototype expense reimbursement system that dramatically reduces the elapsed time and costs involved, by eliminating paper from the process life cycle. Our complete solution involves (1) an electronic submission infrastructure that provides multi- channel image capture, secure transport and centralized storage of paper documents; (2) an unconstrained data mining approach to extracting relevant named entities from un-structured document images; (3) automation of auditing procedures that enables automatic expense validation with minimum human interaction. Extracting relevant named entities robustly from document images with unconstrained layouts and diverse formatting is a fundamental technical challenge to image-based data mining, question answering, and other information retrieval tasks. In many applications that require such capability, applying traditional language modeling techniques to the stream of OCR text does not give satisfactory result due to the absence of linguistic context. We present an approach for extracting relevant named entities from document images by combining rich page layout features in the image space with language content in the OCR text using a discriminative conditional random field (CRF) framework. We integrate this named entity extraction engine into our expense reimbursement solution and evaluate the system performance on large collections of real-world receipt images provided by IBM World Wide Reimbursement Center.