Summarization of noisy documents: a pilot study

Authors:
Hongyan Jing; Daniel Lopresti; Chilin Shih
Affiliations:
IBM T.J. Watson Research Center, Yorktown Heights, NY; Hopewell, NJ; Berkeley Heights, NJ
Venue:
HLT-NAACL-DUC '03 Proceedings of the HLT-NAACL 03 on Text summarization workshop - Volume 5
Year:
2003

Abstract

We investigate the problem of summarizing text documents that contain errors as a result of optical character recognition. Each stage in the process is tested, the error effects analyzed, and possible solutions suggested. Our experimental results show that current approaches, which are developed to deal with clean text, suffer significant degradation even with slight increases in the noise level of a document. We conclude by proposing possible ways of improving the performance of noisy document summarization.