A Case Restoration Approach to Named Entity Tagging in Degraded Documents

Authors:
Rohini K. Srihari;Cheng Niu;Wei Li;Jihong Ding
Venue:
ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2
Year:
2003



Abstract

This paper describes a novel approach to named entity (NE) tagging on degraded documents. NE tagging is the process of identifying salient text strings in unstructured text, corresponding to names of people, places, organizations, times/dates, etc. Although NE tagging is typically part of a larger information extraction process, it has other applications, such as improving search in an information retrieval system, and post-processing the results of an OCR system. We focus on degraded documents, i.e. case-insensitive documents that lack orthographic information. Examples include output of speech recognition systems, as well as e-mail. The traditional approach involves retraining an NE tagger on degraded text, a cumbersome operation. This paper describes an approach whereby text is first "restored" to its implicit case-sensitive form, and subsequently processed by the original NE tagger. Results show that this new approach leads to far less precision loss in NE tagging of degraded documents.
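
The two-stage idea in the abstract (restore case first, then run an unmodified case-sensitive tagger) can be illustrated with a minimal sketch. The abstract does not say how case restoration is performed, so the code below swaps in a simple stand-in: a lookup of each token's most frequent cased form in a small cased sample. The names `build_case_model`, `restore_case`, and `toy_ne_tagger` are hypothetical, the sample sentences are invented for illustration, and the "tagger" is only a capitalization heuristic standing in for a real NE tagger.

```python
from collections import Counter, defaultdict


def build_case_model(cased_tokens):
    """Map each lowercased token to its most frequent cased form.

    Assumption: a unigram lookup, not the method used in the paper.
    """
    counts = defaultdict(Counter)
    for tok in cased_tokens:
        counts[tok.lower()][tok] += 1
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}


def restore_case(tokens, model):
    """Restore case token by token; unknown tokens fall back to lowercase."""
    return [model.get(tok.lower(), tok.lower()) for tok in tokens]


def toy_ne_tagger(tokens):
    """Stand-in for a case-sensitive tagger: group runs of capitalized tokens."""
    entities, current = [], []
    for tok in tokens:
        if tok[:1].isupper():
            current.append(tok)
        elif current:
            entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


# Invented cased sample used to learn the preferred form of each token.
training = "Rohini Srihari spoke in Buffalo about the tagger in Buffalo".split()
model = build_case_model(training)

# Degraded input: all upper case, so capitalization carries no information.
degraded = "ROHINI SRIHARI SPOKE IN BUFFALO".split()
restored = restore_case(degraded, model)
entities = toy_ne_tagger(restored)
```

In this sketch the tagger is never retrained: on the degraded input it would treat the whole sentence as one capitalized run, while on the restored text it can separate the names from the surrounding words. A real system would need a context-sensitive restoration model, since many words are capitalized only in some contexts.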