Modeling content identification from document images

Authors:
Takehiro Nakayama
Affiliations:
Fuji Xerox Palo Alto Laboratory, Palo Alto, CA
Venue:
ANLC '94 Proceedings of the fourth conference on Applied natural language processing
Year:
1994

Abstract

A new technique to locate content-representing words in a given document image using an abstract representation of character shapes is described. A character shape code representation defined by the location of a character in a text line has been developed. Character shape code generation avoids the computational expense of conventional optical character recognition (OCR). Because character shape codes are an abstraction of standard character codes (e.g., ASCII), several characters share one shape code, so the mapping is ambiguous. In this paper, the ambiguity is shown to be practically limited to an acceptable level. It is illustrated that: first, punctuation marks are clearly distinguished from the other characters; second, stop words are generally distinguishable from other words, because the permutations of character shape codes in function words are characteristically different from those in content words; and third, numerals and acronyms in capital letters are distinguishable from other words. With these classifications, potential content-representing words are identified, and an analysis of their distribution yields their rank. Consequently, introducing character shape codes makes it possible to inexpensively and robustly bridge the gap between electronic documents and hard-copy documents for the purpose of content identification.
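The idea of a character shape code can be illustrated with a minimal sketch. The abstract does not list the paper's actual code set, so the classes below ("A" for tall characters, "x" for x-height characters, "g" for descenders, "i" and "j" for dotted letters) and the function names `shape_code`, `word_shape`, and `group_by_shape` are assumptions made purely for illustration. The sketch also works from plain text rather than from an image, so it shows only the abstraction step and its ambiguity, not the image analysis.

```python
from collections import defaultdict

# Hypothetical shape classes, chosen for illustration only; the paper's
# actual code set is not given in the abstract.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")


def shape_code(ch: str) -> str:
    """Map one character to a coarse shape class (assumed scheme)."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"  # rises above the x-height
    if ch in DESCENDERS:
        return "g"  # drops below the baseline
    if ch == "i":
        return "i"  # x-height stem with a dot
    if ch == "j":
        return "j"  # descender with a dot
    if ch.islower():
        return "x"  # sits between the baseline and the x-height
    return ch       # punctuation and other symbols pass through unchanged


def word_shape(word: str) -> str:
    """Concatenate the shape codes of every character in a word."""
    return "".join(shape_code(c) for c in word)


def group_by_shape(words):
    """Group words that collapse to the same shape token."""
    groups = defaultdict(list)
    for w in words:
        groups[word_shape(w)].append(w)
    return dict(groups)


if __name__ == "__main__":
    # "cat" and "eat" share a token, which is the ambiguity the abstract mentions.
    print(group_by_shape(["the", "cat", "eat", "NASA", "1994"]))
```

Under this assumed scheme, "cat" and "eat" collapse to the same token, while all-capital acronyms and numerals become runs of the tall class, which is one plausible reason such words stand out from ordinary lowercase words.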