Content characterization using word shape tokens

Authors:
Penelope Sibun;David S. Farrar
Affiliations:
Fuji Xerox Palo Alto Laboratory, Palo Alto, CA;Fuji Xerox Palo Alto Laboratory, Palo Alto, CA
Venue:
COLING '94 Proceedings of the 15th conference on Computational linguistics - Volume 2
Year:
1994

Abstract

By quickly classifying character images into character shape categories, it is possible to automatically extract syntactic information from the text of document images without optical character recognition. Using word shape tokens composed of these character shape codes, a properly trained text tagger can extract part-of-speech information from scanned document images. Later components of a document processing system can then use this information to locate topics, characterize document style, and assist in information retrieval.
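
The idea of a word shape token can be illustrated with a minimal sketch. It works on plain text rather than on character images, so it only simulates the output of the image classification step. The class labels (A, x, g, i, j), the letter groupings, and the function names shape_code, word_shape_token, and tokenize are assumptions made for illustration and are not taken from the abstract; the paper's actual category set may differ.

```python
# Illustrative only: simulates character shape coding on plain text.
# The shape classes below are assumed, not taken from the abstract.
ASCENDERS = set("bdfhklt")   # lowercase letters that rise above x-height
DESCENDERS = set("gpqy")     # lowercase letters that drop below the baseline


def shape_code(ch: str) -> str:
    """Map one character to a hypothetical character shape code."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"           # tall characters
    if ch in DESCENDERS:
        return "g"           # characters with descenders
    if ch == "i":
        return "i"           # dotted, x-height
    if ch == "j":
        return "j"           # dotted, with descender
    if ch.islower():
        return "x"           # remaining x-height letters
    return ch                # punctuation and other symbols pass through


def word_shape_token(word: str) -> str:
    """Concatenate the shape codes of a word's characters."""
    return "".join(shape_code(c) for c in word)


def tokenize(text: str) -> list[str]:
    """Turn whitespace-separated words into word shape tokens."""
    return [word_shape_token(w) for w in text.split()]


if __name__ == "__main__":
    print(tokenize("the dog is typing"))
```

Under these assumed classes, "the" and "The" map to the same token, as do many other word pairs, so a tagger working from such tokens has to be trained to cope with far more ambiguity than one working from recognized characters.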