Content-oriented categorization of document images

Authors: Takehiro Nakayama
Affiliations: FX Palo Alto Laboratory, Inc., Palo Alto, CA
Venue: COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 2
Year: 1996



Abstract

We have developed a technique that categorizes document images by their content. Unlike conventional methods, which use optical character recognition (OCR), our technique converts document images into word shape tokens, a shape-based representation of words. Because this requires recognizing only simple graphical features in the image, the process is much faster than OCR. Although the mapping between word shape tokens and words is one-to-many, the tokens are a rich source of information for content characterization. Using a vector space classifier on a database of scanned document images, we show that the word shape token-based approach is adequate for content-oriented categorization, with accuracy comparable to that of conventional OCR-based approaches.
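
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not specify the shape coding or the classifier details, so the character classes below (tall, descending, dotted, x-height), the function names, and the toy category profiles are all assumptions made for illustration. The sketch also simulates shape tokens from plain text, whereas the technique itself derives them from image features.

```python
import math
from collections import Counter

# Assumed shape classes; the abstract does not give the actual coding scheme.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")


def shape_token(word):
    """Map a word to a coarse shape string (hypothetical coding, simulated from text)."""
    out = []
    for ch in word:
        if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
            out.append("A")  # tall character
        elif ch in DESCENDERS:
            out.append("g")  # character that drops below the baseline
        elif ch in "ij":
            out.append(ch)   # dotted characters keep their own class
        elif ch.isalpha():
            out.append("x")  # x-height character
        # punctuation and other symbols are ignored in this sketch
    return "".join(out)


def to_vector(text):
    """Build a term-frequency vector over shape tokens."""
    tokens = (shape_token(w) for w in text.split())
    return Counter(t for t in tokens if t)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b.get(token, 0) for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, profiles):
    """Return the category whose profile vector is most similar to the text."""
    vector = to_vector(text)
    return max(profiles, key=lambda category: cosine(vector, profiles[category]))


# Toy category profiles (made up for this example, not data from the paper).
profiles = {
    "sports": to_vector("the team won the game and the player scored a goal"),
    "finance": to_vector("the bank raised interest rates and the market fell"),
}

print(shape_token("dog"), shape_token("bay"))
print(classify("the player scored in the game", profiles))
```

Under this assumed coding, "dog" and "bay" receive the same token, which shows why one token corresponds to many words. The frequency profile over tokens can still separate the two toy categories.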