Language determination: natural language processing from scanned document images

Authors:
Penelope Sibun;A. Lawrence Spitz
Affiliations:
Fuji Xerox Palo Alto Laboratory, Palo Alto, CA;Fuji Xerox Palo Alto Laboratory, Palo Alto, CA
Venue:
ANLC '94 Proceedings of the fourth conference on Applied natural language processing
Year:
1994

Citing 2
Cited 9

The automatic identification of languages using linguistic recognition signals
Content characterization using word shape tokens

COLING '94 Proceedings of the 15th conference on Computational linguistics - Volume 2

Querying across languages: a dictionary-based approach to multilingual information retrieval

SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
Script Identification in Printed Bilingual Documents

DAS '02 Proceedings of the 5th International Workshop on Document Analysis Systems V
Gabor Filter Based Multi-class Classifier for Scanned Document Images

ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2
Identifying the coding system and language of on-line documents on the Internet

COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 2
Content-oriented categorization of document images

COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 2
Language identification in web pages

Proceedings of the 2005 ACM symposium on Applied computing
Recognition assistance treating errors in texts acquired from various recognition processes

COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 2
Identifying Script on Word-Level with Informational Confidence

ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
A formal model of learning object metadata

EC-TEL'06 Proceedings of the First European conference on Technology Enhanced Learning: innovative Approaches for Learning and Knowledge Sharing

Abstract

Many documents are available to a computer only as images from paper. However, most natural language processing systems expect their input as character-coded text, which may be difficult or expensive to extract accurately from the page. We describe a method for converting a document image into character shape codes and word shape tokens. We believe that this representation, which is both cheap and robust, is sufficient for many NLP tasks. In this paper, we show that the representation is sufficient for determining which of 23 languages the document is written in, using only a small number of features, with greater than 90% accuracy overall.
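
As a rough illustration of the representation the abstract describes, the Python sketch below maps text to character shape codes and word shape tokens, then computes token frequencies as candidate features for language determination. It makes several assumptions. The paper derives shape codes from the page image, whereas this sketch starts from character-coded text purely to show what the tokens look like. The class labels (A, x, g, i, j), the letter assignments, and the function names are hypothetical choices for illustration, not the paper's own table or API.

```python
from collections import Counter

# Assumed letter classes, for illustration only (not the paper's table).
ASCENDERS = set("bdfhklt")   # lowercase letters rising above the x-height
DESCENDERS = set("gpqy")     # lowercase letters dropping below the baseline


def char_shape_code(ch: str) -> str:
    """Map one character to a coarse shape class."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"  # tall character
    if ch == "i":
        return "i"  # x-height body with a separate dot
    if ch == "j":
        return "j"  # descender with a separate dot
    if ch in DESCENDERS:
        return "g"  # descender
    if ch.isalpha():
        return "x"  # simplification: every other letter is treated as x-height
    return ch       # punctuation and other symbols pass through unchanged


def word_shape_token(word: str) -> str:
    """Concatenate the shape codes of a word's characters."""
    return "".join(char_shape_code(c) for c in word)


def token_frequencies(text: str) -> dict[str, float]:
    """Relative frequency of each word shape token in whitespace-split text."""
    counts = Counter(word_shape_token(w) for w in text.split())
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}


if __name__ == "__main__":
    print(word_shape_token("the"))            # shape token for a common English word
    print(token_frequencies("the and the"))   # token -> relative frequency
```

Under these assumptions, frequent function words in different languages tend to yield different tokens, so a small set of token frequencies could serve as features for a classifier. The choice of classifier is left open here.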