The automatic identification of languages using linguistic recognition signals
Content characterization using word shape tokens
COLING '94 Proceedings of the 15th conference on Computational linguistics - Volume 2
Querying across languages: a dictionary-based approach to multilingual information retrieval
SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
Script Identification in Printed Bilingual Documents
DAS '02 Proceedings of the 5th International Workshop on Document Analysis Systems V
Gabor Filter Based Multi-class Classifier for Scanned Document Images
ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2
Identifying the coding system and language of on-line documents on the Internet
COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 2
Content-oriented categorization of document images
COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 2
Language identification in web pages
Proceedings of the 2005 ACM symposium on Applied computing
Recognition assistance treating errors in texts acquired from various recognition processes
COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 2
Identifying Script on Word-Level with Informational Confidence
ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
A formal model of learning object metadata
EC-TEL'06 Proceedings of the First European conference on Technology Enhanced Learning: innovative Approaches for Learning and Knowledge Sharing
Many documents are available to a computer only as images from paper. However, most natural language processing systems expect their input as character-coded text, which may be difficult or expensive to extract accurately from the page. We describe a method for converting a document image into character shape codes and word shape tokens. We believe that this representation, which is both cheap and robust, is sufficient for many NLP tasks. In this paper, we show that the representation is sufficient for determining which of 23 languages the document is written in, using only a small number of features, with greater than 90% accuracy overall.
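As a rough illustration of the representation described above, the following Python sketch maps characters to coarse shape codes and words to word shape tokens, then guesses a language from how often known tokens occur. It works from already-coded text only to show what the tokens look like; the method in the abstract derives them from the page image. The shape classes, the function names, and the tiny language profiles are all assumptions made for this example and are not taken from the paper.

```python
# Hypothetical shape classes (assumed for illustration, not taken from the paper).
ASCENDERS = set("bdfhklt")   # letters rising above x-height
DESCENDERS = set("gpqy")     # letters dropping below the baseline


def char_shape_code(ch):
    """Map one character to a coarse shape code."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"           # tall character
    if ch == "j":
        return "j"           # dotted descender
    if ch == "i":
        return "i"           # dotted x-height character
    if ch in DESCENDERS:
        return "g"           # descender
    if ch.isalpha():
        return "x"           # plain x-height character
    return ch                # punctuation is left unchanged


def word_shape_token(word):
    """Concatenate the shape codes of a word's characters."""
    return "".join(char_shape_code(c) for c in word)


def build_profile(common_words):
    """Set of shape tokens for a language's frequent words (toy profile)."""
    return {word_shape_token(w) for w in common_words}


def guess_language(text, profiles):
    """Pick the language whose profile matches the most tokens in the text."""
    tokens = [word_shape_token(w) for w in text.split()]
    scores = {
        lang: sum(1 for t in tokens if t in profile)
        for lang, profile in profiles.items()
    }
    return max(scores, key=scores.get)


# Toy profiles built from a handful of frequent words (assumed, not from the paper).
PROFILES = {
    "english": build_profile(["the", "of", "and", "to"]),
    "german": build_profile(["der", "die", "und", "das"]),
    "french": build_profile(["le", "la", "de", "et"]),
}

if __name__ == "__main__":
    print(word_shape_token("the"))                        # AAx
    print(guess_language("the cat and the dog", PROFILES))
```

Different words often collapse to the same token (for example, "and" and "und" both become `xxA`), so a scheme like this has to rely on the distribution of tokens across a whole document rather than on any single word.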