We have developed a technique that categorizes document images based on their content. Unlike conventional methods that use optical character recognition (OCR), we convert document images into word shape tokens, a shape-based representation of words. Because we only have to recognize simple graphical features in the image, this process is much faster than OCR. Although each word shape token can correspond to many different words, the tokens are a rich source of information for content characterization. Using a vector space classifier on a database of scanned document images, we show that the word shape token-based approach achieves accuracy adequate for content-oriented categorization when compared with conventional OCR-based approaches.
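The idea can be illustrated with a minimal sketch. It is not the implementation described above: it starts from plain text rather than scanned images, and the character shape classes (`A` for tall characters, `g` for descenders, `x` for x-height characters, `i` and `j` kept as their own codes) are a simplified assumption chosen for illustration. The function names `shape_code`, `word_shape_token`, `token_vector`, `cosine`, and `classify` are hypothetical. The sketch shows how several words collapse onto one token and how token counts can still feed a cosine-similarity vector space classifier.

```python
import math
from collections import Counter

# Assumed, simplified shape classes; the actual code set used by the
# authors may differ.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")


def shape_code(ch):
    """Map one character to a coarse shape class (illustrative only)."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"  # tall: capitals, digits, ascenders
    if ch in DESCENDERS:
        return "g"  # extends below the baseline
    if ch in ("i", "j"):
        return ch   # dotted characters kept as their own codes
    if ch.isalpha():
        return "x"  # remaining x-height letters
    return ""       # punctuation and other symbols are dropped


def word_shape_token(word):
    """Convert a word into its word shape token."""
    return "".join(shape_code(c) for c in word)


def token_vector(text):
    """Count word shape tokens in whitespace-separated text."""
    tokens = (word_shape_token(w) for w in text.split())
    return Counter(t for t in tokens if t)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(text, profiles):
    """Return the label whose profile vector is most similar to the text."""
    vec = token_vector(text)
    return max(profiles, key=lambda label: cosine(vec, profiles[label]))
```

In this sketch, `word_shape_token("hot")` and `word_shape_token("bat")` both yield `AxA`, which is the ambiguity the abstract refers to. Category profiles would be built by applying `token_vector` to labeled training text and passed to `classify` as a dictionary from label to vector.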