This paper describes a new technique for locating content-representing words in a document image using an abstract representation of character shapes. It develops a character shape code representation in which each code is defined by the location of a character within its text line. Generating character shape codes avoids the computational expense of conventional optical character recognition (OCR). Because character shape codes are an abstraction of standard character codes (e.g., ASCII), the mapping from shape codes back to characters is ambiguous. The paper shows that, in practice, this ambiguity is limited to an acceptable level. Three points are illustrated: first, punctuation marks are clearly distinguished from other characters; second, stop words are generally distinguishable from other words, because the permutations of character shape codes in function words are characteristically different from those in content words; and third, numerals and acronyms in capital letters are distinguishable from other words. With these classifications, potential content-representing words are identified, and an analysis of their distribution yields their rank. Consequently, character shape codes make it possible to bridge the gap between electronic documents and hard-copy documents inexpensively and robustly for the purpose of content identification.
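As a rough illustration of why the mapping is ambiguous, here is a minimal sketch in Python. The abstract does not list the actual shape classes, so the four classes used here (`A` for tall characters, `x` for x-height characters, `g` for descenders, and `i`/`j` for dotted characters) and the helper names `shape_code`, `word_shape`, and `group_by_shape` are assumptions made only for this example. The sketch also works on plain text, whereas the technique described above derives the codes from a document image.

```python
from collections import defaultdict

# Hypothetical shape classes, assumed for illustration only.
ASCENDERS = set("bdfhklt")      # lowercase letters rising above x-height
DESCENDERS = set("gpqy")        # lowercase letters dropping below the baseline
DOTTED = {"i": "i", "j": "j"}   # dotted letters keep their own codes


def shape_code(ch):
    """Map one character to an assumed shape class."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"              # tall: capitals, digits, ascenders
    if ch in DOTTED:
        return DOTTED[ch]
    if ch in DESCENDERS:
        return "g"
    if ch.islower():
        return "x"              # remaining lowercase letters sit at x-height
    return ch                   # punctuation and other symbols pass through


def word_shape(word):
    """Concatenate the shape codes of every character in a word."""
    return "".join(shape_code(c) for c in word)


def group_by_shape(words):
    """Group words that collapse to the same shape string."""
    groups = defaultdict(list)
    for w in words:
        groups[word_shape(w)].append(w)
    return dict(groups)


if __name__ == "__main__":
    for shape, members in group_by_shape(["and", "end", "the", "OCR", "1993"]).items():
        print(shape, members)
```

Under these assumed classes, "and" and "end" collapse to the same shape string, which is the kind of ambiguity the abstract refers to, while all-capital acronyms and numerals become runs of the tall class and punctuation passes through unchanged.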