DIGIMIMIR: A Tool for Rapid Situation Analysis of Helpdesk and Support Email
LISA '04 Proceedings of the 18th USENIX conference on System administration
A recently proposed adaptive strategy for text recognition exploits the linguistic fact that over half of the words on a typical English page are among 150 common stop words. The small lexicon permits word-shape-based recognition that yields word identities, from which character prototypes can be extracted. This paper describes a fast procedure for locating the best candidates for those stop words. The procedure uses width statistics of individual words and their immediate neighbors. In an experiment using 400 page images, the method removed 63% of the words from consideration. The stop/non-stop word discrimination also assists keyword spotting for information retrieval.
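
The abstract does not spell out the procedure, so the following is only a minimal sketch of how width statistics of a word and its immediate neighbors might be used to shortlist stop-word candidates. The function name `stop_word_candidates`, the `max_ratio` threshold, and the specific rule (narrow relative to the page median and no wider than the average of its neighbors) are all assumptions made for illustration, not the method reported in the paper.

```python
from statistics import mean, median


def stop_word_candidates(widths, max_ratio=0.75):
    """Return indices of words that look narrow enough to be stop words.

    widths: word bounding-box widths in pixels, in reading order.
    max_ratio: hypothetical threshold (an assumption, not from the paper);
        a word must be at most max_ratio times the page median width.
    """
    if not widths:
        return []

    page_median = median(widths)
    candidates = []
    for i, w in enumerate(widths):
        # Immediate neighbors: one on each side, fewer at the ends.
        neighbors = widths[max(0, i - 1):i] + widths[i + 1:i + 2]

        # Page-level statistic: the word is narrow for this page.
        narrow_on_page = w <= max_ratio * page_median
        # Local statistic: the word is no wider than its neighbors' average.
        narrow_locally = bool(neighbors) and w <= mean(neighbors)

        if narrow_on_page and narrow_locally:
            candidates.append(i)
    return candidates


# Example: seven word widths from one text line (made-up values).
line_widths = [30, 95, 28, 110, 60, 25, 120]
kept = stop_word_candidates(line_widths)
removed = len(line_widths) - len(kept)
print(f"candidates at {kept}; {removed} of {len(line_widths)} words removed")
```

Only the shortlisted words would then be passed to the word-shape recognizer; the remaining words are dropped from consideration, which is the kind of pruning the abstract reports.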