Fast Identification of Stop Words for Font Learning and Keyword Spotting

Authors:
Tin Kam Ho
Venue:
ICDAR '99 Proceedings of the Fifth International Conference on Document Analysis and Recognition
Year:
1999

Cited by:
DIGIMIMIR: A Tool for Rapid Situation Analysis of Helpdesk and Support Email. LISA '04 Proceedings of the 18th USENIX conference on System administration

Abstract

A recently proposed adaptive strategy for text recognition exploits the linguistic fact that over half of the words on a typical English page are among 150 common stop words. This small lexicon permits word-shape-based recognition, which yields word identities from which character prototypes can be extracted. This paper describes a fast procedure for locating the best candidates for those stop words. The procedure uses width statistics of individual words and their immediate neighbors. In an experiment using 400 page images, the method removed 63% of the words from consideration. The stop/non-stop word discrimination also assists keyword spotting for information retrieval.
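
The abstract does not spell out the width statistics or decision rule, so the following is only an illustrative sketch of the general idea of width-based candidate filtering, not the paper's procedure. It assumes word widths (in pixels) are already available from page segmentation, and it keeps a word as a stop-word candidate when it is narrow relative to the page median and no wider than at least one immediate neighbor. The function name `stop_word_candidates` and the thresholds `max_ratio` and `neighbor_ratio` are hypothetical choices made for this example.

```python
from statistics import median


def stop_word_candidates(widths, max_ratio=0.8, neighbor_ratio=1.0):
    """Return indices of words kept as stop-word candidates.

    widths: word widths in reading order (assumed to come from segmentation).
    max_ratio, neighbor_ratio: hypothetical thresholds, not from the paper.
    """
    page_median = median(widths)
    candidates = []
    for i, w in enumerate(widths):
        # Immediate left and right neighbors, when they exist.
        neighbors = widths[max(0, i - 1):i] + widths[i + 1:i + 2]
        narrow_on_page = w <= max_ratio * page_median
        narrow_locally = any(w <= neighbor_ratio * n for n in neighbors)
        if narrow_on_page and narrow_locally:
            candidates.append(i)
    return candidates


# Toy widths for a single text line (made-up numbers, not experimental data).
widths = [30, 95, 22, 110, 48, 130, 25]
candidates = stop_word_candidates(widths)
removed_fraction = 1 - len(candidates) / len(widths)
print(candidates)                   # [0, 2, 6]
print(round(removed_fraction, 2))   # 0.57
```

In this toy run, four of the seven words are dropped before any word-shape matching is attempted, which mirrors in spirit the reported effect of removing a large share of words from consideration; the actual features and thresholds used in the paper may differ.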