Effective text extraction and recognition for WWW images

Authors:
Jun Sun;Zhulong Wang;Hao Yu;Fumihito Nishino;Yutaka Katsuyama;Satoshi Naoi
Affiliations:
Fujitsu R&D Center Co., Ltd., Beijing P. R. China;Fujitsu R&D Center Co., Ltd., Beijing P. R. China;Fujitsu R&D Center Co., Ltd., Beijing P. R. China;Fujitsu R&D Center Co., Ltd., Beijing P. R. China;Fujitsu Laboratories LTD, Nakahara-ku, Kawasaki, Japan;Fujitsu Laboratories LTD, Nakahara-ku, Kawasaki, Japan
Venue:
Proceedings of the 2003 ACM symposium on Document engineering
Year:
2003

Abstract

Images play an important role in web content delivery, and many WWW images contain text that can be used for web indexing and searching. This paper proposes a new text extraction and recognition algorithm. Character strokes are first extracted from the image by color clustering and connected component analysis. A novel stroke verification algorithm then removes non-character strokes. The verified strokes are used to build a binary text line image, which is segmented and recognized by dynamic programming. Since the text in a WWW image is usually closely related to the webpage content, approximate string matching is used to revise the recognition result by matching the content of the webpage against the content of the image. This post-processing not only improves recognition performance but can also be used in other applications, such as finding the correspondence between an image and a webpage paragraph.
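
The post-processing step can be illustrated with a minimal sketch. The abstract does not specify which approximate string matching variant is used, so the code below assumes a standard edit-distance dynamic program in which the match may start anywhere in the webpage text. The function names, the `max_error_rate` threshold, and the sample strings are hypothetical and are not taken from the paper.

```python
from typing import Tuple


def best_approximate_match(pattern: str, text: str) -> Tuple[int, int, int]:
    """Find the substring of `text` with the smallest edit distance to `pattern`.

    Returns (distance, start, end) so that text[start:end] is the match.
    Assumption: unit costs for substitution, insertion, and deletion.
    """
    m = len(pattern)
    # Column 0: pattern[:i] against the empty substring costs i edits.
    prev = [(i, 0) for i in range(m + 1)]  # (cost, start index of the match)
    best = (m, 0, 0)
    for j, ch in enumerate(text, start=1):
        # An empty pattern prefix matches for free, starting at position j.
        cur = [(0, j)]
        for i in range(1, m + 1):
            diag_cost, diag_start = prev[i - 1]
            up_cost, up_start = cur[i - 1]
            left_cost, left_start = prev[i]
            cur.append(min(
                (diag_cost + (pattern[i - 1] != ch), diag_start),  # match or substitute
                (up_cost + 1, up_start),      # pattern character with no counterpart
                (left_cost + 1, left_start),  # extra character in the text
            ))
        if cur[m][0] < best[0]:
            best = (cur[m][0], cur[m][1], j)
        prev = cur
    return best


def revise_with_page_text(ocr_text: str, page_text: str,
                          max_error_rate: float = 0.3) -> str:
    """Replace an OCR result with the closest webpage substring if it is close enough.

    `max_error_rate` is an illustrative threshold, not a value from the paper.
    """
    if not ocr_text:
        return ocr_text
    distance, start, end = best_approximate_match(ocr_text, page_text)
    if distance / len(ocr_text) <= max_error_rate:
        return page_text[start:end]
    return ocr_text


page = "Welcome to Fujitsu Laboratories in Kawasaki"
print(revise_with_page_text("Fujitsv Lab0ratories", page))
```

In this sketch a recognition result is replaced only when a sufficiently similar string occurs in the webpage text; otherwise the original result is kept unchanged. The matched position (`start`, `end`) could also serve as a hint for linking the image to a paragraph of the page.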