Effective text extraction and recognition for WWW images

  • Authors:
  • Jun Sun;Zhulong Wang;Hao Yu;Fumihito Nishino;Yutaka Katsuyama;Satoshi Naoi

  • Affiliations:
  • Fujitsu R&D Center Co., Ltd., Beijing P. R. China;Fujitsu R&D Center Co., Ltd., Beijing P. R. China;Fujitsu R&D Center Co., Ltd., Beijing P. R. China;Fujitsu R&D Center Co., Ltd., Beijing P. R. China;Fujitsu Laboratories LTD, Nakahara-ku, Kawasaki, Japan;Fujitsu Laboratories LTD, Nakahara-ku, Kawasaki, Japan

  • Venue:
  • Proceedings of the 2003 ACM symposium on Document engineering
  • Year:
  • 2003

Abstract

Images play an important role in web content delivery, and many WWW images contain text that can be used for web indexing and searching. This paper proposes a new text extraction and recognition algorithm. Character strokes in the image are first extracted by color clustering and connected component analysis. A novel stroke verification algorithm then removes non-character strokes. The verified strokes are used to build a binary text line image, which is segmented and recognized by dynamic programming. Because text in a WWW image is usually closely related to the webpage content, approximate string matching is used to revise the recognition result by matching the content of the webpage against the text recognized in the image. This post-processing not only improves recognition performance but can also be used in other applications, such as finding the webpage paragraph that corresponds to an image.
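
The abstract does not specify how the approximate string matching step is carried out, so the following is only a minimal sketch of one common approach: an edit-distance dynamic program that finds the webpage substring closest to an OCR result and substitutes it when the error is small. The function names `best_approximate_match` and `revise`, and the `max_error_rate` threshold, are hypothetical and are not taken from the paper.

```python
def best_approximate_match(ocr_text, page_text):
    """Find the substring of page_text closest to ocr_text in edit distance.

    Returns (distance, start, end) such that page_text[start:end] is the match.
    Illustrative sketch only; not the paper's actual matching procedure.
    """
    m = len(ocr_text)
    # prev[i] = (cost, start) for aligning ocr_text[:i] with a page substring
    # that ends at the current page position and begins at index `start`.
    prev = [(i, 0) for i in range(m + 1)]
    best = (prev[m][0], 0, 0)
    for j, ch in enumerate(page_text, start=1):
        # An empty OCR prefix costs nothing, so a match may begin anywhere.
        cur = [(0, j)]
        for i in range(1, m + 1):
            substitute = (prev[i - 1][0] + (ocr_text[i - 1] != ch), prev[i - 1][1])
            extra_page_char = (prev[i][0] + 1, prev[i][1])
            extra_ocr_char = (cur[i - 1][0] + 1, cur[i - 1][1])
            cur.append(min(substitute, extra_page_char, extra_ocr_char))
        if cur[m][0] < best[0]:
            best = (cur[m][0], cur[m][1], j)
        prev = cur
    return best


def revise(ocr_text, page_text, max_error_rate=0.3):
    """Replace an OCR result with the matched page text if it is close enough.

    max_error_rate is an assumed tuning parameter, not a value from the paper.
    """
    distance, start, end = best_approximate_match(ocr_text, page_text)
    if ocr_text and distance <= max_error_rate * len(ocr_text):
        return page_text[start:end]
    return ocr_text


page = "News from FUJITSU Laboratories"
print(best_approximate_match("FUJ1TSU", page))  # (1, 10, 17)
print(revise("FUJ1TSU", page))                  # FUJITSU
```

The returned `start` and `end` offsets also indicate where in the page the image text was found, which is the kind of information an image-to-paragraph correspondence application would need.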