Document area identification for extending books without markers

  • Authors: Akihiro Miyata; Ko Fujimura
  • Affiliations: NTT Corporation, Yokosuka, Japan; NTT Corporation, Yokosuka, Japan
  • Venue: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
  • Year: 2011

Quantified Score

Hi-index 0.01

Abstract

We present a method of document area identification that uses consecutive characters in the non-reading direction as search keys. We use this method to develop a prototype system called Kappan, which enables service providers and users to create hyperlinks in books without markers. Existing techniques generally require markers to be printed on the page if a hyperlink is to be created. We argue that applying the concept of a search index makes markers unnecessary. Kappan associates indexed text areas in a large number of books with supporting digital content. The indexed text areas, freely defined by service providers or users, are identified by applying OCR (Optical Character Recognition) to images of small areas of the printed page and extracting highly specific and efficient search keys from the recognized text. Traditional text indexing methods must extract long character sequences from the partial image in order to identify the area exactly, given the sheer number of book pages. However, because the average OCR error rate exceeds 20 percent when the partial image is captured by a camera-equipped cellular phone, it is highly probable that many characters would be misrecognized and area identification would thus fail. In contrast, our indexing method extracts area-specific clues from fewer characters and can identify the area exactly even when the partial image is small and the extracted text contains misrecognized characters. An experiment shows that our method can identify the exact area from among more than one million areas with accuracy rates of 99 percent and 96 percent for OCR error rates of 0 percent and 22 percent, respectively.
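
The core idea of the abstract, keys formed from characters that are adjacent across lines rather than along a line, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each page is already available as a list of text lines on a fixed-pitch character grid, and the names `vertical_keys`, `build_index`, and `identify`, the key length `k`, and the simple voting scheme are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def vertical_keys(lines, k=2):
    """Yield (row, col, key), where key joins the k characters stacked in the
    same column of k consecutive lines (the non-reading direction for
    horizontally written text). Assumes a fixed-pitch character grid."""
    for r in range(len(lines) - k + 1):
        width = min(len(lines[r + i]) for i in range(k))
        for c in range(width):
            yield r, c, "".join(lines[r + i][c] for i in range(k))


def build_index(pages, k=2):
    """Map each vertical key to every (page_id, row, col) where it occurs."""
    index = defaultdict(list)
    for page_id, lines in pages.items():
        for r, c, key in vertical_keys(lines, k):
            index[key].append((page_id, r, c))
    return index


def identify(index, snippet, k=2):
    """Vote for the (page_id, row_offset, col_offset) that best explains the
    snippet. Keys damaged by OCR errors simply fail to vote, so a few
    misrecognized characters need not prevent identification."""
    votes = Counter()
    for r, c, key in vertical_keys(snippet, k):
        for page_id, pr, pc in index.get(key, ()):
            votes[(page_id, pr - r, pc - c)] += 1
    return votes.most_common(1)[0] if votes else None


# Toy data (invented for this sketch, not from the paper).
pages = {
    "p1": ["abcdef", "ghijkl", "mnopqr"],
    "p2": ["abcxyz", "ghi123", "stuvwx"],
}
index = build_index(pages)

clean = ["ijk", "opq"]   # rows 1-2, columns 2-4 of page p1
noisy = ["iXk", "opq"]   # same area with one misrecognized character

print(identify(index, clean))  # (('p1', 1, 2), 3)
print(identify(index, noisy))  # (('p1', 1, 2), 2)
```

In this toy case the two lines of `p1` and `p2` start with the same characters in the reading direction, yet the stacked pairs differ, so even a three-character-wide snippet with one OCR error still resolves to a single area. A real system would also need to handle proportional fonts, skewed camera images, and ties between candidate areas, none of which this sketch attempts.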