An indexed full-text search method of printed document images with an M-tree

Authors:
Hajime Imura;Yuzuru Tanaka
Affiliations:
Technology Hokkaido University, Sapporo, Japan;Technology Hokkaido University, Sapporo, Japan
Venue:
RIAO '10 Adaptivity, Personalization and Fusion of Heterogeneous Information
Year:
2010

Abstract

This paper describes an indexed full-text search method that finds the occurrences of a specified character string image in printed document images. It is based on N-gram indexing with an M-tree index structure. A full-text search method is important for working with collections of historical letterpress printing. The proposed method is independent of language and font because it uses a pseudo-coding scheme based on the statistical features of character shapes. Conventional word spotting methods require a sequential scan of the whole document image and a matching calculation over the whole descriptor sequence of a document. The proposed N-gram-based indexing method accelerates the search process with an M-tree. The method was evaluated in terms of search time and recall-precision curves for N-gram-based query strings. Our experiments demonstrated that the proposed approach makes search one hundred times faster.
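
The idea of N-gram indexing in a metric space can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each character image has already been reduced to a small numeric shape descriptor (the two-dimensional vectors below are invented), that N-grams are compared with Euclidean distance, and that a flat, one-level pivot index stands in for a real M-tree. The names `ngrams` and `FlatMetricIndex` are hypothetical. What the sketch shares with an M-tree is the pruning rule: a group of entries is skipped when the triangle inequality shows that none of them can lie within the query radius.

```python
import math
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance; it is a metric, so triangle-inequality pruning is valid."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ngrams(features: List[Vector], n: int) -> List[Tuple[int, Vector]]:
    """Slide a window of n character descriptors over the sequence.

    Returns (start position, concatenated descriptor) pairs.
    """
    out = []
    for i in range(len(features) - n + 1):
        window = tuple(v for f in features[i:i + n] for v in f)
        out.append((i, window))
    return out


class FlatMetricIndex:
    """One-level pivot index (a simplified stand-in for an M-tree, by assumption)."""

    def __init__(self, items: List[Tuple[int, Vector]], num_pivots: int) -> None:
        # Assumption: evenly spaced entries serve as pivots (routing objects).
        step = max(1, len(items) // num_pivots)
        self.pivots = [items[i][1] for i in range(0, len(items), step)][:num_pivots]
        self.buckets: List[List[Tuple[int, Vector]]] = [[] for _ in self.pivots]
        self.radii = [0.0] * len(self.pivots)
        for pos, vec in items:
            dists = [euclidean(vec, p) for p in self.pivots]
            k = min(range(len(dists)), key=dists.__getitem__)
            self.buckets[k].append((pos, vec))
            # Covering radius: largest distance from the pivot to any of its entries.
            self.radii[k] = max(self.radii[k], dists[k])

    def range_search(self, query: Vector, r: float) -> List[int]:
        """Return start positions of all N-grams within distance r of the query."""
        hits = []
        for pivot, radius, bucket in zip(self.pivots, self.radii, self.buckets):
            # If d(query, pivot) - radius > r, no entry in this bucket can match.
            if euclidean(query, pivot) - radius > r:
                continue
            for pos, vec in bucket:
                if euclidean(query, vec) <= r:
                    hits.append(pos)
        return sorted(hits)


# Invented descriptors for a six-character line of text.
doc = [(0.1, 0.9), (0.5, 0.5), (0.9, 0.1), (0.1, 0.9), (0.5, 0.5), (0.3, 0.3)]
index = FlatMetricIndex(ngrams(doc, 2), num_pivots=2)
query = (0.1, 0.9, 0.5, 0.5)  # descriptor of the queried character bigram
print(index.range_search(query, 0.05))  # prints [0, 3]
```

In this toy run the second bucket is never scanned, because its pivot is too far from the query. A real M-tree applies the same test recursively over a balanced tree of routing objects.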