This paper describes a new document ranking system based on layout similarity. The user's information need is represented by a set of "wanted" documents, and the system ranks the documents in the collection according to their similarity to this set. Rather than performing complete document analysis, the system extracts only text lines and models a layout as the relationships between pairs of these lines. The paper explores three novel feature sets to support scoring in large document collections. First, pairs of lines are used to form quadrilaterals, which are represented by their turning functions and compared using a non-Euclidean distance. Second, the quadrilaterals are represented by 5D Euclidean vectors. Third, each individual line is represented by a 5D Euclidean vector. We compare the classification performance and computation speed of these three feature sets on a large database of diverse documents, including forms, academic papers, and handwritten pages in English and Arabic. The approach using quadrilaterals and turning functions produces slightly better results, but the approach using vectors to represent text lines is much faster on large document databases.
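To make the first feature set more concrete, the following is a minimal sketch of a turning function for a polygon and an L2 distance between two turning functions. It is an illustration only, not the system's implementation: the function names `turning_function` and `turning_distance` are hypothetical, the vertices are assumed to be listed counter-clockwise, and the distance uses a fixed starting vertex and a fixed orientation. The abstract does not say how the quadrilaterals are built from line pairs or how the comparison is normalized, so those details are assumptions here.

```python
import bisect
import math


def turning_function(vertices):
    """Return (starts, angles) describing a polygon's turning function.

    vertices: list of (x, y) points in counter-clockwise order (assumed).
    starts:   normalized arc length in [0, 1) at which each edge begins.
    angles:   cumulative tangent angle (radians) along each edge.
    """
    n = len(vertices)
    edges = [
        (vertices[(i + 1) % n][0] - vertices[i][0],
         vertices[(i + 1) % n][1] - vertices[i][1])
        for i in range(n)
    ]
    lengths = [math.hypot(dx, dy) for dx, dy in edges]
    perimeter = sum(lengths)
    headings = [math.atan2(dy, dx) for dx, dy in edges]

    # Accumulate signed turns so the angle grows along the boundary
    # instead of wrapping around at +/- pi.
    angles = [headings[0]]
    for i in range(1, n):
        turn = headings[i] - headings[i - 1]
        turn = (turn + math.pi) % (2 * math.pi) - math.pi
        angles.append(angles[-1] + turn)

    # Dividing by the perimeter makes the representation scale invariant.
    starts = [0.0]
    for length in lengths[:-1]:
        starts.append(starts[-1] + length / perimeter)
    return starts, angles


def _value_at(starts, angles, s):
    """Evaluate the step function at normalized arc length s."""
    return angles[bisect.bisect_right(starts, s) - 1]


def turning_distance(poly_a, poly_b):
    """L2 distance between two turning functions over [0, 1].

    Simplification (assumption): no minimization over rotation or
    starting point is performed.
    """
    starts_a, angles_a = turning_function(poly_a)
    starts_b, angles_b = turning_function(poly_b)
    cuts = sorted(set(starts_a + starts_b + [1.0]))
    total = 0.0
    for lo, hi in zip(cuts, cuts[1:]):
        mid = (lo + hi) / 2.0
        diff = _value_at(starts_a, angles_a, mid) - _value_at(starts_b, angles_b, mid)
        total += diff * diff * (hi - lo)
    return math.sqrt(total)


if __name__ == "__main__":
    square = [(0, 0), (1, 0), (1, 1), (0, 1)]
    big_square = [(0, 0), (2, 0), (2, 2), (0, 2)]
    rectangle = [(0, 0), (2, 0), (2, 1), (0, 1)]
    print(turning_distance(square, big_square))
    print(turning_distance(square, rectangle))
```

Because the arc length is normalized by the perimeter and only edge directions are used, this representation does not change under translation or uniform scaling, so the two squares above compare as identical while the rectangle does not. A distance of this kind is computed by integrating over step functions rather than by subtracting fixed-length vectors, which is one plausible reason such a comparison costs more than a 5D Euclidean distance.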