Document Ranking by Layout Relevance

Authors:
May Huang;Daniel DeMenthon;David Doermann;Lynn Golebiowski
Affiliations:
University of Maryland;University of Maryland;University of Maryland;Booz Allen Hamilton, 134 National Business Parkway, Annapolis Junction, MD
Venue:
ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
Year:
2005

Abstract

This paper describes a new document ranking system based on layout similarity. The user's need is represented by a set of "wanted" documents, and the system ranks the documents in the collection according to this need. Rather than performing complete document analysis, the system extracts text lines and models a layout as the relationships between pairs of these lines. The paper explores three novel feature sets to support scoring in large document collections. In the first, pairs of lines form quadrilaterals, which are represented by their turning functions and compared with a non-Euclidean distance. In the second, the quadrilaterals are represented by 5D Euclidean vectors. In the third, each individual line is represented by a 5D Euclidean vector. We compare the classification performance and computation speed of these three feature sets on a large database of diverse documents, including forms, academic papers, and handwritten pages in English and Arabic. The approach using quadrilaterals and turning functions produces slightly better results, but the approach using vectors to represent text lines is much faster for large document databases.
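
To make the first feature set more concrete, the following is a minimal Python sketch of a turning function for a quadrilateral built from two text lines. It is illustrative only and is not taken from the paper. Several things here are assumptions: each text line is modeled as a segment with two endpoints, the hypothetical helper `quad_from_lines` joins those endpoints into a quadrilateral, and `turning_distance` is a simplified L2 distance that keeps the starting vertex fixed rather than minimizing over starting points and rotations, as a full polygon metric would.

```python
import math

def quad_from_lines(line_a, line_b):
    """Hypothetical helper: join the endpoints of two text-line segments.

    Each line is ((x_left, y_left), (x_right, y_right)). The vertex order
    (a_left, a_right, b_right, b_left) is an assumption of this sketch.
    """
    (a_left, a_right), (b_left, b_right) = line_a, line_b
    return [a_left, a_right, b_right, b_left]

def turning_function(vertices):
    """Return (breakpoints, angles) for a polygon given as a vertex list.

    breakpoints[i] is the normalized arc length where edge i starts;
    angles[i] is the cumulative tangent angle (radians) along edge i.
    """
    n = len(vertices)
    edges = [(vertices[(i + 1) % n][0] - vertices[i][0],
              vertices[(i + 1) % n][1] - vertices[i][1]) for i in range(n)]
    lengths = [math.hypot(dx, dy) for dx, dy in edges]
    perimeter = sum(lengths)
    breakpoints = [0.0]
    angles = [math.atan2(edges[0][1], edges[0][0])]
    for i in range(1, n):
        (ax, ay), (bx, by) = edges[i - 1], edges[i]
        # Signed turn from the previous edge to this one.
        turn = math.atan2(ax * by - ay * bx, ax * bx + ay * by)
        angles.append(angles[-1] + turn)
        breakpoints.append(breakpoints[-1] + lengths[i - 1] / perimeter)
    return breakpoints, angles

def turning_distance(poly_a, poly_b):
    """Simplified L2 distance between two turning functions.

    The starting vertex is fixed; no minimization over shifts or rotations.
    """
    sa, ta = turning_function(poly_a)
    sb, tb = turning_function(poly_b)
    cuts = sorted(set(sa + sb + [1.0]))
    total = 0.0
    ia = ib = 0
    for lo, hi in zip(cuts, cuts[1:]):
        # Advance to the edge of each polygon that covers [lo, hi).
        while ia + 1 < len(sa) and sa[ia + 1] <= lo:
            ia += 1
        while ib + 1 < len(sb) and sb[ib + 1] <= lo:
            ib += 1
        total += (ta[ia] - tb[ib]) ** 2 * (hi - lo)
    return math.sqrt(total)

# Two parallel text lines of equal length, one unit apart (made-up coordinates).
wide = quad_from_lines(((0, 0), (4, 0)), ((0, 1), (4, 1)))
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(turning_distance(wide, square))
```

Because arc length is normalized by the perimeter, this representation does not change when a quadrilateral is uniformly scaled, which is one reason turning functions are a common choice for shape comparison. The distance is computed by integrating piecewise-constant functions, which is also why it is more expensive than a Euclidean distance between fixed-length vectors.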