Information Retrieval from Documents: A Survey

Authors:
M. Mitra;B. B. Chaudhuri
Affiliations:
Indian Statistical Institute, Calcutta. mandar@isical.ac.in;Indian Statistical Institute, Calcutta. bbc@isical.ac.in
Venue:
Information Retrieval
Year:
2000

Citing 62
Cited 22

References:
Retrieval techniques

Annual review of information science and technology, vol. 22
Term-weighting approaches in automatic text retrieval

Information Processing and Management: an International Journal
Automatic text processing: the transformation, analysis, and retrieval of information by computer

Automatic text processing: the transformation, analysis, and retrieval of information by computer
Boolean operations

Information retrieval
Extended Boolean models

Information retrieval
Color indexing

International Journal of Computer Vision
Knowledge-Directed Interpretation of Mechanical Engineering Drawings

IEEE Transactions on Pattern Analysis and Machine Intelligence
Celesstin: CAD Conversion of Mechanical Drawings

Computer
A word shape analysis approach to lexicon based word recognition

Pattern Recognition Letters
Intelligent forms processing system

Machine Vision and Applications - Special issue: document image analysis techniques
Query expansion using lexical-semantic relations

SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval

SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Incorporation of a Markov model of language syntax in a text recognition algorithm

Document image analysis
CLARIT-TREC experiments

TREC-2 Proceedings of the second conference on Text retrieval conference
Indexing handwriting using word matching

Proceedings of the first ACM international conference on Digital libraries
Font and function word identification in document recognition

Computer Vision and Image Understanding
Periodicity, Directionality, and Randomness: Wold Features for Image Modeling and Retrieval

IEEE Transactions on Pattern Analysis and Machine Intelligence
MARCO: MAp Retrieval by COntent

IEEE Transactions on Pattern Analysis and Machine Intelligence
Texture Features for Browsing and Retrieval of Image Data

IEEE Transactions on Pattern Analysis and Machine Intelligence
Query expansion using local and global document analysis

SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
Pivoted document length normalization

SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
Querying across languages: a dictionary-based approach to multilingual information retrieval

SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
Experiments in multilingual information retrieval using the SPIDER system

SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
Applying algebraic and differential invariants for logo recognition

Machine Vision and Applications
Phrasal translation and query expansion techniques for cross-language information retrieval

Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval
Word spotting: indexing handwritten manuscripts

Intelligent multimedia information retrieval
NETRA: a toolbox for navigating large image databases

NETRA: a toolbox for navigating large image databases
Resolving ambiguity for cross-language retrieval

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
The indexing and retrieval of document images: a survey

Computer Vision and Image Understanding - Special issue on document image understanding and retrieval
Summarization of imaged documents without OCR

Computer Vision and Image Understanding - Special issue on document image understanding and retrieval
Comparing images using joint histograms

Multimedia Systems - Special issue on video content based retrieval
A vector space model for automatic indexing

Communications of the ACM
Introduction to Modern Information Retrieval

Introduction to Modern Information Retrieval
Content-Based Image Retrieval Systems

Computer
Query by Image and Video Content: The QBIC System

Computer
Automatic Indexing and Content-Based Retrieval of Captioned Images

Computer
Keyword Spotting in Poorly Printed Documents using Pseudo 2-D Hidden Markov Models

IEEE Transactions on Pattern Analysis and Machine Intelligence
Logo and Word Matching Using a General Approach to Signal Registration

ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
Document image similarity and equivalence detection

ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
Using Character Shape Coding for Information Retrieval

ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
An Approximate String Match for Garbled Text with Various Accuracy

ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
Retrieval methods for English-text with missrecognized OCR characters

ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
Document image database retrieval and browsing using texture analysis

ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
Image Categorization Using Texture Features

ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
Measuring the Effects of OCR Errors on Similarity Linking

ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
The Retrieval of Document Images: A Brief Survey

ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
The Detection of Duplicates in Document Image Databases

ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
Probabilistic Retrieval of OCR Degraded Text Using N-Grams

ECDL '97 Proceedings of the First European Conference on Research and Advanced Technology for Digital Libraries
Cross-Language Information Retrieval in a Multilingual Legal Domain

ECDL '97 Proceedings of the First European Conference on Research and Advanced Technology for Digital Libraries
Extraction of Indicative Summary Sentences from Imaged Documents

ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
Models and algorithms for efficient color image indexing

CAIVL '97 Proceedings of the 1997 Workshop on Content-Based Access of Image and Video Libraries (CBAIVL '97)
Image Indexing Using Color Correlograms

CVPR '97 Proceedings of the 1997 Conference on Computer Vision and Pattern Recognition (CVPR '97)
Word Spotting: A New Approach to Indexing Handwriting

CVPR '96 Proceedings of the 1996 Conference on Computer Vision and Pattern Recognition (CVPR '96)
Structural Compression for Documents Analysis

ICPR '96 Proceedings of the International Conference on Pattern Recognition (ICPR '96), Volume III
Clustering OCR-ed texts for browsing document image database

ICDAR '95 Proceedings of the Third International Conference on Document Analysis and Recognition (Volume 1)
Dienst: Implementation Reference Manual

Dienst: Implementation Reference Manual
Experiments in Multi-Lingual Information Retrieval

Experiments in Multi-Lingual Information Retrieval
Adaptive vector space text filtering for monolingual and cross-language application

Adaptive vector space text filtering for monolingual and cross-language application
Color-spatial image indexing and applications

Color-spatial image indexing and applications
A blueprint for automatic indexing

ACM SIGIR Forum
The Text REtrieval Conferences (TRECs)

TIPSTER '96 Proceedings of a workshop held at Vienna, Virginia: May 6-8, 1996
The text retrieval conferences (TRECS)

TIPSTER '98 Proceedings of a workshop held at Baltimore, Maryland: October 13-15, 1998

Cited by:
Retrieval by Layout Similarity of Documents Represented with MXY Trees

DAS '02 Proceedings of the 5th International Workshop on Document Analysis Systems V
New Challenges for Cross-Language Information Retrieval: Multimedia Data and the User Experience

CLEF '00 Revised Papers from the Workshop of Cross-Language Evaluation Forum on Cross-Language Information Retrieval and Evaluation
Information Retrieval in Document Image Databases

IEEE Transactions on Knowledge and Data Engineering
Identification of common methods used for ontology integration tasks

Proceedings of the first international workshop on Interoperability of heterogeneous information systems
Hangul Document Image Retrieval System Using Rank-based Recognition

ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
An Approach for Stemming in Symbolically Compressed Indian Language Imaged Documents

ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
Font Adaptive Word Indexing of Modern Printed Documents

IEEE Transactions on Pattern Analysis and Machine Intelligence
Retrieval of machine-printed Latin documents through Word Shape Coding

Pattern Recognition
Retrieval of machine-printed Latin documents through Word Shape Coding

Pattern Recognition
A survey and classification of semantic search approaches

International Journal of Metadata, Semantics and Ontologies
Effectively Searching Maps in Web Documents

ECIR '09 Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval
Scalable indexing for layout based document retrieval and ranking

Proceedings of the 2010 ACM Symposium on Applied Computing
A kernel-based approach to document retrieval

DAS '10 Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
Decomposing background topics from keywords by principal component pursuit

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
A method for user profile adaptation in document retrieval

ACIIDS'11 Proceedings of the Third international conference on Intelligent information and database systems - Volume Part II
Comparative information retrieval evaluation for scanned documents

Proceedings of the 15th WSEAS international conference on Computers
Event log mining tool for large scale HPC systems

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Improved stable retrieval in noisy collections

ICTIR'11 Proceedings of the Third international conference on Advances in information retrieval theory
Enabling search over large collections of Telugu document images – an automatic annotation based approach

ICVGIP'06 Proceedings of the 5th Indian conference on Computer Vision, Graphics and Image Processing
Efficient word retrieval by means of SOM clustering and PCA

DAS'06 Proceedings of the 7th international conference on Document Analysis Systems
Exploring digital libraries with document image retrieval

ECDL'07 Proceedings of the 11th European conference on Research and Advanced Technology for Digital Libraries
Keyword spotting in unconstrained handwritten Chinese documents using contextual word model

Image and Vision Computing

Abstract

Given the phenomenal growth in the variety and quantity of data available to users through electronic media, there is a great demand for efficient and effective ways to organize and search through all this information. Besides speech, our principal means of communication is through visual media, and in particular, through documents. In this paper, we provide an update on Doermann's comprehensive survey (1998) of research results in the broad area of document-based information retrieval. The scope of this survey is somewhat broader than Doermann's, and there is a greater emphasis on relating document image analysis methods to conventional IR methods.

Documents are available in a wide variety of formats. Technical papers are often available as ASCII files of clean, correct text. Other documents may only be available as hard copies. These documents have to be scanned and stored as images so that they may be processed by a computer. The textual content of these documents may also be extracted and recognized using OCR methods. Our survey covers the broad spectrum of methods that are required to handle different formats like text and images. The core of the paper focuses on methods that manipulate document images directly and perform various information processing tasks such as retrieval, categorization, and summarization, without attempting to completely recognize the textual content of the document. We start, however, with a brief overview of traditional IR techniques that operate on clean text. We also discuss research dealing with text that is generated by running OCR on document images. Finally, we briefly touch on the related problem of content-based image retrieval.