Automatic categorization of figures in scientific documents

Authors:
Xiaonan Lu;Prasenjit Mitra;James Z. Wang;C. Lee Giles
Affiliations:
The Pennsylvania State University, University Park, Pennsylvania;The Pennsylvania State University, University Park, Pennsylvania;The Pennsylvania State University, University Park, Pennsylvania;The Pennsylvania State University, University Park, Pennsylvania
Venue:
Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries
Year:
2006

Abstract

Figures are an important source of non-textual information in scientific documents. Current digital libraries do not provide users with tools to retrieve documents based on the information contained in their figures. We propose an architecture for retrieving documents by integrating figures with other information. The first step toward integrated document search is to categorize figures into a set of pre-defined types. We propose several categories of figures based on their function in scholarly articles, and we have developed a machine-learning-based approach for categorizing figures automatically. The architecture uses both global features, such as texture, and part features, such as lines, to discriminate among figure categories. The approach has been evaluated on a testbed document set collected from the CiteSeer scientific literature digital library. Experimental evaluation shows that our algorithms produce results acceptable for real-world use. Our tools will be integrated into a scientific document digital library.
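
The idea of combining a global feature with a part feature can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the feature definitions, the classifier, or the category names, so everything below is an assumption chosen for brevity. Here `texture_energy` is a crude stand-in for a texture feature, `line_fraction` is a crude stand-in for a line feature, a nearest-centroid rule stands in for the learned classifier, and the labels "plot" and "photo" are hypothetical. Images are lists of rows of grayscale values in [0, 1], with 0 meaning dark.

```python
from statistics import mean


def texture_energy(img):
    """Global feature (assumed): mean absolute difference between horizontal neighbours."""
    diffs = [abs(row[x + 1] - row[x]) for row in img for x in range(len(row) - 1)]
    return mean(diffs) if diffs else 0.0


def longest_dark_run(seq, threshold=0.5):
    """Length of the longest run of consecutive pixels darker than threshold."""
    best = run = 0
    for value in seq:
        run = run + 1 if value < threshold else 0
        best = max(best, run)
    return best


def line_fraction(img, min_frac=0.8):
    """Part feature (assumed): share of rows and columns holding a long dark run."""
    columns = [list(col) for col in zip(*img)]
    scans = list(img) + columns
    hits = sum(1 for s in scans if longest_dark_run(s) >= min_frac * len(s))
    return hits / len(scans)


def features(img):
    return (texture_energy(img), line_fraction(img))


def train_centroids(labelled):
    """Average the feature vectors of each label; labelled is a list of (image, label)."""
    groups = {}
    for img, label in labelled:
        groups.setdefault(label, []).append(features(img))
    return {label: tuple(mean(col) for col in zip(*vecs)) for label, vecs in groups.items()}


def classify(img, centroids):
    """Return the label whose centroid is nearest in squared Euclidean distance."""
    f = features(img)
    return min(centroids, key=lambda lab: sum((a - b) ** 2 for a, b in zip(f, centroids[lab])))


def axes_image(n, diagonal=False):
    """Toy 'plot': white background, dark left and bottom axes, optional dark diagonal."""
    return [[0.0 if x == 0 or y == n - 1 or (diagonal and x == y) else 1.0
             for x in range(n)] for y in range(n)]


def checker_image(n):
    """Toy 'photo': checkerboard texture without any long lines."""
    return [[0.2 if (x + y) % 2 == 0 else 0.8 for x in range(n)] for y in range(n)]


def mottled_image(n):
    """Second toy 'photo': a different dense texture."""
    return [[((3 * x + 5 * y) % 7) / 6 for x in range(n)] for y in range(n)]


centroids = train_centroids([(axes_image(8), "plot"), (checker_image(8), "photo")])
print(classify(axes_image(10, diagonal=True), centroids))
print(classify(mottled_image(8), centroids))
```

On these toy inputs the line feature separates the axis-bearing image from the textured ones, while the texture feature separates dense patterns from mostly blank backgrounds. A real system would need far richer features and a classifier trained on labelled figures.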