Automatic extraction of data points and text blocks from 2-dimensional plots in digital documents

Authors:
Saurabh Kataria;William Browuer;Prasenjit Mitra;C. Lee Giles
Affiliations:
Pennsylvania State University, University Park, PA;Pennsylvania State University, University Park, PA;Pennsylvania State University, University Park, PA;Pennsylvania State University, University Park, PA
Venue:
AAAI'08 Proceedings of the 23rd national conference on Artificial intelligence - Volume 2
Year:
2008

Abstract

Two-dimensional (2-D) plots in digital documents on the web are an important but largely under-utilized source of information. In this paper, we outline how data and text can be extracted automatically from these plots, eliminating a time-consuming manual process. Our information extraction algorithm identifies the axes of a figure, extracts text blocks such as axis labels and legends, and identifies the data points in the figure. It also extracts the units that appear in the axis labels and segments the legends to identify the individual lines, the symbols, and their associated text explanations. The algorithm also handles the challenging task of separating overlapping text and data points. Our experiments indicate that these techniques are computationally efficient and provide acceptable accuracy.
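The abstract does not say how the axes are located, so the following is only an illustrative sketch of the first step (axis identification) and not the authors' method. It assumes a clean, binarized plot image given as a grid of 0/1 values (1 = ink) and uses a simple projection profile: the row and the column containing the most ink are taken to be the x-axis and y-axis. The function names `find_axes` and `data_pixels` and the sample grid `EXAMPLE_PLOT` are hypothetical and introduced here for illustration.

```python
from typing import List, Tuple

Grid = List[List[int]]  # binary image: 1 = ink, 0 = background (assumed input format)


def find_axes(image: Grid) -> Tuple[int, int]:
    """Return (x_axis_row, y_axis_col) using a projection profile.

    Assumption: each axis is a solid line, so it is the row or column
    with the largest number of ink pixels.
    """
    if not image or not image[0]:
        raise ValueError("image must be a non-empty 2-D grid")
    height, width = len(image), len(image[0])
    row_sums = [sum(row) for row in image]
    col_sums = [sum(image[r][c] for r in range(height)) for c in range(width)]
    # Break ties toward the bottom-most row and the left-most column,
    # where axes usually sit in a conventional plot.
    x_axis_row = max(range(height), key=lambda r: (row_sums[r], r))
    y_axis_col = max(range(width), key=lambda c: (col_sums[c], -c))
    return x_axis_row, y_axis_col


def data_pixels(image: Grid, x_axis_row: int, y_axis_col: int) -> List[Tuple[int, int]]:
    """List ink pixels strictly above the x-axis and to the right of the y-axis."""
    width = len(image[0])
    return [
        (r, c)
        for r in range(x_axis_row)
        for c in range(y_axis_col + 1, width)
        if image[r][c]
    ]


# Hypothetical 6x7 plot: y-axis in column 1, x-axis in row 4, two data points.
EXAMPLE_PLOT: Grid = [
    [0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
]

if __name__ == "__main__":
    axes = find_axes(EXAMPLE_PLOT)
    print("axes (row, col):", axes)
    print("data pixels:", data_pixels(EXAMPLE_PLOT, *axes))
```

A projection profile is used here only because it is short and dependency-free. It would not cope with skewed scans, grid lines, or text overlapping the data region, which are the harder cases the abstract says the paper addresses.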