Summarizing figures, tables, and algorithms in scientific publications to augment search results

Authors:
Sumit Bhatia;Prasenjit Mitra
Affiliations:
Pennsylvania State University, PA;Pennsylvania State University, PA
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2012

Citing 26

A trainable document summarizer
SIGIR '95 Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval

Natural-language retrieval of images based on descriptive captions
ACM Transactions on Information Systems (TOIS)

Advantages of query biased summaries in information retrieval
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval

The use of MMR, diversity-based reranking for reordering documents and producing summaries
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval

Summarizing text documents: sentence selection and evaluation metrics
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval

Advances in Automatic Text Summarization

Training Support Vector Machines: an Application to Face Detection
CVPR '97 Proceedings of the 1997 Conference on Computer Vision and Pattern Recognition (CVPR '97)

Learning with progressive transductive support vector machine
Pattern Recognition Letters

A task-oriented study on the influencing effects of query-biased summarisation in web searching
Information Processing and Management: an International Journal

Probability Estimates for Multi-class Classification by Pairwise Coupling
The Journal of Machine Learning Research

Complex spatio-temporal pattern queries
VLDB '05 Proceedings of the 31st international conference on Very large data bases

Associating Text and Graphics for Scientific Chart Understanding
ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition

Meta-data indexing for XPath location steps
Proceedings of the 2006 ACM SIGMOD international conference on Management of data

Pattern Recognition and Machine Learning (Information Science and Statistics)

Exploring and exploiting the limited utility of captions in recognizing intention in information graphics
ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics

TableSeer: automatic table metadata extraction and searching in digital libraries
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries

BioText Search Engine
Bioinformatics

Finding and using journal-article components: Impacts of disaggregation on teaching and research practice
Journal of the American Society for Information Science and Technology

An effective sentence-extraction technique using contextual information and statistical approaches for text summarization
Pattern Recognition Letters

Introduction to Information Retrieval

Multi-document summarization by sentence extraction
NAACL-ANLP-AutoSum '00 Proceedings of the 2000 NAACL-ANLP Workshop on Automatic Summarization

Automatic extraction of data points and text blocks from 2-dimensional plots in digital documents
AAAI'08 Proceedings of the 23rd national conference on Artificial intelligence - Volume 2

Generating synopses for document-element search
Proceedings of the 18th ACM conference on Information and knowledge management

The automatic creation of literature abstracts
IBM Journal of Research and Development

Finding algorithms in scientific articles
Proceedings of the 19th international conference on World wide web

LIBSVM: A library for support vector machines
ACM Transactions on Intelligent Systems and Technology (TIST)

Cited 2

Improving algorithm search using the algorithm co-citation network
Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries

"Building a search engine for algorithms" by Suppawong Tuarob, Prasenjit Mitra, and C. Lee Giles with Martin Vesely as coordinator
ACM SIGWEB Newsletter

Abstract

Special-purpose search engines are increasingly being built to retrieve document-elements such as tables, figures, and algorithms [Bhatia et al. 2010; Liu et al. 2007; Hearst et al. 2007]. These search engines present a thumbnail view of each document-element, document metadata such as the paper's title and authors, and the caption of the document-element. Although some authors in some disciplines write carefully tailored captions, authors generally assume that a caption will be read in the context of the surrounding document text. When the caption is presented out of context, as in a document-element search result, it may not contain enough information for the end-user to understand the content of the document-element. Consequently, end-users examining document-element search results would want a short “synopsis” of this information presented along with the document-element. A synopsis lets the end-user quickly understand the content of the document-element, because reading it takes less time than downloading, opening, and reading the full document to find the same information. It may also allow the end-user to examine more results than they otherwise would. In this paper, we present the first set of methods to automatically extract this synopsis for a document-element. We use Naïve Bayes and support vector machine classifiers to identify relevant sentences in the document text, based on each sentence's similarity and proximity to the caption and to the sentences that refer to the document-element. We compare the two classification methods and study the effects of the different features used. We also investigate how to choose the optimum synopsis size, one that balances the information content against the length of the generated synopses. Finally, we perform a user study to measure how the synopses generated by our proposed method compare with those of other state-of-the-art approaches.
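The abstract names two kinds of evidence for sentence relevance: similarity and proximity to the caption and to the sentences that refer to the document-element. The following is only a minimal sketch of how such features might be computed; the abstract does not give the paper's actual feature definitions, so the bag-of-words cosine similarity, the `1 / (1 + distance)` proximity score, and all function names here are assumptions made for illustration. The resulting feature tuples are the kind of input a Naïve Bayes or support vector machine classifier could be trained on.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between the term-frequency vectors of two strings."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def sentence_features(sentences, caption, reference_indices):
    """Return one (caption_sim, reference_sim, proximity) tuple per sentence.

    reference_indices holds the positions of sentences that mention the
    document-element (e.g. "Figure 2 shows ..."). Finding them is assumed
    to have been done beforehand and is outside this sketch.
    """
    features = []
    for i, sentence in enumerate(sentences):
        caption_sim = cosine_similarity(sentence, caption)
        # Highest similarity to any sentence that refers to the element.
        reference_sim = max(
            (cosine_similarity(sentence, sentences[r]) for r in reference_indices),
            default=0.0,
        )
        # Assumed proximity score: decays with the distance (counted in
        # sentences) to the nearest referring sentence.
        distance = min((abs(i - r) for r in reference_indices), default=None)
        proximity = 1.0 / (1.0 + distance) if distance is not None else 0.0
        features.append((caption_sim, reference_sim, proximity))
    return features
```

For example, calling `sentence_features(sentences, caption, [0])` on a document whose first sentence mentions the figure yields a proximity of 1.0 for that sentence and smaller values for sentences further away. A sentence-level classifier would then label each tuple as belonging to the synopsis or not.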