Searching online book documents and analyzing book citations

Authors:
Zhaohui Wu;Sujatha Das;Zhenhui Li;Prasenjit Mitra;C. Lee Giles
Affiliations:
Computer Science and Engineering, Pennsylvania State University, State College, PA, USA;Computer Science and Engineering, Pennsylvania State University, State College, PA, USA;Information Sciences and Technology, Pennsylvania State University, State College, PA, USA;Information Sciences and Technology, Pennsylvania State University, State College, PA, USA;Information Sciences and Technology, Pennsylvania State University, State College, PA, USA
Venue:
Proceedings of the 2013 ACM symposium on Document engineering
Year:
2013

Abstract

Academic search engines and digital libraries provide convenient online search and access facilities for scientific publications. However, most existing systems do not include books in their collections, although several books are freely available online. Academic books differ from papers in length, content, and structure. We argue that accounting for academic books is important in understanding and assessing scientific impact. We introduce an open-book search engine that extracts and indexes metadata, contents, and bibliography from online PDF book documents. To the best of our knowledge, no previous work gives a systematic study of building a search engine for books. We propose a hybrid approach for extracting the title and authors from a book that combines results from CiteSeer, a rule-based extractor, and an SVM-based extractor, leveraging web knowledge. For "table of contents" recognition, we propose rules that exploit multiple regularities in numbering and ordering. In addition, we study bibliography extraction and citation parsing for a large dataset of books. Finally, we use the multiple fields available in books to rank books in response to search queries. Our system can effectively extract metadata and contents from large collections of online books and provides efficient book search and retrieval facilities.
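
The abstract does not spell out the table-of-contents rules, so the following is only a minimal sketch of what a rule based on numbering and ordering regularities might look like. It is not the authors' method. The pattern `TOC_LINE`, the helpers `parse_toc_line` and `find_toc_entries`, and the `min_run` threshold are all hypothetical names and assumptions introduced for illustration: a line is treated as a candidate entry if it ends in a page number, and a run of candidates is accepted only if the page numbers never decrease.

```python
import re

# Hypothetical pattern (an assumption, not taken from the paper):
# optional section number, a title, optional dot leaders, trailing page number.
TOC_LINE = re.compile(
    r"^\s*(?P<num>\d+(?:\.\d+)*)?\.?\s*(?P<title>\D.*?)[\s.]+(?P<page>\d+)\s*$"
)


def parse_toc_line(line):
    """Return (section_number, title, page) for a candidate line, else None."""
    m = TOC_LINE.match(line)
    if not m:
        return None
    return m.group("num"), m.group("title").strip(), int(m.group("page"))


def find_toc_entries(lines, min_run=3):
    """Return the longest run of consecutive candidate lines whose page
    numbers are non-decreasing (the ordering regularity). Runs shorter
    than min_run are rejected as accidental matches."""
    best, run = [], []
    for line in lines:
        entry = parse_toc_line(line)
        if entry and (not run or entry[2] >= run[-1][2]):
            run.append(entry)
        else:
            # The run is broken: keep it if it is the longest so far,
            # then start a new run from this line if it is a candidate.
            if len(run) > len(best):
                best = run
            run = [entry] if entry else []
    if len(run) > len(best):
        best = run
    return best if len(best) >= min_run else []


if __name__ == "__main__":
    page_text = [
        "Contents",
        "1 Introduction ........ 1",
        "1.1 Background 3",
        "2 Methods 10",
        "Chapter 1",
        "Introduction",
    ]
    for entry in find_toc_entries(page_text):
        print(entry)
```

In this sketch the line "Chapter 1" matches the pattern but breaks the page ordering, so it ends the run instead of being absorbed into it. A real system would need further rules (roman-numeral pages, multi-line entries, consistency of section numbering) that the abstract only alludes to.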