Mining relational structure from millions of books: position paper

Authors:
David A. Smith;R. Manmatha;James Allan
Affiliations:
University of Massachusetts, Amherst, Amherst, MA, USA;University of Massachusetts, Amherst, Amherst, MA, USA;University of Massachusetts, Amherst, Amherst, MA, USA
Venue:
Proceedings of the 4th ACM workshop on Online books, complementary social media and crowdsourcing
Year:
2011

Citing 9

Copy detection mechanisms for digital documents
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data

Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web

Finding Near-Replicas of Documents and Servers on the Web
WebDB '98 Selected papers from the International Workshop on The World Wide Web and Databases

Methods for identifying versioned and plagiarized documents
Journal of the American Society for Information Science and Technology

Winnowing: local algorithms for document fingerprinting
Proceedings of the 2003 ACM SIGMOD international conference on Management of data

A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books
Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries

Finding similar files in a large file system
WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference

Partial duplicate detection for large book collections
Proceedings of the 20th ACM international conference on Information and knowledge management

A minimally supervised approach for detecting and ranking document translation pairs
WMT '11 Proceedings of the Sixth Workshop on Statistical Machine Translation

Cited 2

BooksOnline'11: 4th workshop on online books, complementary social media, and crowdsourcing
Proceedings of the 20th ACM international conference on Information and knowledge management

Report on BooksOnline'11: 4th workshop on online books, complementary social media, and crowdsourcing
ACM SIGIR Forum

Abstract

Existing large-scale scanned book collections have many shortcomings for data-driven research, from OCR of variable quality to the lack of accurate descriptive and structural metadata. We argue that complementary research in inferring relational metadata is important in its own right to support use of these collections and that it can help to mitigate other problems with scanned book collections.