Generating links by mining quotations

Authors:
Okan Kolak;Bill N. Schilit
Affiliations:
Google Research, Mountain View, CA, USA;Google Research, Mountain View, CA, USA
Venue:
Proceedings of the nineteenth ACM conference on Hypertext and hypermedia
Year:
2008

Abstract

Scanning books, magazines, and newspapers has become a widespread activity because people believe that much of the world's information still resides off-line. In general, after works are scanned, they are indexed for search and processed to add links. This paper describes a new approach to automatically adding links by mining popularly quoted passages. Our technique connects elements that are semantically rich, so strong relations are made. Moreover, link targets point within a work, facilitating navigation. This paper makes three contributions. First, we describe a scalable algorithm for mining repeated word sequences from extremely large text corpora. Second, we present techniques that filter and rank the repeated sequences for quotations. Third, we present a new user interface for navigating across and within works in the collection using quotation links. Our system has been run on a digital library of over 1 million books and has been used by thousands of people.
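
The abstract gives no algorithmic detail, so the following is only a toy illustration of the general idea behind the first two contributions (finding word sequences repeated across works, then ranking them), not the authors' method. It assumes fixed-length word n-grams, an in-memory index, and a ranking by the number of distinct works containing a sequence; the function names `tokenize`, `shared_ngrams`, and `rank_candidates` are hypothetical. A system at the scale described would need a distributed approach rather than a single dictionary.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes.
    This normalization is an assumption made for the sketch."""
    return re.findall(r"[a-z0-9']+", text.lower())


def shared_ngrams(docs, n=5):
    """Map each word n-gram to the set of document ids containing it,
    keeping only n-grams that occur in at least two different documents."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = tokenize(text)
        for i in range(len(words) - n + 1):
            index[tuple(words[i:i + n])].add(doc_id)
    return {gram: ids for gram, ids in index.items() if len(ids) >= 2}


def rank_candidates(shared):
    """Order candidates by how many distinct documents contain them
    (a crude stand-in for quotation popularity), breaking ties by the n-gram."""
    return sorted(shared.items(), key=lambda kv: (-len(kv[1]), kv[0]))


if __name__ == "__main__":
    docs = {
        "a": "It was the best of times, it was the worst of times",
        "b": "He wrote: it was the best of times indeed",
        "c": "nothing in common here at all really",
    }
    for gram, ids in rank_candidates(shared_ngrams(docs)):
        print(" ".join(gram), sorted(ids))
```

Each surviving n-gram identifies a passage position in every work that contains it, which is the kind of information a link from one work into the interior of another would need.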