Identifying and Merging Related Bibliographic Records

  • Authors:
  • J. A. Hylton

  • Venue:
  • Identifying and Merging Related Bibliographic Records
  • Year:
  • 1996

Abstract

Bibliographic records freely available on the Internet can be used to construct a high-quality, digital finding aid that provides the ability to discover paper and electronic documents. The key challenge to providing such a service is integrating mixed-quality bibliographic records, coming from multiple sources and in multiple formats. This thesis describes an algorithm that automatically identifies records that refer to the same work and clusters them together; the algorithm clusters records for which both author and title match. It tolerates errors and cataloging variations within the records by using a full-text search engine and an $n$-gram-based approximate string matching algorithm to build the clusters. The algorithm identifies more than 90 percent of the related records and includes incorrect records in less than 1 percent of the clusters. It has been used to construct a 250,000-record collection of the computer science literature. This thesis also presents preliminary work on automatic linking between bibliographic records and copies of documents available on the Internet.
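The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes character trigrams, a Dice coefficient as the similarity measure, and a hypothetical threshold of 0.8, and it compares every pair of records directly instead of using a full-text search engine to find candidates. The names `ngrams`, `similarity`, and `cluster_records` are made up for illustration.

```python
from itertools import combinations


def ngrams(text, n=3):
    """Return the set of character n-grams of a normalized string."""
    # Assumed normalization: lowercase, drop punctuation, collapse whitespace.
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    cleaned = " ".join(cleaned.split())
    if not cleaned:
        return set()
    # Pad so that the first and last characters contribute as many n-grams
    # as interior characters do.
    padded = " " * (n - 1) + cleaned + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b, n=3):
    """Dice coefficient over n-gram sets (an assumed measure, not the thesis's)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def cluster_records(records, threshold=0.8):
    """Group record indices whose author AND title both match approximately.

    `records` is a list of dicts with "author" and "title" keys.
    The threshold is a hypothetical value chosen for this sketch.
    """
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All-pairs comparison; a real system would first narrow the candidates,
    # for example with a search-engine query, as the abstract describes.
    for i, j in combinations(range(len(records)), 2):
        a, b = records[i], records[j]
        if (similarity(a["author"], b["author"]) >= threshold
                and similarity(a["title"], b["title"]) >= threshold):
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Requiring both fields to match mirrors the abstract's rule that records are clustered only when author and title agree. Because most n-grams survive a small typo or a punctuation difference, a record with a misspelled title can still join the cluster of its correctly spelled counterpart.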