Essential deduplication functions for transactional databases in law firms

  • Authors:
  • Jack G. Conrad
  • Edward L. Raymond

  • Affiliations:
  • Research & Development, St. Paul, Minnesota
  • Content Operations, Thomson-West, Rochester, New York

  • Venue:
  • Proceedings of the 11th International Conference on Artificial Intelligence and Law
  • Year:
  • 2007


Abstract

As massive document repositories and knowledge management systems continue to expand, in proprietary environments as well as on the Web, duplicate detection becomes increasingly important. In business enterprises such as law firms, effective retrieval applications depend on such functionality. Today's Internet-savvy users are not interested in search results containing numerous sets of duplicate documents, whether exact duplicates or near variants. This report describes our work in the domain of legal information retrieval, using a large, transactional knowledge management system. We specifically examine the occurrence and treatment of identical, near-identical, and fuzzy duplicate sub-documents ('clauses') in a contracts database. To our knowledge, we are the first to use principled methods to construct a test collection of transactional documents for such research purposes, one that identifies a variety of duplicate types and is deployed to establish baseline algorithmic approaches to deduplication. We then investigate the application of digital signature techniques to characterize and compare similar clauses in order to identify duplicates and near duplicates. This approach establishes a baseline using methods and algorithms first developed in a parallel domain. It produces a set of promising results following an extensive assessment phase involving direct comparisons with gold training and test data created by expert attorneys working in the transactional domain.
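The abstract mentions digital signature techniques for comparing clauses but does not spell out an algorithm. A common approach in this family is to hash word-level shingles of each clause and compare signatures by Jaccard resemblance; the sketch below illustrates that general idea and is not necessarily the authors' exact method. All function names, the shingle size `k`, and the sample clauses are illustrative assumptions.

```python
import hashlib

def shingle_signature(text, k=4):
    """Hash each k-word shingle of the lowercased clause into a 32-bit
    digest; the resulting set of digests acts as the clause's signature.
    (Illustrative sketch, not the paper's exact signature scheme.)"""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + k])
                for i in range(max(1, len(words) - k + 1))}
    return {int(hashlib.md5(s.encode()).hexdigest()[:8], 16)
            for s in shingles}

def resemblance(sig_a, sig_b):
    """Jaccard resemblance between two signatures:
    1.0 for identical shingle sets, near 0.0 for unrelated text."""
    if not sig_a and not sig_b:
        return 1.0
    return len(sig_a & sig_b) / len(sig_a | sig_b)

# Hypothetical contract clauses: a near-duplicate pair and an unrelated one.
clause_a = "The Seller shall deliver the goods to the Buyer at the designated facility."
clause_b = "The Seller shall deliver the goods to the Buyer at the specified facility."
clause_c = "This Agreement shall be governed by the laws of the State of New York."

sig_a, sig_b, sig_c = (shingle_signature(c) for c in (clause_a, clause_b, clause_c))
print(resemblance(sig_a, sig_b))  # high score: near-duplicate clauses
print(resemblance(sig_a, sig_c))  # low score: unrelated clauses
```

In practice a resemblance threshold (chosen against gold data such as the attorney-labeled collection described above) would separate exact, near, and non-duplicate clause pairs.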