Collection statistics for fast duplicate document detection

  • Authors:
  • Abdur Chowdhury;Ophir Frieder;David Grossman;Mary Catherine McCabe

  • Affiliations:
  • Illinois Institute of Technology, Chicago, IL;Illinois Institute of Technology, Chicago, IL;Illinois Institute of Technology, Chicago, IL;Illinois Institute of Technology, Chicago, IL

  • Venue:
  • ACM Transactions on Information Systems (TOIS)
  • Year:
  • 2002

Quantified Score

Hi-index 0.00

Visualization

Abstract

We present a new algorithm for duplicate document detection thatuses collection statistics. We compare our approach with thestate-of-the-art approach using multiple collections. Thesecollections include a 30 MB 18,577 web document collectiondeveloped by Excite@Home and three NIST collections. The first NISTcollection consists of 100 MB 18,232 LA-Times documents, which isroughly similar in the number of documents to theExcite&at;Home collection. The other two collections are both 2GB and are the 247,491-web document collection and the TREC disks 4and 5---528,023 document collection. We show that our approachcalled I-Match, scales in terms of the number of documents andworks well for documents of all sizes. We compared our solution tothe state of the art and found that in addition to improvedaccuracy of detection, our approach executed in roughly one-fifththe time.