Collection statistics for fast duplicate document detection

Authors:
Abdur Chowdhury;Ophir Frieder;David Grossman;Mary Catherine McCabe
Affiliations:
Illinois Institute of Technology, Chicago, IL;Illinois Institute of Technology, Chicago, IL;Illinois Institute of Technology, Chicago, IL;Illinois Institute of Technology, Chicago, IL
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2002

Citing 9
Cited 83

Discrimination of authorship using visualization

Information Processing and Management: an International Journal
Copy detection mechanisms for digital documents

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Pivoted document length normalization

SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
A new method of weighting query terms for ad-hoc retrieval

SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
A vector space model for automatic indexing

Communications of the ACM
Modern Information Retrieval

Modern Information Retrieval
Finding Near-Replicas of Documents and Servers on the Web

WebDB '98 Selected papers from the International Workshop on The World Wide Web and Databases
The Smart/Empire TIPSTER IR system

TIPSTER '98 Proceedings of a workshop held at Baltimore, Maryland: October 13-15, 1998

Methods for identifying versioned and plagiarized documents

Journal of the American Society for Information Science and Technology
Online duplicate document detection: signature reliability in a dynamic retrieval environment

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Extracting unstructured data from template generated web documents

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Constructing a text corpus for inexact duplicate detection

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Improved robustness of signature-based near-replica detection via lexicon randomization

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Duplicate detection in click streams

WWW '05 Proceedings of the 14th international conference on World Wide Web
Near-duplicate detection for eRulemaking

dg.o '05 Proceedings of the 2005 national conference on Digital government research
Analysis of source identified text corpora: exploring the statistics of the reused text and authorship

ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
Local sparsity control for naive Bayes with extreme misclassification costs

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Redundant documents and search effectiveness

Proceedings of the 14th ACM international conference on Information and knowledge management
Phishing Webpage Detection

ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
Managing déjà vu: Collection building for the identification of nonidentical duplicate documents

Journal of the American Society for Information Science and Technology - Research Articles
The methodology and an application to fight against Unicode attacks

SOUPS '06 Proceedings of the second symposium on Usable privacy and security
Next steps in near-duplicate detection for eRulemaking

dg.o '06 Proceedings of the 2006 international conference on Digital government research
Dynamic test collections: measuring search effectiveness on the live web

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Near-duplicate detection by instance-level constrained clustering

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Detecting Phishing Web Pages with Visual Similarity Assessment Based on Earth Mover's Distance (EMD)

IEEE Transactions on Dependable and Secure Computing
Accurate discovery of co-derivative documents via duplicate text detection

Information Systems
Detecting near-duplicates for web crawling

Proceedings of the 16th international conference on World Wide Web
A cost-effective method for detecting web site replicas on search engine databases

Data & Knowledge Engineering
Result merging methods in distributed information retrieval with overlapping databases

Information Retrieval
Distributed text retrieval from overlapping collections

ADC '07 Proceedings of the eighteenth conference on Australasian database - Volume 63
Essential deduplication functions for transactional databases in law firms

Proceedings of the 11th international conference on Artificial intelligence and law
Identifying synonymous concepts in preparation for technology mining

Journal of Information Science
Efficient similarity joins for near duplicate detection

Proceedings of the 17th international conference on World Wide Web
Generating links by mining quotations

Proceedings of the nineteenth ACM conference on Hypertext and hypermedia
SpotSigs: robust and efficient near duplicate detection in large web collections

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Local text reuse detection

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Improving web information indexing and retrieval based on center block duplication detection

International Journal of Innovative Computing and Applications
Lexicon randomization for near-duplicate detection with I-Match

The Journal of Supercomputing
Trusting spam reporters: A reporter-based reputation system for email filtering

ACM Transactions on Information Systems (TOIS)
Achieving both high precision and high recall in near-duplicate detection

Proceedings of the 17th ACM conference on Information and knowledge management
Efficient overlap and content reuse detection in blogs and online news articles

Proceedings of the 18th international conference on World wide web
Leveraging discarded samples for tighter estimation of multiple-set aggregates

Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Applying syntactic similarity algorithms for enterprise information management

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Frequent Itemset Mining for Clustering Near Duplicate Web Documents

ICCS '09 Proceedings of the 17th International Conference on Conceptual Structures: Conceptual Structures: Leveraging Semantic Technologies
Near-duplicate detection for web-forums

IDEAS '09 Proceedings of the 2009 International Database Engineering & Applications Symposium
Coordinated weighted sampling for estimating aggregates over multiple weight assignments

Proceedings of the VLDB Endowment
Exploiting Sentence-Level Features for Near-Duplicate Document Detection

AIRS '09 Proceedings of the 5th Asia Information Retrieval Symposium on Information Retrieval Technology
Tuning the capacity of search engines: Load-driven routing and incremental caching to reduce and balance the load

ACM Transactions on Information Systems (TOIS)
Fast approximate duplicate detection for 2D-NMR spectra

DILS'07 Proceedings of the 4th international conference on Data integration in the life sciences
A pattern tree-based approach to learning URL normalization rules

Proceedings of the 19th international conference on World wide web
Differences and identities in document retrieval in an annotation environment

DNIS'07 Proceedings of the 5th international conference on Databases in networked information systems
Navigating among search results: an information content approach

WISE'07 Proceedings of the 8th international conference on Web information systems engineering
Mining Query Logs: Turning Search Usage Data into Knowledge

Foundations and Trends in Information Retrieval
Adaptive near-duplicate detection via similarity learning

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Efficient partial-duplicate detection based on sequence matching

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Bimodal content defined chunking for backup streams

FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Facilitating interaction and retrieval for annotated documents

International Journal of Computational Science and Engineering
Detecting duplicate web documents using clickthrough data

Proceedings of the fourth ACM international conference on Web search and data mining
Detecting near-duplicate relations in user generated forum content

OTM'10 Proceedings of the 2010 international conference on On the move to meaningful internet systems
Efficient exact edit similarity query processing with the asymmetric signature scheme

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Efficient similarity joins for near-duplicate detection

ACM Transactions on Database Systems (TODS)
Query by document via a decomposition-based two-level retrieval approach

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
No free lunch: brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
SizeSpotSigs: an effective deduplicate algorithm considering the size of page content

PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part I
Partial duplicate detection for large book collections

Proceedings of the 20th ACM international conference on Information and knowledge management
Content-driven detection of campaigns in social media

Proceedings of the 20th ACM international conference on Information and knowledge management
Probabilistic near-duplicate detection using simhash

Proceedings of the 20th ACM international conference on Information and knowledge management
CoDet: sentence-based containment detection in news corpora

Proceedings of the 20th ACM international conference on Information and knowledge management
Text mining and probabilistic language modeling for online review spam detection

ACM Transactions on Management Information Systems (TMIS)
A systematic study of parameter correlations in large scale duplicate document detection

PAKDD'06 Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Enhancing duplicate collection detection through replica boundary discovery

PAKDD'06 Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Compact features for detection of near-duplicates in distributed retrieval

SPIRE'06 Proceedings of the 13th international conference on String Processing and Information Retrieval
The case of the duplicate documents measurement, search, and science

APWeb'06 Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development
Temporal shingling for version identification in web archives

ECIR'2010 Proceedings of the 32nd European conference on Advances in Information Retrieval
Near-Duplicate mail detection based on URL information for spam filtering

ICOIN'06 Proceedings of the 2006 international conference on Information Networking: advances in Data Communications and Wireless Networks
Keeping Found Things Found: The Study and Practice of Personal Information Management

Keeping Found Things Found: The Study and Practice of Personal Information Management
A fusion of algorithms in near duplicate document detection

PAKDD'11 Proceedings of the 15th international conference on New Frontiers in Applied Data Mining
Clustering and load balancing optimization for redundant content removal

Proceedings of the 21st international conference companion on World Wide Web
Towards "intelligent compression" in streams: a biased reservoir sampling based Bloom filter approach

Proceedings of the 15th International Conference on Extending Database Technology
A system for the proactive, continuous, and efficient collection of digital forensic evidence

Digital Investigation: The International Journal of Digital Forensics & Incident Response
Learning hash codes for efficient content reuse detection

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Cross-Language high similarity search using a conceptual thesaurus

CLEF'12 Proceedings of the Third international conference on Information Access Evaluation: multilinguality, multimodality, and visual analytics
A webpage deletion algorithm based on hierarchical filtering

WISM'12 Proceedings of the 2012 international conference on Web Information Systems and Mining
Detecting near-duplicate documents using sentence-level features and supervised learning

Expert Systems with Applications: An International Journal
Optimizing parallel algorithms for all pairs similarity search

Proceedings of the sixth ACM international conference on Web search and data mining
Reducing information redundancy in search results

Proceedings of the 28th Annual ACM Symposium on Applied Computing
Cache-conscious performance optimization for similarity search

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Near duplicate detection in an academic digital library

Proceedings of the 2013 ACM symposium on Document engineering
Asymmetric signature schemes for efficient exact edit similarity query processing

ACM Transactions on Database Systems (TODS)
Streaming quotient filter: a near optimal approximate duplicate detection approach for data streams

Proceedings of the VLDB Endowment
Campaign extraction from social media

ACM Transactions on Intelligent Systems and Technology (TIST) - Special Section on Intelligent Mobile Knowledge Discovery and Management Systems and Special Issue on Social Web Mining

Abstract

We present a new algorithm for duplicate document detection that uses collection statistics. We compare our approach with the state-of-the-art approach using multiple collections. These collections include a 30 MB, 18,577-document web collection developed by Excite@Home and three NIST collections. The first NIST collection consists of 18,232 LA-Times documents totaling 100 MB, which is roughly similar in the number of documents to the Excite@Home collection. The other two collections are both 2 GB: the 247,491-document web collection and TREC disks 4 and 5, a 528,023-document collection. We show that our approach, called I-Match, scales in terms of the number of documents and works well for documents of all sizes. We compared our solution to the state of the art and found that, in addition to improved accuracy of detection, our approach executed in roughly one-fifth the time.
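
The abstract does not spell out how collection statistics enter the detection step, so the following is only a minimal sketch of one way such a scheme could work, not the authors' implementation. It assumes whitespace tokenization, inverse document frequency (idf) as the collection statistic, a single hypothetical cutoff `min_idf`, and a SHA-1 digest over the surviving terms; the function names are illustrative.

```python
import hashlib
import math
from collections import defaultdict


def idf_table(docs):
    """Compute inverse document frequency for every term in the collection."""
    n = len(docs)
    df = defaultdict(int)
    for text in docs:
        for term in set(text.lower().split()):
            df[term] += 1
    return {term: math.log(n / count) for term, count in df.items()}


def signature(text, idf, min_idf):
    """Hash the sorted set of terms whose idf reaches min_idf (assumed cutoff)."""
    kept = sorted({t for t in text.lower().split() if idf.get(t, 0.0) >= min_idf})
    if not kept:
        return None  # no distinctive term survived the filter
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def duplicate_groups(docs, min_idf=0.5):
    """Group document indices that share the same signature."""
    idf = idf_table(docs)
    buckets = defaultdict(list)
    for i, text in enumerate(docs):
        sig = signature(text, idf, min_idf)
        if sig is not None:
            buckets[sig].append(i)
    return [ids for ids in buckets.values() if len(ids) > 1]


docs = [
    "the quick brown fox jumps",
    "the fox jumps quick brown the",
    "the lazy dog sleeps",
    "a cat sleeps the day",
]
groups = duplicate_groups(docs)
```

In this sketch each document is reduced to one hash, so duplicates are found by bucket lookup instead of pairwise comparison. Word order and very common terms (here "the", which occurs in every document) do not affect the signature.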