Efficient privacy-preserving similar document detection

Authors:
Mummoorthy Murugesan;Wei Jiang;Chris Clifton;Luo Si;Jaideep Vaidya
Affiliations:
Department of Computer Science, Purdue University, West Lafayette, USA 47907;Department of Computer Science, Missouri University of Science and Technology, Rolla, USA 65409;Department of Computer Science, Purdue University, West Lafayette, USA 47907;Department of Computer Science, Purdue University, West Lafayette, USA 47907;MSIS Department, Rutgers University, Newark, USA 07102
Venue:
The VLDB Journal — The International Journal on Very Large Data Bases
Year:
2010


Abstract

Similar document detection plays an important role in many applications, such as file management, copyright protection, plagiarism prevention, and duplicate submission detection. State-of-the-art protocols assume that the contents of files stored on a server (or multiple servers) are directly accessible. This assumption makes them unsuitable for environments where the documents themselves are sensitive and cannot be openly read, and it rules out practical applications such as detecting plagiarized documents between two conferences, where submissions are confidential. We propose novel protocols to detect similar documents between two entities whose documents cannot be openly shared with each other. The similarity measure can be a simple cosine similarity computed on entire documents or on document fragments, the latter enabling detection of partial copying. We conduct extensive experiments to show the practical value of the proposed protocols. Although the proposed base protocols are much more efficient than solutions based on general secure multiparty computation, they are still slow for large document sets. We therefore investigate a clustering-based approach that significantly reduces the running time and achieves over 90% accuracy in our experiments. This makes secure similar document detection both practical and feasible.
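
To make the similarity measure concrete, the sketch below computes cosine similarity on whole documents and on document fragments in the clear. It is not the secure protocol from the paper: both texts are visible to one party here, and nothing is encrypted. The function names, the whitespace tokenization, the use of raw term counts, and the fixed-size fragment scheme are all assumptions made for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Map a text to raw term counts (assumed: lowercase, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def fragments(text, size):
    """Split a text into consecutive fragments of `size` tokens (hypothetical scheme)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def max_fragment_similarity(doc_a, doc_b, size):
    """Highest cosine similarity over all pairs of fragments from the two documents."""
    vecs_a = [term_vector(f) for f in fragments(doc_a, size)]
    vecs_b = [term_vector(f) for f in fragments(doc_b, size)]
    return max(
        (cosine_similarity(x, y) for x in vecs_a for y in vecs_b),
        default=0.0,
    )


doc_a = "one two three four five six seven eight"
doc_b = "x y z w five six seven eight"
whole = cosine_similarity(term_vector(doc_a), term_vector(doc_b))
partial = max_fragment_similarity(doc_a, doc_b, size=4)
```

In this toy input, half of `doc_b` is copied from `doc_a`. The whole-document score is diluted by the unrelated half, while the fragment-level score picks out the copied passage, which is why fragment comparison helps with partial copying.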