This paper analyses the efficiency of different data structures for detecting overlap in digital documents. Most existing approaches apply a hash function to chunks to reduce the space required by their chunk indices. Because a hash function can produce the same value for different chunks, false matches are possible. We propose an algorithm that eliminates these false matches. The algorithm relies on a suffix tree, which is space-consuming. To reduce space, we define a modified suffix tree that considers only chunks starting at the beginning of words, and we show how the algorithm works on this structure. Alternatively, the space requirements of a suffix tree can be reduced by converting it to a directed acyclic graph. We show that suffix link information can be preserved in this new structure and that the matching statistics algorithm still works with the modifications we propose.
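As a rough illustration of what word-aligned matching statistics compute, the sketch below finds, for each word position in a query document, the length in words of the longest run that also occurs in a source document. It is not the paper's algorithm: a naive direct comparison stands in for the suffix tree and its suffix links, and the function name `word_matching_statistics`, the whitespace tokenisation, and the lowercasing are all assumptions made for this example. Because it compares the actual words rather than hash values, it cannot report the false matches that a hashed chunk index can.

```python
def word_matching_statistics(source: str, query: str) -> list[int]:
    """For each word position i in `query`, return the length (in words) of
    the longest run starting at i that also occurs somewhere in `source`.

    Naive stand-in for a suffix-tree-based computation (hypothetical helper):
    only positions at word boundaries are considered, mirroring the idea of
    indexing chunks that start at the beginning of words.
    """
    src = source.lower().split()
    qry = query.lower().split()

    # Map each source word to the positions where it occurs, so that only
    # plausible starting points are compared.
    starts: dict[str, list[int]] = {}
    for pos, word in enumerate(src):
        starts.setdefault(word, []).append(pos)

    stats = []
    for i, word in enumerate(qry):
        best = 0
        for j in starts.get(word, []):
            # Extend the match word by word using exact comparison.
            k = 0
            while i + k < len(qry) and j + k < len(src) and qry[i + k] == src[j + k]:
                k += 1
            best = max(best, k)
        stats.append(best)
    return stats


if __name__ == "__main__":
    source_doc = "the quick brown fox jumps over the lazy dog"
    query_doc = "a quick brown cat jumps over the dog"
    print(word_matching_statistics(source_doc, query_doc))
```

Long runs in the result (here, the three-word run starting at "jumps") mark candidate overlaps. A suffix tree with suffix links yields the same values without rescanning the source for every query position, which is what makes that approach practical for large collections.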