Fingerprinting algorithms and sequence alignment are widely used to calculate the similarity of documents. Fingerprinting is simple and fast, but it cannot locate the specific regions that are similar. String alignment identifies similar regions by arranging sequences of strings; it can locate specific similar regions, but it requires more computational time. Multi-level alignment (MLA) is a new method designed to exploit the advantages of both. MLA divides the input documents into uniform-length blocks, extracts a fingerprint from each block, and calculates the similarity of each block pair by comparing fingerprints. A similarity table is created during this process. Finally, sequence alignment is used to identify the longest similar regions in the similarity table. MLA allows users to change the block size to control the relative proportion of work done by the fingerprinting algorithm and by sequence alignment. Because a document is divided into several blocks, a similar region may itself be fragmented across two or more blocks. To address this fragmentation problem, we propose a united block method, which integrates adjacent fragmented similar regions to increase the similarity value. Our experiments demonstrated that computing a document's similarity using the united block method was more accurate than the original MLA method, with minor reductions in time.
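The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the fingerprint function, the similarity measure, the alignment scoring, or how blocks are united. The sketch assumes hashed character k-grams as fingerprints, Jaccard similarity for block pairs, a Smith-Waterman-style local alignment over the similarity table, and, for the united block idea, a containment score of one block against two adjacent blocks joined together. All function names and parameters (`block_size`, `k`, `threshold`, `gap`) are hypothetical.

```python
import zlib


def split_blocks(text, block_size):
    """Divide a document into uniform-length blocks (the last may be shorter)."""
    return [text[i:i + block_size] for i in range(0, len(text), block_size)]


def fingerprint(block, k=4):
    """Assumed fingerprint: the set of CRC32 hashes of character k-grams."""
    if len(block) < k:
        return {zlib.crc32(block.encode("utf-8"))}
    return {zlib.crc32(block[i:i + k].encode("utf-8"))
            for i in range(len(block) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two fingerprint sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_table(doc_a, doc_b, block_size, k=4):
    """Block-pair similarity table: rows are blocks of doc_a, columns of doc_b."""
    fps_a = [fingerprint(b, k) for b in split_blocks(doc_a, block_size)]
    fps_b = [fingerprint(b, k) for b in split_blocks(doc_b, block_size)]
    return [[jaccard(x, y) for y in fps_b] for x in fps_a]


def united_table(doc_a, doc_b, block_size, k=4):
    """Assumed united-block variant: score each block of doc_a by how much of
    its fingerprint is contained in block j of doc_b joined with block j + 1."""
    blocks_a = split_blocks(doc_a, block_size)
    blocks_b = split_blocks(doc_b, block_size)
    table = []
    for a in blocks_a:
        fp_a = fingerprint(a, k)
        row = []
        for j in range(len(blocks_b)):
            nxt = blocks_b[j + 1] if j + 1 < len(blocks_b) else ""
            fp_u = fingerprint(blocks_b[j] + nxt, k)
            row.append(len(fp_a & fp_u) / len(fp_a))
        table.append(row)
    return table


def best_aligned_region(table, threshold=0.5, gap=0.5):
    """Local alignment over the table. A block pair contributes
    (similarity - threshold); skipping a block costs `gap`.
    Returns (best score, (row, column)) with 1-based end indices."""
    rows = len(table)
    cols = len(table[0]) if rows else 0
    score = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    best, best_end = 0.0, (0, 0)
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            s = max(0.0,
                    score[i - 1][j - 1] + table[i - 1][j - 1] - threshold,
                    score[i - 1][j] - gap,
                    score[i][j - 1] - gap)
            score[i][j] = s
            if s > best:
                best, best_end = s, (i, j)
    return best, best_end
```

With this sketch, a larger `block_size` shifts work toward fingerprinting (fewer table cells to align), while a smaller one shifts it toward alignment. When a copied passage is offset by a few characters, it straddles a block boundary in the other document; the plain table then shows two partial matches, whereas the united table can recover a full match for the straddling block. Note that the two tables use different measures here (Jaccard versus containment), so their values are only loosely comparable.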