Multi-level sequence alignment: a trade-off between speed and accuracy in similar text searching

Authors:
Jong-kyu Seo;Hae-sung Tak;Hwan-gue Cho
Affiliations:
Pusan National University, Korea;Pusan National University, Korea;Pusan National University, Korea
Venue:
Proceedings of the 8th International Conference on Ubiquitous Information Management and Communication
Year:
2014

Abstract

Fingerprinting algorithms and sequence alignment are both widely used to calculate the similarity of documents. Fingerprinting is simple and fast, but it cannot locate specific similar regions. Sequence alignment identifies similar regions by aligning sequences of strings; it can locate specific similar regions, but it requires more computation time. Multi-level alignment (MLA) is a new method designed to exploit the advantages of both. MLA divides the input documents into blocks of uniform length, extracts a fingerprint from each block, and calculates the similarity of each block pair by comparing fingerprints, which produces a similarity table. Sequence alignment is then used to identify the longest similar regions in the similarity table. MLA lets users change the block size to control the relative proportion of work done by the fingerprinting algorithm and by sequence alignment. Because a document is divided into several blocks, a similar region can be fragmented across two or more blocks. To address this fragmentation problem, we propose a united block method, which integrates adjacent fragmented similar regions to increase the similarity value. Our experiments demonstrated that computing a document's similarity using the united block method was more accurate than with the original MLA method, with minor reductions in time.
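
The pipeline described in the abstract (uniform blocks, per-block fingerprints, a similarity table, then alignment over that table) can be illustrated with a minimal sketch. The abstract does not specify the fingerprint scheme, the block similarity measure, or the alignment scoring, so everything concrete here is an assumption made for illustration: fingerprints are sets of hashed character k-grams, block similarity is the Jaccard index, and the alignment is a Smith-Waterman-style local alignment over block indices. The function names and the parameters `k`, `threshold`, and `gap` are hypothetical, and the united block method is not covered.

```python
import hashlib


def split_blocks(text, block_size):
    """Cut text into uniform-length blocks (the last one may be shorter)."""
    return [text[i:i + block_size] for i in range(0, len(text), block_size)]


def fingerprint(block, k=4):
    """Assumed fingerprint: the set of hashed character k-grams of a block."""
    grams = {block[i:i + k] for i in range(max(1, len(block) - k + 1))}
    return {hashlib.md5(g.encode("utf-8")).hexdigest()[:8] for g in grams}


def block_similarity(fp_a, fp_b):
    """Assumed block score: Jaccard similarity of two fingerprint sets."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def similarity_table(doc_a, doc_b, block_size):
    """Table whose cell [i][j] scores block i of doc_a against block j of doc_b."""
    fps_a = [fingerprint(b) for b in split_blocks(doc_a, block_size)]
    fps_b = [fingerprint(b) for b in split_blocks(doc_b, block_size)]
    return [[block_similarity(fa, fb) for fb in fps_b] for fa in fps_a]


def align_blocks(table, threshold=0.5, gap=0.5):
    """Smith-Waterman-style local alignment over block indices (an assumption).

    A matched block pair contributes (similarity - threshold), so pairs below
    the threshold lower the score. Returns the best score and the 1-based
    (row, column) cell where the best-scoring run of blocks ends.
    """
    rows = len(table)
    cols = len(table[0]) if rows else 0
    score = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    best, best_end = 0.0, (0, 0)
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            s = max(
                0.0,
                score[i - 1][j - 1] + table[i - 1][j - 1] - threshold,
                score[i - 1][j] - gap,   # skip a block of doc_a
                score[i][j - 1] - gap,   # skip a block of doc_b
            )
            score[i][j] = s
            if s > best:
                best, best_end = s, (i, j)
    return best, best_end
```

In this sketch, `block_size` plays the role the abstract assigns to the block size: larger blocks shrink the table and shift work toward fingerprinting, while smaller blocks enlarge the table and shift work toward alignment. A call such as `align_blocks(similarity_table(doc_a, doc_b, 50))` would return the score and end position of the best-matching run of blocks under these assumed scoring rules.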