Multi-level sequence alignment: a trade-off between speed and accuracy in similar text searching

  • Authors: Jong-kyu Seo, Hae-sung Tak, Hwan-gue Cho
  • Affiliations: Pusan National University, Korea (all authors)
  • Venue: Proceedings of the 8th International Conference on Ubiquitous Information Management and Communication
  • Year: 2014

Abstract

Fingerprinting algorithms and sequence alignment are widely used to calculate document similarity. Fingerprinting is simple and fast, but it cannot locate the specific regions that are similar. Sequence alignment identifies similar regions by arranging sequences of strings against each other; it can locate those regions, but it requires more computation time. Multi-level alignment (MLA) is a new method designed to exploit the advantages of both. MLA divides the input documents into blocks of uniform length, extracts a fingerprint from each block, and calculates the similarity of each block pair by comparing fingerprints, building a similarity table in the process. Finally, sequence alignment is used to identify the longest similar regions in the similarity table. MLA lets users change the block size to control the relative share of work done by fingerprinting and by sequence alignment. Because a document is divided into several blocks, a similar region can be fragmented across two or more blocks. To address this fragmentation problem, we propose a united block method, which integrates adjacent fragmented similar regions to increase the similarity value. Our experiments demonstrated that computing a document's similarity using the united block method was more accurate than the original MLA method, with minor reductions in time.
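
The pipeline described in the abstract (block splitting, per-block fingerprints, a similarity table, then alignment over that table) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names (split_blocks, fingerprint, jaccard, similarity_table, align_blocks) are hypothetical, the fingerprint is assumed to be a plain set of character k-grams compared by Jaccard similarity, and the alignment step is a Smith-Waterman-style local alignment chosen for illustration, with arbitrary threshold and gap values. The united block method is not sketched because the abstract does not specify how it works.

```python
from typing import List, Set, Tuple


def split_blocks(text: str, block_size: int) -> List[str]:
    """Divide a document into uniform-length blocks (the last may be shorter)."""
    return [text[i:i + block_size] for i in range(0, len(text), block_size)]


def fingerprint(block: str, k: int = 3) -> Set[str]:
    """Assumed fingerprint: the set of character k-grams in the block.
    A production fingerprint would typically hash and subsample these."""
    return {block[i:i + k] for i in range(max(1, len(block) - k + 1))}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Similarity of two fingerprints as |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_table(doc_a: str, doc_b: str, block_size: int,
                     k: int = 3) -> List[List[float]]:
    """table[i][j] is the similarity of block i of doc_a and block j of doc_b."""
    fps_a = [fingerprint(b, k) for b in split_blocks(doc_a, block_size)]
    fps_b = [fingerprint(b, k) for b in split_blocks(doc_b, block_size)]
    return [[jaccard(x, y) for y in fps_b] for x in fps_a]


def align_blocks(table: List[List[float]], threshold: float = 0.5,
                 gap: float = 0.5) -> Tuple[float, Tuple[int, int]]:
    """Smith-Waterman-style local alignment over the similarity table.
    Block pairs above `threshold` add to the score, pairs below it subtract.
    Returns the best score and the 1-based (row, column) where it ends."""
    rows = len(table)
    cols = len(table[0]) if rows else 0
    score = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            match = table[i - 1][j - 1] - threshold
            score[i][j] = max(0.0,
                              score[i - 1][j - 1] + match,  # pair two blocks
                              score[i - 1][j] - gap,        # skip a block of doc_a
                              score[i][j - 1] - gap)        # skip a block of doc_b
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    return best, best_pos


shared = "abcdefghijklmnopqrstuvwx"      # 24 characters common to both documents
doc_a = shared + "0123456789012345"
doc_b = shared + "ZYXWVUTSRQPONMLK"
table = similarity_table(doc_a, doc_b, block_size=8)
best_score, end_position = align_blocks(table)
```

In this toy input the shared text happens to fall exactly on block boundaries, so the first three diagonal entries of the table are 1.0 and the alignment ends at block pair (3, 3). A shared region that straddles block boundaries would produce lower per-block similarities, which is the fragmentation problem the abstract refers to. A smaller block_size shifts more of the work to the alignment step, while a larger one shifts it to fingerprint comparison.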