S2S: structural-to-syntactic matching similar documents

Authors:
Ramazan S. Aygün
Affiliations:
University of Alabama in Huntsville, Computer Science Department, 35899, Huntsville, AL, USA
Venue:
Knowledge and Information Systems
Year:
2008

Abstract

Managing large collections of replicated data in centralized or distributed environments is important for many systems that provide data mining, mirroring, storage, and content distribution. In the simplest case, documents are generated, duplicated, and updated through emails and web pages. Although redundancy can increase reliability to some degree, uncontrolled redundancy degrades retrieval performance and may be useless if the returned documents are obsolete. Document similarity matching algorithms do not report how documents differ, while file synchronization algorithms are usually inefficient and ignore the structural and syntactic organization of documents. In this paper, we propose the S2S matching approach, which compares documents in two phases: structural and syntactic. First, in the structural phase, documents are decomposed into components according to their syntax and compared at a coarse level. The structural mapping processes the decomposed documents based on their syntax without actually mapping at the word level, and it can be applied hierarchically, following the structural organization of a document. Second, the syntactic matching phase uses a heuristic look-ahead algorithm with a verification patch to match consecutive tokens. Our two-phase S2S matching approach provides faster results than currently available string matching algorithms.
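The two-phase idea described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm. Every name here (`split_components`, `structural_match`, `lookahead_match`, the `window` parameter) is hypothetical, and the sketch makes several assumptions: components are taken to be blank-line-separated paragraphs, the coarse comparison uses a SHA-1 digest, and the look-ahead is a simple bounded resynchronization search. The verification patch mentioned in the abstract is not reproduced.

```python
import hashlib


def split_components(doc):
    """Structural decomposition (assumption: paragraphs separated by blank lines)."""
    return [p.strip() for p in doc.split("\n\n") if p.strip()]


def digest(text):
    """Fingerprint a component so it can be compared without looking at words."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def structural_match(old_parts, new_parts):
    """Coarse phase: pair identical components by digest only."""
    index = {}
    for i, part in enumerate(old_parts):
        index.setdefault(digest(part), []).append(i)
    pairs = []
    for j, part in enumerate(new_parts):
        candidates = index.get(digest(part))
        if candidates:
            pairs.append((candidates.pop(0), j))
    return pairs


def lookahead_match(a, b, window=4):
    """Fine phase: greedy token alignment with a bounded look-ahead.

    On a mismatch, search up to `window` tokens ahead in either sequence
    for the nearest position where the two sequences agree again.
    """
    i = j = 0
    matches = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            matches.append((i, j))
            i += 1
            j += 1
            continue
        best = None
        for skip in range(1, window + 1):       # total tokens skipped
            for da in range(skip + 1):          # tokens skipped in a
                db = skip - da                  # tokens skipped in b
                if i + da < len(a) and j + db < len(b) and a[i + da] == b[j + db]:
                    best = (da, db)
                    break
            if best:
                break
        if best is None:
            # No resynchronization point in the window: treat as a substitution.
            i += 1
            j += 1
        else:
            i += best[0]
            j += best[1]
    return matches


old = "Intro text.\n\nthe quick brown fox\n\nEnd."
new = "Intro text.\n\nthe quick red brown fox\n\nEnd."

old_parts, new_parts = split_components(old), split_components(new)
pairs = structural_match(old_parts, new_parts)
token_matches = lookahead_match(old_parts[1].split(), new_parts[1].split())
```

In this example the coarse phase pairs the unchanged first and last paragraphs without any word-level work, so only the middle paragraph reaches the token-level pass. That division of labor is the intuition behind comparing at the structural level first; how the actual S2S method decomposes documents and verifies token matches is described in the paper itself.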