Managing large collections of replicated data in centralized or distributed environments is important for many systems that provide data mining, mirroring, storage, and content distribution. In the simplest case, documents are generated, duplicated, and updated through emails and web pages. Although redundancy can increase reliability to some degree, uncontrolled redundancy degrades retrieval performance and may be useless if the returned documents are obsolete. Document similarity matching algorithms do not report how documents differ, and file synchronization algorithms are usually inefficient and ignore the structural and syntactic organization of documents. In this paper, we propose the S2S matching approach, which compares documents in two phases, one structural and one syntactic. First, in the structural phase, documents are decomposed into components according to their syntax and compared at a coarse level. The structural mapping processes the decomposed documents based on their syntax, without mapping at the word level, and can be applied hierarchically, following the structural organization of a document. Second, in the syntactic phase, a heuristic look-ahead algorithm matches consecutive tokens and applies a verification patch. Our two-phase S2S matching approach provides faster results than currently available string matching algorithms.
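The two-phase idea can be illustrated with a minimal sketch. This is not the S2S algorithm itself, whose details the abstract does not give. It assumes that components are blank-line-separated paragraphs, that the coarse comparison is done with whitespace-normalized SHA-1 fingerprints, and that the token phase is a simple greedy look-ahead with a fixed window. The function names (split_components, fingerprint, lookahead_diff, two_phase_diff) and the window parameter are hypothetical, and the verification patch step is not reproduced.

```python
import hashlib


def split_components(text):
    """Structural decomposition (assumed): blank-line-separated paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def fingerprint(component):
    """Coarse identity of a component; whitespace is normalized first."""
    normalized = " ".join(component.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def lookahead_diff(a, b, window=4):
    """Greedy token diff: on a mismatch, look ahead up to `window` tokens
    (combined over both sequences) for the nearest point where they agree."""
    ops = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            ops.append(("keep", a[i]))
            i += 1
            j += 1
            continue
        best = None
        for d in range(1, window + 1):
            for di in range(d + 1):
                dj = d - di
                if i + di < len(a) and j + dj < len(b) and a[i + di] == b[j + dj]:
                    best = (di, dj)
                    break
            if best is not None:
                break
        if best is None:
            # No resync point nearby: treat the pair as a substitution.
            ops.append(("delete", a[i]))
            ops.append(("insert", b[j]))
            i += 1
            j += 1
        else:
            di, dj = best
            ops.extend(("delete", t) for t in a[i:i + di])
            ops.extend(("insert", t) for t in b[j:j + dj])
            i += di
            j += dj
    ops.extend(("delete", t) for t in a[i:])
    ops.extend(("insert", t) for t in b[j:])
    return ops


def two_phase_diff(old_text, new_text, window=4):
    """Phase 1 drops components present in both documents;
    phase 2 diffs the tokens of the remaining components."""
    old_parts = split_components(old_text)
    new_parts = split_components(new_text)
    old_keys = {fingerprint(p) for p in old_parts}
    new_keys = {fingerprint(p) for p in new_parts}
    old_rest = [p for p in old_parts if fingerprint(p) not in new_keys]
    new_rest = [p for p in new_parts if fingerprint(p) not in old_keys]
    old_tokens = " ".join(old_rest).split()
    new_tokens = " ".join(new_rest).split()
    return lookahead_diff(old_tokens, new_tokens, window)
```

Because identical components are discarded before any token is examined, the token-level work in this sketch is limited to the parts of the documents that actually changed. A greedy look-ahead of this kind does not guarantee a minimal edit script, unlike a longest-common-subsequence diff.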