Webpage Duplicate Detection Using Combined POS and Sequence Alignment Algorithm

  • Authors: Mohamed Elhadi; Amjad Al-Tobi
  • Venue: CSIE '09 Proceedings of the 2009 WRI World Congress on Computer Science and Information Engineering - Volume 01
  • Year: 2009

Abstract

Syntactic categories and a sequence alignment algorithm are combined to weed out duplicate and near-duplicate webpages from search engine results. A POS tagger converts parts of a webpage's text into a string of part-of-speech (POS) tags that captures the text's syntactic structure. The resulting tag strings are then compared using Longest Common Subsequence (LCS) techniques, as is commonly done in computational biology, to detect duplicate and near-duplicate webpages. Tagging and alignment are applied to a set of sentences extracted from each webpage as its representative; the query keywords are used as the basis for sentence extraction. Experimental results show that this combined approach provides a useful similarity measure that can also be used for re-ranking. It can be applied with reasonable efficiency to detect duplicates in results returned by search engines such as Google. The similarity measurements can further serve as a basis for text analysis of search results, supporting the detection of duplicates and near-duplicates and the clustering of documents in general.
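
The pipeline outlined in the abstract (keyword-based sentence extraction, POS tagging, LCS comparison) can be illustrated with a minimal sketch. This is not the authors' implementation. The toy lexicon `TOY_LEXICON` stands in for a real POS tagger, the one-letter tag set is invented for illustration, and the normalization `2 * LCS / (len(a) + len(b))` is an assumed similarity score; the abstract does not specify any of these details.

```python
import re

# Hypothetical toy lexicon standing in for a real POS tagger (assumption).
# One-letter tags: D=determiner, N=noun, V=verb, P=preposition, R=adverb.
TOY_LEXICON = {
    "the": "D", "a": "D",
    "cat": "N", "dog": "N", "mat": "N", "rug": "N",
    "sat": "V", "slept": "V",
    "on": "P", "under": "P",
    "quickly": "R",
}


def select_sentences(text, query_keywords):
    """Keep only sentences that mention at least one query keyword."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    keywords = {k.lower() for k in query_keywords}
    return [
        s for s in sentences
        if keywords & set(re.findall(r"[a-z]+", s.lower()))
    ]


def tag_string(sentences):
    """Map each word to a one-letter tag; unknown words become 'X'."""
    words = re.findall(r"[a-z]+", " ".join(sentences).lower())
    return "".join(TOY_LEXICON.get(w, "X") for w in words)


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical strings, 0.0 for disjoint ones."""
    if not a and not b:
        return 1.0
    return 2.0 * lcs_length(a, b) / (len(a) + len(b))


def page_similarity(page_a, page_b, query_keywords):
    """Compare two pages by the tag strings of their keyword-bearing sentences."""
    tags_a = tag_string(select_sentences(page_a, query_keywords))
    tags_b = tag_string(select_sentences(page_b, query_keywords))
    return similarity(tags_a, tags_b)
```

Because the comparison runs on tag strings rather than words, two sentences with the same syntactic shape score highly even when their vocabulary differs. In this sketch, `page_similarity("The cat sat on the mat.", "A dog slept under a rug quickly.", ["cat", "dog"])` compares `DNVPDN` with `DNVPDNR`. A practical system would need a threshold on the score to decide what counts as a near-duplicate; no such threshold is given in the abstract.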