Duplicate page detection algorithm based on the field characteristic clustering

  • Authors:
  • Feiyue Ye; Junlei Liu; Bing Liu; Kun Chai

  • Affiliations:
  • Dept. of Computer Engineering and Science, Shanghai University, Shanghai, China (all four authors)

  • Venue:
  • ICWL'10 Proceedings of the 2010 international conference on New horizons in web-based learning
  • Year:
  • 2010

Abstract

The speed and accuracy of cognition-based interactive computing are crucial in an information retrieval system for web wisdom. In this paper, we analyze common duplicate detection algorithms, identify their drawbacks, and propose a new duplicate detection algorithm based on field characteristic clustering. By using field knowledge to build a characteristic string and applying an improved k-means clustering algorithm, we shorten the comparison time in duplicate detection. Finally, we compare the algorithm experimentally with the traditional SCAM and DSC algorithms in terms of time consumption, accuracy, and recall. The results show that the algorithm reduces time and storage consumption compared with the traditional SCAM algorithm. Compared with the DSC algorithm, it mitigates the inaccuracy introduced by using shingles to represent a page during duplicate detection. We conclude that the duplicate detection algorithm based on field characteristic clustering raises precision and recall in web duplicate page detection and will improve the speed and accuracy of an information retrieval system for web wisdom.
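
The general idea described in the abstract (summarize each page by field-specific features, cluster the summaries, then compare only pages that share a cluster) can be illustrated with a minimal sketch. This is not the authors' implementation: the vocabulary `FIELD_TERMS`, the count-vector form of the characteristic string, the plain (not "improved") k-means, the cosine similarity test, and the threshold value are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical field vocabulary standing in for the paper's "field knowledge".
FIELD_TERMS = ["cluster", "duplicate", "retrieval", "shingle", "index"]


def characteristic_vector(text, terms=FIELD_TERMS):
    """Count, for each field term, the words in the page that start with it."""
    words = text.lower().split()
    return [sum(1 for w in words if w.startswith(t)) for t in terms]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def kmeans(vectors, k, iterations=10):
    """Plain k-means, seeded with the first k vectors so runs are repeatable."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        for i, v in enumerate(vectors):
            labels[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def find_duplicates(pages, k=2, threshold=0.95):
    """Return index pairs of likely duplicates, comparing only within clusters."""
    vectors = [characteristic_vector(p) for p in pages]
    labels = kmeans(vectors, k)
    pairs = []
    for i, j in combinations(range(len(pages)), 2):
        # Pages in different clusters are never compared in detail.
        if labels[i] == labels[j] and cosine(vectors[i], vectors[j]) >= threshold:
            pairs.append((i, j))
    return pairs
```

Calling `find_duplicates(pages)` on a list of page texts returns the index pairs judged to be duplicates. Restricting the similarity test to pages in the same cluster is what cuts the number of pairwise comparisons; how the paper builds its characteristic string and modifies k-means is not specified in the abstract.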