Duplicate page detection algorithm based on the field characteristic clustering

Authors:
Feiyue Ye;Junlei Liu;Bing Liu;Kun Chai
Affiliations:
Dept. of Computer Engineering and Science, Shanghai University, Shanghai, China;Dept. of Computer Engineering and Science, Shanghai University, Shanghai, China;Dept. of Computer Engineering and Science, Shanghai University, Shanghai, China;Dept. of Computer Engineering and Science, Shanghai University, Shanghai, China
Venue:
ICWL'10 Proceedings of the 2010 international conference on New horizons in web-based learning
Year:
2010

Citing 2
Cited 0

Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web

K-means Clustering Algorithm with Improved Initial Center
WKDD '09 Proceedings of the 2009 Second International Workshop on Knowledge Discovery and Data Mining

Abstract

The speed and accuracy of cognitive-based interactive computing are crucial in an information retrieval system for web wisdom. In this paper, we analyze common duplicate detection algorithms, identify their drawbacks, and propose a new duplicate detection algorithm based on field characteristic clustering. By using field knowledge to build the characteristic string and taking advantage of an improved k-means clustering algorithm, we shorten the comparison process in duplicate detection. Finally, we compare this algorithm experimentally with the traditional SCAM and DSC algorithms in terms of time consumption, accuracy rate, and recall rate. The results show that the algorithm reduces the time and storage consumption of the traditional SCAM algorithm. Compared with the DSC algorithm, it reduces the inaccuracy introduced by using shingles to represent a page during duplicate detection. We conclude that the duplicate detection algorithm based on field characteristic clustering raises the precision and recall of web duplicate page detection and will improve the speed and accuracy of an information retrieval system for web wisdom.
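
The general idea of clustering pages first and comparing them only within a cluster can be illustrated with a minimal sketch. The abstract does not say how the characteristic string is built or how k-means is improved, so everything here is an assumption made for illustration: the vocabulary `FIELD_TERMS`, the count-vector stand-in for the characteristic string, plain k-means with deterministic initial centres, the word-set Jaccard comparison, and the 0.8 threshold are all hypothetical and not taken from the paper.

```python
from collections import Counter
from itertools import combinations

# Hypothetical "field knowledge": a small domain vocabulary (assumption).
FIELD_TERMS = ["course", "lecture", "exam", "football", "match", "goal"]

# Toy pages used only to exercise the sketch.
PAGES = [
    "the course lecture covers the exam topics for this course",
    "the course lecture covers the exam topics for this course today",
    "the football match ended with a late goal in the match",
    "a football match ended with one late goal in the match",
    "exam timetable for the lecture course is posted",
]

def characteristic_vector(text):
    """Stand-in for the characteristic string: counts of each field term."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in FIELD_TERMS]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=10):
    """Plain k-means; initial centres are the first k distinct vectors."""
    centres = []
    for v in vectors:
        if list(v) not in centres:
            centres.append(list(v))
        if len(centres) == k:
            break
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centre for every vector.
        labels = [
            min(range(len(centres)), key=lambda c: squared_distance(v, centres[c]))
            for v in vectors
        ]
        # Update step: move each centre to the mean of its members.
        for c in range(len(centres)):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def jaccard(a, b):
    """Word-set Jaccard similarity of two page texts."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

def find_duplicates(pages, k=2, threshold=0.8):
    """Compare pages pairwise, but only inside the same cluster."""
    labels = kmeans([characteristic_vector(p) for p in pages], k)
    return [
        (i, j)
        for i, j in combinations(range(len(pages)), 2)
        if labels[i] == labels[j] and jaccard(pages[i], pages[j]) >= threshold
    ]

print(find_duplicates(PAGES))
```

Restricting the pairwise comparison to pages that share a cluster is what saves time in this sketch: pages about unrelated fields are never compared with each other.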