Duplicate webpages degrade the user experience of search engines. This paper proposes a duplicate-webpage deletion algorithm based on hierarchical filtering, designed around the features of duplicate webpages. Webpage feature extraction is divided into three layers: paragraphs, sentences, and words. The webpage features are formed by filtering out redundant information layer by layer. In the sentence layer, sentences are extracted from each paragraph according to sentence semantics, while in the word layer the sentences are denoised by filtering based on statistics of the parts of speech they contain. The algorithm improves both the noise immunity of feature extraction and its coverage of the original content. Experiments show that the proposed method can accurately filter out duplicate webpages.
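The three-layer idea can be illustrated with a minimal Python sketch. This is not the paper's implementation: the abstract gives no thresholds, semantic model, or part-of-speech rules, so every concrete choice here is an assumption made for illustration. The paragraph layer drops short blocks, the sentence layer keeps only the first sentence of each paragraph (a placeholder for semantic extraction), and the word layer drops a small hand-written list of function words (a placeholder for part-of-speech statistics). The names `FUNCTION_WORDS`, `MIN_PARAGRAPH_WORDS`, `page_fingerprint`, and `filter_duplicates` are hypothetical.

```python
import hashlib
import re

# Assumption: a tiny function-word list stands in for the part-of-speech
# statistics used in the paper, which the abstract does not specify.
FUNCTION_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or",
                  "to", "is", "are", "by", "for", "with"}

# Assumption: blocks shorter than this are treated as noise (menus, footers).
MIN_PARAGRAPH_WORDS = 5


def paragraph_layer(text):
    """Split on blank lines and drop short blocks."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text)]
    return [b for b in blocks if len(b.split()) >= MIN_PARAGRAPH_WORDS]


def sentence_layer(paragraphs):
    """Keep the first sentence of each paragraph as its representative.

    Placeholder for the semantic sentence extraction the abstract mentions.
    """
    return [re.split(r"(?<=[.!?])\s+", p)[0] for p in paragraphs]


def word_layer(sentences):
    """Lowercase, tokenize, and drop function words from each sentence."""
    cleaned = []
    for s in sentences:
        words = re.findall(r"[a-z0-9]+", s.lower())
        cleaned.append(" ".join(w for w in words if w not in FUNCTION_WORDS))
    return cleaned


def page_fingerprint(text):
    """Hash the features that survive all three filtering layers."""
    feature = "\n".join(word_layer(sentence_layer(paragraph_layer(text))))
    return hashlib.sha1(feature.encode("utf-8")).hexdigest()


def filter_duplicates(pages):
    """Return the pages whose fingerprint has not been seen before, in order."""
    seen = set()
    unique = []
    for page in pages:
        fp = page_fingerprint(page)
        if fp not in seen:
            seen.add(fp)
            unique.append(page)
    return unique
```

Under these assumptions, two pages that share the same main sentence but differ in navigation text, footers, or function words map to the same fingerprint, so only the first is kept. An exact hash match is the simplest possible comparison; a real system would more likely compare features with a similarity measure so that small edits do not defeat detection.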