A webpage deletion algorithm based on hierarchical filtering

  • Authors:
  • Xunxun Chen; Wei Wang; Dapeng Man; Sichang Xuan

  • Affiliations:
  • National Computer Network Emergency Response Technical Team Coordination Center, Beijing, China; School of Computer Science and Technology, Harbin Engineering University, Harbin, China; School of Computer Science and Technology, Harbin Engineering University, Harbin, China; School of Computer Science and Technology, Harbin Engineering University, Harbin, China

  • Venue:
  • WISM'12: Proceedings of the 2012 International Conference on Web Information Systems and Mining
  • Year:
  • 2012

Abstract

Duplicate webpages degrade the user experience of search engines. This paper proposes a duplicate-webpage deletion algorithm based on hierarchical filtering, designed around the features of duplicate webpages. Webpage feature extraction is divided into three layers: paragraphs, sentences, and words. The webpage features are formed by filtering out redundant information layer by layer. In the sentence layer, sentences are extracted from each paragraph according to their semantics; in the word layer, the sentences are denoised by filtering based on statistics of the parts of speech they contain. The algorithm improves the noise immunity of feature extraction and its coverage of the original content. Experiments show that the proposed method can accurately filter out duplicate webpages.
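The three-layer idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual selection rules, so everything concrete here is an assumption: short blocks are dropped at the paragraph layer, sentences are ranked by content-word count as a crude stand-in for sentence semantics, a small function-word list (`FUNCTION_WORDS`) stands in for part-of-speech statistics, and Jaccard overlap of the resulting features is used to compare pages. All function names and thresholds are hypothetical.

```python
import re

# Assumption: a tiny function-word list stands in for the paper's
# part-of-speech statistics, which the abstract does not specify.
FUNCTION_WORDS = {
    "a", "an", "the", "of", "in", "on", "to", "and", "or", "is", "are",
    "was", "were", "by", "for", "with", "that", "this", "it", "as", "at",
}


def split_paragraphs(text):
    """Paragraph layer input: split on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Split after sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def content_words(sentence):
    """Word layer: lowercase tokens with function words filtered out."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return [w for w in words if w not in FUNCTION_WORDS]


def extract_features(text, min_paragraph_words=5, sentences_per_paragraph=2):
    """Build a set of denoised sentence features, filtering layer by layer."""
    features = set()
    for paragraph in split_paragraphs(text):
        # Paragraph layer: drop short blocks such as menus and footers.
        if len(re.findall(r"\w+", paragraph)) < min_paragraph_words:
            continue
        # Sentence layer (assumed heuristic): keep the sentences that
        # carry the most content words.
        ranked = sorted(
            split_sentences(paragraph),
            key=lambda s: len(content_words(s)),
            reverse=True,
        )
        for sentence in ranked[:sentences_per_paragraph]:
            words = content_words(sentence)
            if words:
                features.add(" ".join(words))
    return features


def similarity(page_a, page_b):
    """Jaccard overlap of the two pages' feature sets (an assumed measure)."""
    fa, fb = extract_features(page_a), extract_features(page_b)
    if not fa and not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Under these assumptions, two pages that share the same body text but differ in navigation and footer blocks produce identical feature sets, because the differing blocks are discarded at the paragraph layer. A real implementation would replace the function-word list with a part-of-speech tagger and choose a duplicate threshold empirically.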