An indent shape based approach for web lists mining

  • Authors:
  • Yanxu Zhu;Gang Yin;Huaimin Wang;Dianxi Shi;Xiang Li;Lin Yuan

  • Affiliations:
  • College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China;College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China;College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China;College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China;College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China;College of Electronic Technology, Information Engineering University, Zhengzhou, Henan, China

  • Venue:
  • WISM'11 Proceedings of the 2011 international conference on Web information systems and mining - Volume Part II
  • Year:
  • 2011

Abstract

Mining repeated patterns from HTML documents is a key step in typical Web information extraction applications, which need efficient pattern mining techniques to generate wrappers automatically. Existing approaches such as tree matching and string matching detect repeated patterns with high precision, but their efficiency remains a challenge. In this paper, we present a novel approach to Web list mining based on the indent shape of HTML documents. The indent shape is a simplified abstraction of an HTML document in which tandem repeated waves indicate potential repeated patterns. By scanning a horizontal line along the indent shape, the tandem repeated waves can be identified efficiently; these reveal the repeated patterns in the document, from which the lists of the target Web page can be extracted. Extensive experiments show that our approach achieves better performance and efficiency than existing approaches.
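
The abstract does not define the indent shape or the scanning procedure in detail, so the following is only an illustrative sketch of one possible reading, not the authors' algorithm. It assumes that the indent shape is the sequence of tag nesting depths in document order, that a "wave" is the stretch of the shape between two consecutive touches of a horizontal line drawn at a given depth, and that the input HTML is well formed (every non-void tag is closed). The names `indent_shape`, `waves_at`, `tandem_repeats`, and the `VOID_TAGS` set are hypothetical helpers introduced for this example.

```python
from html.parser import HTMLParser

# Assumption: a small, non-exhaustive set of tags that never take a closing tag.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class IndentShapeParser(HTMLParser):
    """Record (depth, tag) for every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.shape = []

    def handle_starttag(self, tag, attrs):
        self.shape.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)


def indent_shape(html):
    """Return the assumed indent shape: a list of (depth, tag) pairs."""
    parser = IndentShapeParser()
    parser.feed(html)
    parser.close()
    return parser.shape


def waves_at(shape, level):
    """Cut the shape along a horizontal line at `level`.

    Each wave starts where the shape touches the line and contains the
    deeper points that follow. A None entry marks a point above the line,
    so waves on either side of it are not treated as adjacent.
    """
    waves, current = [], None
    for depth, tag in shape:
        if depth == level:
            if current is not None:
                waves.append(tuple(current))
            current = [(depth, tag)]
        elif depth > level and current is not None:
            current.append((depth, tag))
        else:
            if current is not None:
                waves.append(tuple(current))
            current = None
            waves.append(None)
    if current is not None:
        waves.append(tuple(current))
    return waves


def tandem_repeats(shape, level, min_repeats=2):
    """Return runs of at least `min_repeats` adjacent identical waves."""
    runs, run = [], []
    for wave in waves_at(shape, level) + [None]:
        if wave is not None and run and wave == run[-1]:
            run.append(wave)
        else:
            if len(run) >= min_repeats:
                runs.append(run)
            run = [wave] if wave is not None else []
    return runs


page = (
    "<html><body><ul>"
    "<li><a>x</a></li><li><a>y</a></li><li><a>z</a></li>"
    "</ul><p>t</p></body></html>"
)
shape = indent_shape(page)
runs = tandem_repeats(shape, level=3)
```

In this toy page, the three `li` items sit at depth 3 and produce three identical waves, which the scan at that level reports as one tandem repeat; scanning at depth 2 finds no repeat because the `ul` and `p` waves differ. Requiring exact equality between waves is a simplification chosen for brevity; real list items usually vary, so a practical detector would likely need an approximate comparison.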