Web news extraction based on path pattern mining

  • Authors:
  • Gong-Qing Wu;Xindong Wu;Xue-Gang Hu;Hai-Guang Li;Ying Liu;Ren-Gan Xu

  • Affiliations:
  • School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China;School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China and Department of Computer Science, University of Vermont, Burlington;School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China;School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China;School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China;School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China

  • Venue:
  • FSKD'09 Proceedings of the 6th international conference on Fuzzy systems and knowledge discovery - Volume 7
  • Year:
  • 2009

Abstract

Many Web news sites share similar structures and layout styles. Our extensive case studies indicate a potential correlation between Web content layouts and path patterns. Compared with the delimiting features of Web content, path patterns offer several advantages, including high positioning accuracy, ease of use, and broad applicability. We therefore propose a Web information extraction model whose path patterns are constructed by a path pattern mining algorithm. Our experimental data set consists of news pages randomly selected from the CNN website. With a reasonable tolerance threshold, the experimental results show an average precision above 99% and an average recall of 100% when Web information extraction is integrated with our path pattern mining algorithm. The path patterns produced by the mining algorithm perform much better than a priori extraction rules configured from domain knowledge.
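
The abstract does not spell out the mining algorithm, so the following is only a rough sketch of the general idea of path-pattern-based extraction, not the authors' method. It assumes that a "path" is the chain of tag names from the document root to a text node, and that a path is kept as a pattern when it carries labeled news content in a large enough share of the training pages. The names `PathCollector`, `text_paths`, `mine_patterns`, `extract`, and the `min_support` parameter are all hypothetical, and `min_support` is not necessarily the tolerance threshold the paper refers to.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathCollector(HTMLParser):
    """Records the tag path (e.g. 'html/body/div/p') of each non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.items = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def text_paths(html):
    """Return (path, text) pairs for every text node in the page."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.items


def mine_patterns(labeled_pages, min_support):
    """Keep paths that hold labeled content in at least min_support of the pages.

    labeled_pages: list of (html, set_of_content_strings) pairs (assumed format).
    """
    counts = Counter()
    for html, content in labeled_pages:
        # Count each path at most once per page.
        counts.update({path for path, text in text_paths(html) if text in content})
    total = len(labeled_pages)
    return {path for path, count in counts.items() if count / total >= min_support}


def extract(html, patterns):
    """Return the text found at any of the mined paths, in document order."""
    return [text for path, text in text_paths(html) if path in patterns]


def make_page(title, body):
    # Toy page layout used only for this illustration.
    return (
        "<html><body><div><h1>" + title + "</h1><p>" + body + "</p></div>"
        "<ul><li>Nav</li></ul></body></html>"
    )


training = [
    (make_page("Title A", "Body A"), {"Title A", "Body A"}),
    (make_page("Title B", "Body B"), {"Title B", "Body B"}),
]
patterns = mine_patterns(training, min_support=1.0)
news = extract(make_page("Title C", "Body C"), patterns)
```

In this toy setting, the navigation text shares no path with the labeled content, so it is filtered out of `news`. Real news pages would likely need richer paths (for example, including attributes or sibling positions) to reach comparable accuracy.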