Data-rich Section Extraction from HTML pages

  • Authors:
  • Jiying Wang; Frederick H. Lochovsky

  • Venue:
  • WISE '02 Proceedings of the 3rd International Conference on Web Information Systems Engineering
  • Year:
  • 2002

Abstract

In this paper, we propose a novel algorithm, DSE (Data-rich Subtree Extraction), to recognize and extract the data-rich section of an HTML page. We apply the DSE algorithm as a pre-processing "clean-up" step for two typical web information retrieval problems: topic distillation and web information extraction. Our experiments show that, for the test data sets we used, the DSE algorithm can correctly identify the data-rich sections of HTML pages with 100% accuracy. Therefore, it can effectively reduce the root set size for the topic distillation problem, thereby improving the precision and accuracy of the HITS algorithm. Furthermore, when applied to the web information extraction problem using the IEPAD algorithm, it can decrease the number of patterns discovered by this algorithm, thus shortening its time cost to generalize a wrapper for HTML pages.