Automatic web information extraction based on rules

  • Authors:
  • Fanghuai Hu; Tong Ruan; Zhiqing Shao; Jun Ding

  • Affiliations:
  • Department of Computer Science and Engineering, East China University of Science and Technology (all authors)

  • Venue:
  • WISE'11: Proceedings of the 12th International Conference on Web Information System Engineering
  • Year:
  • 2011

Abstract

Web information extraction is the first step of effective web mining. This paper summarizes a few heuristic rules that characterize the main content of web pages. The rules are built from pre-defined terms and metrics, which are reusable and extensible across different kinds of HTML pages. A probabilistic model that uses these rules and metrics is then proposed, and the corresponding algorithm is implemented. The algorithm is tested on 1000 randomly selected web pages. The experiment shows that it is more precise than other algorithms and adapts better to the diverse structures of different web sites.
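
The abstract does not list the rules, terms, or metrics themselves, so the following is only a minimal sketch of the general idea of scoring page blocks with heuristic metrics and a probabilistic combination. Everything in it is an assumption made for illustration: the block tags, the three metrics (text length, link-text ratio, punctuation count), the logistic weights, and the names `BlockCollector`, `main_content_probability`, and `extract_main_block` are hypothetical and are not taken from the paper. It also assumes reasonably well-nested HTML.

```python
import math
from html.parser import HTMLParser

# Hypothetical choices for this sketch; the paper's own terms are not given here.
BLOCK_TAGS = {"div", "td", "article", "section"}
PUNCTUATION = set(".,;:!?")


class BlockCollector(HTMLParser):
    """Collects simple text metrics for each block-level element."""

    def __init__(self):
        super().__init__()
        self.open_blocks = []   # stack of blocks whose end tag has not been seen
        self.blocks = []        # completed blocks, in closing order
        self.link_depth = 0     # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.open_blocks.append(
                {"id": dict(attrs).get("id"), "chars": 0, "link_chars": 0, "punct": 0}
            )
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.open_blocks:
            self.blocks.append(self.open_blocks.pop())
        elif tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.open_blocks:
            return
        block = self.open_blocks[-1]  # credit text to the innermost open block
        block["chars"] += len(text)
        block["punct"] += sum(ch in PUNCTUATION for ch in text)
        if self.link_depth > 0:
            block["link_chars"] += len(text)


def main_content_probability(block, bias=-3.0, w_len=0.02, w_link=-4.0, w_punct=0.3):
    """Logistic combination of the metrics; the weights are illustrative guesses."""
    if block["chars"] == 0:
        return 0.0
    link_ratio = block["link_chars"] / block["chars"]
    z = (bias
         + w_len * min(block["chars"], 500)
         + w_link * link_ratio
         + w_punct * min(block["punct"], 20))
    return 1.0 / (1.0 + math.exp(-z))


def extract_main_block(html):
    """Returns (block, probability) for the highest-scoring block, or (None, 0.0)."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    scored = [(main_content_probability(b), b) for b in collector.blocks]
    if not scored:
        return None, 0.0
    prob, block = max(scored, key=lambda pair: pair[0])
    return block, prob
```

In this sketch, a navigation block made up entirely of link text receives a low score, while a long block of punctuated plain text receives a high one. A real system would presumably fit such weights to labeled pages rather than set them by hand.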