Finding and using the content texts of HTML pages

  • Authors:
  • Jun Ma; Zhumin Chen; Li Lian; Lianxia Li

  • Affiliations:
  • The College of Computer Science and Technology, Shandong University, Jinan, China; The College of Computer Science and Technology, Shandong University, Jinan, China; The College of Computer Science and Technology, Shandong University, Jinan, China; The College of Computer Science and Technology, Shandong University, Jinan, China

  • Venue:
  • AIRS'08: Proceedings of the 4th Asia Information Retrieval Conference on Information Retrieval Technology
  • Year:
  • 2008

Abstract

A novel algorithm for finding the content text of an HTML page is proposed, based on a number of features of the textual blocks in the page. Experiments show that the new algorithm outperforms known ones on two measures: the ratio of correctly removed noise blocks and the ratio of correctly found content blocks. The application of the algorithm to hidden web classification is also demonstrated.
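
The abstract does not say which block features the algorithm uses, so the following is only a minimal sketch of the general idea of feature-based content-block detection, not the authors' method. It assumes two commonly used features, text length and link density, and the names `BlockCollector`, `content_blocks`, `min_chars`, and `max_link_density`, as well as the threshold values, are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit textual blocks; the paper may segment differently.
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "section", "article"}


class BlockCollector(HTMLParser):
    """Splits a page into textual blocks at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks: (text, characters inside links)
        self._pieces = []      # text pieces of the block being built
        self._link_chars = 0   # characters of the current block inside <a>
        self._in_link = 0      # nesting depth inside <a>
        self._skip = 0         # nesting depth inside <script>/<style>

    def flush(self):
        # Close the current block, normalizing whitespace.
        text = " ".join("".join(self._pieces).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._pieces, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style bodies
        self._pieces.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def content_blocks(html, min_chars=40, max_link_density=0.3):
    """Return blocks that look like content under the two assumed features."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text outside block tags
    kept = []
    for text, link_chars in parser.blocks:
        density = min(1.0, link_chars / len(text))
        # Heuristic: content blocks are long and contain little link text.
        if len(text) >= min_chars and density <= max_link_density:
            kept.append(text)
    return kept


if __name__ == "__main__":
    page = (
        '<div><a href="/">Home</a> <a href="/news">News</a></div>'
        "<p>This paragraph is the main article text and is long enough "
        "to be kept as content.</p>"
    )
    for block in content_blocks(page):
        print(block)
```

Under these assumptions, a navigation bar is dropped as noise because nearly all of its text sits inside links, while a long paragraph with few links is kept. A fuller feature set would be needed to approach the results the abstract reports.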