Text Extraction from the Web via Text-to-Tag Ratio

  • Authors: Tim Weninger; William H. Hsu

  • Venue: DEXA '08 Proceedings of the 2008 19th International Conference on Database and Expert Systems Application
  • Year: 2008

Abstract

We describe a method for extracting content text from diverse Web pages using the HTML document's Text-To-Tag Ratio, rather than specific HTML cues that may not be consistent across Web pages. We show how to compute the Text-To-Tag Ratio on a line-by-line basis and then cluster the results into content and non-content areas. Using this approach, we show surprisingly high levels of recall at all levels of precision, along with large space savings.
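
The line-by-line computation described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formula or the clustering method, so several details here are assumptions: the ratio is taken as non-tag characters divided by tag count (falling back to the character count when a line has no tags), tags are found with a simple regular expression rather than a real HTML parser, and the "clustering" is reduced to a threshold at the mean ratio. The names `text_to_tag_ratio` and `extract_content` are hypothetical.

```python
import re
from statistics import mean

# Naive tag matcher (assumption): anything between '<' and '>' counts as one tag.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line: str) -> float:
    """Return non-tag characters divided by the number of tags on one line.

    Assumption: when the line has no tags, the ratio is the text length itself.
    """
    n_tags = len(TAG_RE.findall(line))
    n_text = len(TAG_RE.sub("", line).strip())
    return n_text / n_tags if n_tags else float(n_text)


def extract_content(html: str) -> list[str]:
    """Keep the text of lines whose ratio exceeds the mean ratio.

    The mean threshold is a stand-in for the clustering step, not the
    authors' method.
    """
    lines = html.splitlines()
    if not lines:
        return []
    ratios = [text_to_tag_ratio(line) for line in lines]
    threshold = mean(ratios)
    return [
        TAG_RE.sub("", line).strip()
        for line, ratio in zip(lines, ratios)
        if ratio > threshold
    ]
```

Navigation and markup-heavy lines tend to score low because they contain many tags and little text, while paragraph lines score high. This sketch does not strip script, style, or comment blocks first, which a practical version would likely need to do.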