Text Extraction from the Web via Text-to-Tag Ratio

Authors:
Tim Weninger;William H. Hsu
Affiliations:
-;-
Venue:
DEXA '08 Proceedings of the 2008 19th International Conference on Database and Expert Systems Application
Year:
2008

Citing 0
Cited 9

Combining content extraction heuristics: the CombinE system
Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services

CETR: content extraction via tag ratios
Proceedings of the 19th international conference on World wide web

Automatic web information extraction based on rules
WISE'11 Proceedings of the 12th international conference on Web information system engineering

"Then click ok!": extracting references to interface elements in online documentation
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems

An architecture-centered framework for developing blog crawlers
Proceedings of the 27th Annual ACM Symposium on Applied Computing

RetriBlog: a framework for creating blog crawlers
Proceedings of the 27th Annual ACM Symposium on Applied Computing

RetriBlog: An architecture-centered framework for developing blog crawlers
Expert Systems with Applications: An International Journal

On text preprocessing for opinion mining outside of laboratory environments
AMT'12 Proceedings of the 8th international conference on Active Media Technology

Automatic Extraction of Blog Post from Diverse Blog Pages
WI-IAT '12 Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01

Abstract

We describe a method for extracting content text from diverse Web pages using the HTML document's Text-To-Tag Ratio rather than specific HTML cues, which may not be consistent across Web pages. We describe how to compute the Text-To-Tag Ratio line by line and then cluster the resulting ratios into content and non-content areas. With this approach we show surprisingly high levels of recall at all levels of precision, along with large space savings.
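
The line-by-line computation described in the abstract can be illustrated with a minimal Python sketch. Several choices here are assumptions, not details taken from the abstract: the regular expression used to find tags, the rule that a line with no tags is treated as having one tag (to avoid division by zero), and the use of the mean ratio as a cutoff in place of a real clustering step. The function names are hypothetical.

```python
import re
from statistics import mean
from typing import List

# Assumption: anything between "<" and ">" counts as one tag.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line: str) -> float:
    """Return non-tag characters divided by the number of tags on the line."""
    tag_count = len(TAG_RE.findall(line))
    text_length = len(TAG_RE.sub("", line).strip())
    # Assumption: a line without tags is treated as having one tag.
    return text_length / max(tag_count, 1)


def extract_content(html: str) -> List[str]:
    """Keep the text of lines whose ratio exceeds the mean ratio.

    The mean threshold is a stand-in for clustering, chosen only
    to keep the sketch short.
    """
    lines = [ln for ln in html.splitlines() if ln.strip()]
    if not lines:
        return []
    ratios = [text_to_tag_ratio(ln) for ln in lines]
    threshold = mean(ratios)
    return [
        TAG_RE.sub("", ln).strip()
        for ln, ratio in zip(lines, ratios)
        if ratio > threshold
    ]


if __name__ == "__main__":
    page = (
        '<div><a href="/">Home</a><a href="/about">About</a></div>\n'
        "<p>This is a long paragraph of article text that should be kept.</p>\n"
        "<div><span>Footer</span></div>"
    )
    for text in extract_content(page):
        print(text)
```

In this sketch, navigation and footer lines score low because they carry many tags and little text, while the paragraph line scores high. A fuller version would likely strip script and style blocks first and replace the mean cutoff with an actual clustering method.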