We describe a method for extracting content text from diverse Web pages using the HTML document's Text-To-Tag Ratio rather than specific HTML cues, which may not be constant across Web pages. We describe how to compute the Text-To-Tag Ratio on a line-by-line basis and then cluster the results into content and non-content areas. With this approach, we show surprisingly high levels of recall at all levels of precision, as well as large space savings.
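
The line-by-line computation and the clustering step can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not state: the ratio is taken as the number of non-tag characters on a line divided by the number of tags on that line (with a divisor of 1 when a line has no tags), tags are found with a simple regular expression, and the clustering is a plain one-dimensional two-means split standing in for whatever clustering the method actually uses. The names `text_to_tag_ratio`, `ratios`, and `two_means` are hypothetical.

```python
import re

# Assumption: a "tag" is anything between '<' and '>' on a single line.
# Scripts, comments, and tags spanning several lines are not handled.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line: str) -> float:
    """Non-tag characters on the line divided by the number of tags."""
    tag_count = len(TAG_RE.findall(line))
    text_length = len(TAG_RE.sub("", line).strip())
    # Assumption: a line without tags uses a divisor of 1.
    return text_length / max(tag_count, 1)


def ratios(html: str) -> list[float]:
    """Compute the ratio for every line of the document."""
    return [text_to_tag_ratio(line) for line in html.splitlines()]


def two_means(values: list[float], iterations: int = 20) -> list[bool]:
    """Split values into a low and a high cluster; True marks the high one.

    This is a stand-in clustering step, not necessarily the one the
    method describes.
    """
    if not values:
        return []
    low, high = min(values), max(values)
    for _ in range(iterations):
        labels = [abs(v - high) < abs(v - low) for v in values]
        high_values = [v for v, is_high in zip(values, labels) if is_high]
        low_values = [v for v, is_high in zip(values, labels) if not is_high]
        if not high_values or not low_values:
            break
        low = sum(low_values) / len(low_values)
        high = sum(high_values) / len(high_values)
    return [abs(v - high) < abs(v - low) for v in values]
```

Lines labelled `True` by `two_means(ratios(html))` fall in the high-ratio cluster and would be treated as content under these assumptions; navigation and layout lines, which carry many tags and little text, fall in the low-ratio cluster.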