Extracting context to improve accuracy for HTML content extraction

Authors:
Suhit Gupta;Gail Kaiser;Salvatore Stolfo
Affiliations:
Columbia University, New York, NY;Columbia University, New York, NY;Columbia University, New York, NY
Venue:
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Year:
2005

Abstract

Previous work on content extraction utilized various heuristics, such as link-to-text ratio, prominence of tables, and identification of advertising. Many of these heuristics were associated with "settings", whereby some heuristics could be turned on or off and others parameterized by minimum or maximum threshold values. A given collection of settings, such as removing table cells with high linked to non-linked text ratios and removing all apparent advertising, might work very well for a news website but leave little or no content for the reader of a shopping site or a web portal. We present a new technique, based on incrementally clustering websites using search engine snippets, to associate a newly requested website with a particular "genre", and then employ settings previously determined to be appropriate for that genre, with dramatically improved content extraction results overall.
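
The genre lookup described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm, so this sketch assumes a simple one: each snippet becomes a bag of words, and a site joins the most similar existing cluster by cosine similarity or starts a new cluster when nothing is similar enough. The class names, the threshold value, and the setting names (`max_link_text_ratio`, `remove_ads`) are hypothetical, and the snippet strings stand in for real search engine results.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def bag_of_words(snippet: str) -> Counter:
    """Lowercase the snippet and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", snippet.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Genre:
    """One cluster of sites plus the extraction settings tuned for it."""
    settings: dict                      # hypothetical heuristic settings
    centroid: Counter = field(default_factory=Counter)
    members: int = 0

    def add(self, words: Counter) -> None:
        # Incremental update: fold the new site's terms into the running totals.
        self.centroid.update(words)
        self.members += 1


class GenreClusterer:
    """Assumed incremental clustering: nearest cluster above a threshold, else a new one."""

    def __init__(self, default_settings: dict, threshold: float = 0.3):
        self.default_settings = default_settings
        self.threshold = threshold      # hypothetical similarity cutoff
        self.genres: list[Genre] = []

    def assign(self, snippet: str) -> Genre:
        words = bag_of_words(snippet)
        best = max(self.genres, key=lambda g: cosine(words, g.centroid), default=None)
        if best is None or cosine(words, best.centroid) < self.threshold:
            # No sufficiently similar genre yet: start one with default settings.
            best = Genre(settings=dict(self.default_settings))
            self.genres.append(best)
        best.add(words)
        return best


# Illustrative usage with made-up snippets.
clusterer = GenreClusterer({"max_link_text_ratio": 0.5, "remove_ads": True})
news = clusterer.assign("breaking news headlines world politics news report")
shop = clusterer.assign("buy shoes online shop cart discount prices")
shop.settings["max_link_text_ratio"] = 0.9   # shopping pages are link-heavy
again = clusterer.assign("latest world news headlines and politics")
```

In this sketch the third site lands in the same cluster as the first, so it reuses whatever settings were tuned for that genre, while the shopping site keeps its own more permissive link threshold.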