Previous work on content extraction used various heuristics, such as link-to-text ratio, prominence of tables, and identification of advertising. Many of these heuristics were associated with "settings", whereby some heuristics could be turned on or off and others parameterized by minimum or maximum threshold values. A given collection of settings (such as removing table cells with a high ratio of linked to non-linked text and removing all apparent advertising) might work very well for a news website but leave little or no content for the reader of a shopping site or a web portal. We present a new technique, based on incrementally clustering websites using search engine snippets, that associates a newly requested website with a particular "genre" and then employs settings previously determined to be appropriate for that genre, with dramatically improved content extraction results overall.
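
The abstract does not specify the clustering algorithm or the format of the settings, so the following is only a minimal sketch of the general idea under stated assumptions: snippets are compared as bag-of-words vectors with cosine similarity, a new cluster is opened when no existing genre is similar enough, and each genre carries a dictionary of extraction settings. The class `GenreClusters`, the setting names `max_link_text_ratio` and `remove_ads`, and the threshold of 0.3 are all hypothetical and chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical fallback settings for a genre that has not been tuned yet.
DEFAULT_SETTINGS = {"max_link_text_ratio": 0.5, "remove_ads": False}


def tokenize(text):
    """Lowercase the snippet and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class GenreClusters:
    """Incrementally groups sites by snippet text; one settings dict per genre."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed similarity cut-off
        self.clusters = []          # each entry: {"centroid": Counter, "settings": dict}

    def seed(self, snippets, settings):
        """Create a genre from example snippets and its known-good settings."""
        centroid = Counter()
        for snippet in snippets:
            centroid.update(tokenize(snippet))
        self.clusters.append({"centroid": centroid, "settings": dict(settings)})
        return len(self.clusters) - 1

    def assign(self, snippet):
        """Return (genre index, settings) for the snippet of a newly requested site."""
        vec = Counter(tokenize(snippet))
        best, best_sim = None, 0.0
        for i, cluster in enumerate(self.clusters):
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None or best_sim < self.threshold:
            # Nothing similar enough: open a new genre with default settings.
            best = self.seed([snippet], DEFAULT_SETTINGS)
        else:
            # Incremental step: fold the new snippet into the matched centroid.
            self.clusters[best]["centroid"].update(vec)
        return best, self.clusters[best]["settings"]


clusters = GenreClusters()
news = clusters.seed(
    ["breaking news headlines world politics latest stories"],
    {"max_link_text_ratio": 0.3, "remove_ads": True},
)
shop = clusters.seed(
    ["shop online deals prices buy cart free shipping"],
    {"max_link_text_ratio": 0.9, "remove_ads": False},
)
genre, settings = clusters.assign("Latest world news and breaking stories")
print(genre == news, settings)
```

In this sketch a snippet that resembles the news examples inherits the stricter news settings, while a shopping-like snippet would inherit the more permissive ones, so aggressive link and advertising removal is not applied to pages where it would strip most of the content.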