Verifying genre-based clustering approach to content extraction

Authors:
Suhit Gupta;Hila Becker;Gail Kaiser;Salvatore Stolfo
Affiliations:
Columbia University, New York, NY;Columbia University, New York, NY;Columbia University, New York, NY;Columbia University, New York, NY
Venue:
Proceedings of the 15th international conference on World Wide Web
Year:
2006

Abstract

The content of a webpage is usually contained within a small body of text and images, or perhaps several articles on the same page; however, the content may be lost in the clutter, which particularly hurts users browsing on small cell phone and PDA screens and visually impaired users relying on speech rendering of web pages. Using the genre of a web page, we have created a solution, Crunch, that automatically identifies clutter and removes it, leaving a clean, content-full page. To evaluate the improvement this technology brings to its applications, we identified a number of experiments. In this paper, we present those experiments, the associated results, and their evaluation.