Automating Content Extraction of HTML Documents

Authors:
Suhit Gupta;Gail E. Kaiser;Peter Grimm;Michael F. Chiang;Justin Starren
Affiliations:
Department of Computer Sciences, Columbia University, New York, USA 10027;Department of Computer Sciences, Columbia University, New York, USA 10027;Department of Electrical Engineering, Columbia University, New York, USA 10027;Departments of Ophthalmology and Biomedical Informatics, Columbia University, New York, USA 10032;Departments of Biomedical Informatics and Radiology, Columbia University, New York, USA 10032
Venue:
World Wide Web
Year:
2005

Abstract

Web pages often contain clutter (such as unnecessary images and extraneous links) around the body of an article that distracts the user from the actual content. Extracting "useful and relevant" content from web pages has many applications, including cell phone and PDA browsing, speech rendering for the visually impaired, and text summarization. Most approaches to making content more readable involve changing the font size or removing HTML and data components such as images, which detracts from a web page's inherent look and feel. Unlike "Content Reformatting," which aims to reproduce the entire web page in a more convenient form, our solution directly addresses "Content Extraction." We have developed a framework that employs an easily extensible set of techniques and incorporates the advantages of previous work on content extraction. Our key insight is to work with DOM trees, a W3C-specified interface that allows programs to dynamically access document structure, rather than with raw HTML markup. We have implemented our approach in a publicly available Web proxy that extracts content from HTML web pages. The proxy can be used either centrally, administered for groups of users, or by individuals for their personal browsers. After receiving user feedback on the proxy, we also created a revised version designed for improved performance and accessibility.
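
The idea of running an extensible set of filters over a DOM tree, rather than over raw markup, can be illustrated with a minimal Python sketch. This is not the authors' proxy: the filter names, the 0.5 link-ratio threshold, the link-list heuristic, and the sample page are all assumptions made for illustration, and the input is assumed to be well-formed XHTML so that the standard library's `xml.dom.minidom` can parse it.

```python
from xml.dom import minidom


def text_of(node):
    """Concatenate all text found beneath a DOM node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def remove_tags(node, names):
    """Delete every descendant element whose tag name is in `names`."""
    for child in list(node.childNodes):
        if child.nodeType == child.ELEMENT_NODE:
            if child.tagName.lower() in names:
                node.removeChild(child)
            else:
                remove_tags(child, names)


def strip_images(root):
    remove_tags(root, {"img"})


def strip_scripts(root):
    remove_tags(root, {"script", "style"})


def strip_link_lists(root, threshold=0.5):
    """Hypothetical heuristic: drop <ul> lists whose text is mostly link text."""
    for ul in list(root.getElementsByTagName("ul")):
        total = len(text_of(ul).strip())
        linked = sum(len(text_of(a)) for a in ul.getElementsByTagName("a"))
        if total and linked / total > threshold and ul.parentNode is not None:
            ul.parentNode.removeChild(ul)


# The filter set is an ordinary list, so adding a technique means appending a function.
FILTERS = [strip_images, strip_scripts, strip_link_lists]


def extract(xhtml, filters=FILTERS):
    """Parse well-formed XHTML into a DOM tree and apply each filter in order."""
    doc = minidom.parseString(xhtml)
    for apply_filter in filters:
        apply_filter(doc.documentElement)
    return doc


def visible_text(doc):
    """Return the remaining text with whitespace collapsed."""
    return " ".join(text_of(doc.documentElement).split())


# Made-up sample page: a navigation list, an ad image, an article, a script.
PAGE = """<html><body>
<ul><li><a href="/a">Home</a></li><li><a href="/b">Sports</a></li></ul>
<div><img src="ad.png" alt="ad"/><p>Article text with one <a href="/c">inline link</a>.</p></div>
<script>track()</script>
</body></html>"""

print(visible_text(extract(PAGE)))  # prints: Article text with one inline link.
```

Because each filter edits the tree in place, the inline link inside the article paragraph survives while the navigation list, the image, and the script are removed. Real-world HTML is rarely well-formed, so a practical version would need a tolerant HTML parser in place of `minidom`.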