DOM-based content extraction of HTML documents

Authors:
Suhit Gupta;Gail Kaiser;David Neistadt;Peter Grimm
Affiliations:
Columbia University, New York, NY;Columbia University, New York, NY;Columbia University, New York, NY;Columbia University, New York, NY
Venue:
WWW '03 Proceedings of the 12th international conference on World Wide Web
Year:
2003

Citing 6
Cited 65

A new paradigm for browsing the web

CHI '95 Conference Companion on Human Factors in Computing Systems
A hierarchical approach to wrapper induction

Proceedings of the third annual conference on Autonomous Agents
Two approaches to bringing Internet services to WAP devices

Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications networking
Accordion summarization for end-game browsing on PDAs and cellular phones

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Seeing the whole in parts: text summarization for web browsing on handheld devices

Proceedings of the 10th international conference on World Wide Web
Automatic identification and organization of index terms for interactive browsing

Proceedings of the 1st ACM/IEEE-CS joint conference on Digital libraries

Learning block importance models for web pages

Proceedings of the 13th international conference on World Wide Web
Fine-grained, structured configuration management for web projects

Proceedings of the 13th international conference on World Wide Web
Scaffolding visually cluttered web pages to facilitate accessibility

Proceedings of the working conference on Advanced visual interfaces
Display-agnostic hypermedia

Proceedings of the fifteenth ACM conference on Hypertext and hypermedia
Integrating the web and the world: contextual trails on the move

Proceedings of the fifteenth ACM conference on Hypertext and hypermedia
Collapse-to-zoom: viewing web pages on small screen devices by interactively removing irrelevant content

Proceedings of the 17th annual ACM symposium on User interface software and technology
Learning important models for web page blocks based on layout and content analysis

ACM SIGKDD Explorations Newsletter
Adapting Web Content to Mobile User Agents

IEEE Internet Computing
Extracting content from accessible web pages

W4A '05 Proceedings of the 2005 International Cross-Disciplinary Workshop on Web Accessibility (W4A)
Extracting context to improve accuracy for HTML content extraction

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
A general methodology for context-aware data access

Proceedings of the 4th ACM international workshop on Data engineering for wireless and mobile access
From the writable web to global editability

Proceedings of the sixteenth ACM conference on Hypertext and hypermedia
Separating XHTML content from navigation clutter using DOM-structure block analysis

Proceedings of the sixteenth ACM conference on Hypertext and hypermedia
Learning Object Models from Semistructured Web Documents

IEEE Transactions on Knowledge and Data Engineering
Verifying genre-based clustering approach to content extraction

Proceedings of the 15th international conference on World Wide Web
A Flexible Content Adaptation System Using a Rule-Based Approach

IEEE Transactions on Knowledge and Data Engineering
Web-based list question answering

COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Vertical Navigation of Layout Adapted Web Documents

World Wide Web
Efficient web browsing on small screens

AVI '08 Proceedings of the working conference on Advanced visual interfaces
A user evaluation of the SADIe transcoder

Proceedings of the 10th international ACM SIGACCESS conference on Computers and accessibility
Spatial Relation Based Object Extraction from the World Wide Web

WI-IAT '08 Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 03
Validation of streaming XML documents with abstract state machines

Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services
Combining content extraction heuristics: the CombinE system

Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services
Information extraction from syllabi for academic e-Advising

Expert Systems with Applications: An International Journal
Finding text reuse on the web

Proceedings of the Second ACM International Conference on Web Search and Data Mining
An Informative DOM Subtree Identification Method from Web Pages in Unfamiliar Web Sites

IEICE - Transactions on Information and Systems
Where are your manners?: Sharing best community practices in the web 2.0

Proceedings of the 2009 ACM symposium on Applied Computing
Can we learn a template-independent wrapper for news article extraction from a single training site?

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Profile-based focused crawling for social media-sharing websites

Journal on Image and Video Processing
Web document text and images extraction using DOM analysis and natural language processing

Proceedings of the 9th ACM symposium on Document engineering
Theme Extraction from Chinese Web Documents Based on Page Segmentation and Entropy

ISMIS '09 Proceedings of the 18th International Symposium on Foundations of Intelligent Systems
Retrieval of reading materials for vocabulary and reading practice

EANL '08 Proceedings of the Third Workshop on Innovative Use of NLP for Building Educational Applications
Automatic Web Pages Author Extraction

FQAS '09 Proceedings of the 8th International Conference on Flexible Query Answering Systems
ContentEx: a framework for automatic content extraction programs

ISI'09 Proceedings of the 2009 IEEE international conference on Intelligence and security informatics
Bridging the gap: from multi document Template Detection to single document Content Extraction

EuroIMSA '08 Proceedings of the IASTED International Conference on Internet and Multimedia Systems and Applications
Enhancing web page readability for non-native readers

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Clustering-based relevance feedback for web pages

PRICAI'06 Proceedings of the 9th Pacific Rim international conference on Artificial intelligence
Automatic document structure detection for data integration

BIS'07 Proceedings of the 10th international conference on Business information systems
Development of automatic web accessibility checking modules for advanced quality assurance tools

UAHCI'07 Proceedings of the 4th international conference on Universal access in human computer interaction: coping with diversity
CETR: content extraction via tag ratios

Proceedings of the 19th international conference on World wide web
An open source web browser for visually impaired

ICIC'07 Proceedings of the intelligent computing 3rd international conference on Advanced intelligent computing theories and applications
Blog post and comment extraction using information quantity of web format

AIRS'08 Proceedings of the 4th Asia information retrieval conference on Information retrieval technology
Adapting web content for low-literacy readers by using lexical elaboration and named entities labeling

Proceedings of the 2010 International Cross Disciplinary Conference on Web Accessibility (W4A)
An automatic HTTP cookie management system

Computer Networks: The International Journal of Computer and Telecommunications Networking
Adapting Web content for low-literacy readers by using lexical elaboration and named entities labeling

The New Review of Hypermedia and Multimedia - Web Accessibility
Find this for me: mobile information retrieval on the open web

Proceedings of the 16th international conference on Intelligent user interfaces
Link-based hidden attribute discovery for objects on Web

Proceedings of the 14th International Conference on Extending Database Technology
Generalized link suggestions via web site clustering

Proceedings of the 20th international conference on World wide web
Word clouds of multiple search results

IRFC'11 Proceedings of the Second international conference on Multidisciplinary information retrieval facility
Automating the selection of stories for AI in the news

IEA/AIE'11 Proceedings of the 24th international conference on Industrial engineering and other applications of applied intelligent systems conference on Modern approaches in applied intelligence - Volume Part I
DOM semantic expansion-based extraction of topical information from web pages

WISM'11 Proceedings of the 2011 international conference on Web information systems and mining - Volume Part II
Using main content extraction to improve performance of Vietnamese web page classification

Proceedings of the Second Symposium on Information and Communication Technology
A heuristic approach for topical information extraction from news pages

WISE'06 Proceedings of the 7th international conference on Web Information Systems
ESpotter: adaptive named entity recognition for web browsing

WM'05 Proceedings of the Third Biennial conference on Professional Knowledge Management
An effective web page layout adaptation for various resolutions

APWeb'06 Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development
RSS feed generation from legacy HTML pages

APWeb'06 Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development
User-centric adaptation of Web information for small screens

Journal of Visual Languages and Computing
Integrating data from the web by machine-learning tree-pattern queries

ODBASE'06/OTM'06 Proceedings of the 2006 Confederated international conference on On the Move to Meaningful Internet Systems: CoopIS, DOA, GADA, and ODBASE - Volume Part I
Towards understanding the functions of web element

AIRS'04 Proceedings of the 2004 international conference on Asian Information Retrieval Technology
Hybrid model of content extraction

Journal of Computer and System Sciences
MenuMiner: revealing the information architecture of large web sites by analyzing maximal cliques

Proceedings of the 21st international conference companion on World Wide Web
Advanced information retrieval from web pages

FDIA'07 Proceedings of the 1st BCS IRSG conference on Future Directions in Information Access
Automatic Extraction of Blog Post from Diverse Blog Pages

WI-IAT '12 Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Accessible online content creation by end users

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Learning to crawl deep web

Information Systems

Abstract

Web pages often contain clutter (such as pop-up ads, unnecessary images, and extraneous links) around the body of an article that distracts the user from the actual content. Extracting "useful and relevant" content from web pages has many applications, including cell phone and PDA browsing, speech rendering for the visually impaired, and text summarization. Most approaches to removing clutter or making content more readable involve changing the font size or removing HTML and data components such as images, which takes away from a webpage's inherent look and feel. Unlike "Content Reformatting", which aims to reproduce the entire webpage in a more convenient form, our solution directly addresses "Content Extraction". We have developed a framework that employs an easily extensible set of techniques incorporating the advantages of previous work on content extraction. Our key insight is to work with DOM trees rather than with raw HTML markup. We have implemented our approach in a publicly available Web proxy that extracts content from HTML web pages.
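
The idea of operating on a DOM tree rather than on raw markup can be illustrated with a minimal sketch. It is not the authors' implementation: the tag sets, the 0.5 link-ratio threshold, and all names (Node, TreeBuilder, prune, extract_content) are assumptions made for illustration, and only Python's standard html.parser module is used to build a simplified tree.

```python
from html.parser import HTMLParser

# Assumed for illustration only; the abstract does not list specific tags.
CLUTTER_TAGS = {"script", "style", "iframe", "img", "embed", "object"}
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link", "embed"}
BLOCK_TAGS = {"div", "td", "ul", "table"}


class Node:
    """A simplified DOM node: a tag name plus child nodes and text strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a small tree of Node objects from HTML markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def text_length(node):
    return sum(len(c) if isinstance(c, str) else text_length(c)
               for c in node.children)


def link_text_length(node):
    if node.tag == "a":
        return text_length(node)
    return sum(link_text_length(c) for c in node.children
               if isinstance(c, Node))


def prune(node, max_link_ratio=0.5):
    """Drop clutter tags and block elements whose text is mostly links."""
    kept = []
    for child in node.children:
        if isinstance(child, Node):
            if child.tag in CLUTTER_TAGS:
                continue
            total = text_length(child)
            if (child.tag in BLOCK_TAGS and total
                    and link_text_length(child) / total > max_link_ratio):
                continue
            prune(child, max_link_ratio)
        kept.append(child)
    node.children = kept


def extract_text(node):
    parts = [c if isinstance(c, str) else extract_text(c)
             for c in node.children]
    return " ".join(p for p in parts if p)


def extract_content(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    return extract_text(builder.root)
```

Because each filter is a decision about a subtree rather than a pattern over a string, further filters can be added to prune without touching the parser, which is one way to read the "easily extensible" claim above. A real proxy would also need to serialize the pruned tree back to HTML, which this sketch omits.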