Web pages often contain clutter (such as pop-up ads, unnecessary images, and extraneous links) around the body of an article that distracts the user from the actual content. Extracting "useful and relevant" content from web pages has many applications, including cell phone and PDA browsing, speech rendering for the visually impaired, and text summarization. Most approaches to removing clutter or making content more readable involve changing the font size or removing HTML and data components such as images, which detracts from a webpage's inherent look and feel. Unlike "Content Reformatting", which aims to reproduce the entire webpage in a more convenient form, our solution directly addresses "Content Extraction". We have developed a framework that employs an easily extensible set of techniques and incorporates the advantages of previous work on content extraction. Our key insight is to work with DOM trees rather than with raw HTML markup. We have implemented our approach in a publicly available Web proxy that extracts content from HTML web pages.
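To make the DOM-tree idea concrete, here is a minimal Python sketch of what a chain of tree-level filters might look like. It is an illustration under stated assumptions, not the framework or proxy described above: the `Node` and `TreeBuilder` classes, the two filters, the set of removed tags, and the 0.5 link-to-text threshold are all hypothetical choices made for this example. It uses only the standard-library `html.parser` module to build a simplified tree.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (assumed subset, for illustration).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """Simplified DOM node: children are Node objects or text strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def text(self):
        return "".join(c if isinstance(c, str) else c.text()
                       for c in self.children)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def remove_tags(node, tags=frozenset({"script", "style", "img", "iframe"})):
    """Hypothetical filter: drop whole subtrees rooted at unwanted tags."""
    node.children = [c for c in node.children
                     if isinstance(c, str) or c.tag not in tags]
    for child in node.children:
        if not isinstance(child, str):
            remove_tags(child, tags)


def link_text_length(node):
    """Number of text characters that sit inside <a> elements."""
    total = 0
    for child in node.children:
        if isinstance(child, str):
            continue
        if child.tag == "a":
            total += len(child.text().strip())
        else:
            total += link_text_length(child)
    return total


def remove_link_lists(node, max_ratio=0.5,
                      containers=("ul", "ol", "table", "div")):
    """Hypothetical filter: drop containers whose text is mostly links."""
    kept = []
    for child in node.children:
        if not isinstance(child, str) and child.tag in containers:
            total = len(child.text().strip())
            if total and link_text_length(child) / total > max_ratio:
                continue  # treat as navigation clutter
        kept.append(child)
    node.children = kept
    for child in node.children:
        if not isinstance(child, str):
            remove_link_lists(child, max_ratio, containers)


# Adding a technique means appending one more function to this list.
FILTERS = [remove_tags, remove_link_lists]


def extract_content(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    for apply_filter in FILTERS:
        apply_filter(builder.root)
    return " ".join(builder.root.text().split())
```

Because each filter operates on subtrees rather than on markup strings, an inline link inside a paragraph survives while a navigation list made up entirely of links is removed as a unit. A pattern match on raw HTML would have difficulty drawing that distinction reliably.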