Separating XHTML content from navigation clutter using DOM-structure block analysis

Authors:
Constantine Mantratzis;Mehmet Orgun;Steve Cassidy
Affiliations:
Macquarie University, Australia;Macquarie University, Australia;Macquarie University, Australia
Venue:
Proceedings of the sixteenth ACM conference on Hypertext and hypermedia
Year:
2005


Abstract

This short paper gives an overview of the principles behind an algorithm that separates the core content of a web document from hyperlinked clutter, such as text advertisements and long links of syndicated references to other resources. Its advantage over other approaches is its ability to identify both loosely and tightly defined "table-like" or "list-like" structures of hyperlinks (from nested tables to simple, bullet-pointed lists) by operating at various levels within the DOM tree. The resulting data can then be used to extract the core content from a web document for semantic analysis or other information retrieval purposes, as well as to aid in "clipping" a web document to its bare essentials for use with hardware-limited devices such as PDAs and cell phones.
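
The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a simple link-density heuristic: walk the DOM of a well-formed XHTML document, and at every level drop any block element whose visible text consists mostly of hyperlink text. The set of block tags, the 0.8 threshold, and all function names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Assumed for illustration: which elements count as candidate "blocks"
# and how link-heavy a block must be before it is treated as clutter.
BLOCK_TAGS = {"table", "ul", "ol", "div", "td"}
LINK_RATIO_THRESHOLD = 0.8


def text_length(elem):
    """Number of non-whitespace-padded characters of text under elem."""
    return sum(len(t.strip()) for t in elem.itertext())


def link_text_length(elem):
    """Characters of text that sit inside <a> elements under elem."""
    return sum(text_length(a) for a in elem.iter("a"))


def remove_keep_tail(parent, child):
    """Remove child but keep the text that follows it in the parent."""
    tail = child.tail or ""
    idx = list(parent).index(child)
    if idx > 0:
        prev = parent[idx - 1]
        prev.tail = (prev.tail or "") + tail
    else:
        parent.text = (parent.text or "") + tail
    parent.remove(child)


def prune_link_blocks(elem):
    """Recursively remove link-dominated blocks at every depth of the tree."""
    for child in list(elem):
        total = text_length(child)
        is_clutter = (
            child.tag in BLOCK_TAGS
            and total > 0
            and link_text_length(child) / total >= LINK_RATIO_THRESHOLD
        )
        if is_clutter:
            remove_keep_tail(elem, child)
        else:
            # Not clutter as a whole; look for clutter nested inside it.
            prune_link_blocks(child)


def core_text(xhtml):
    """Return the whitespace-normalised text left after pruning."""
    root = ET.fromstring(xhtml)
    prune_link_blocks(root)
    return " ".join("".join(root.itertext()).split())


SAMPLE = """<html><body>
<ul><li><a href="/a">Home</a></li><li><a href="/b">News</a></li></ul>
<div><p>The main article text with one <a href="/c">inline link</a> inside.</p></div>
<table><tr><td><a href="/x">Ad one</a></td><td><a href="/y">Ad two</a></td></tr></table>
</body></html>"""

if __name__ == "__main__":
    print(core_text(SAMPLE))
```

In this sketch the bullet-pointed navigation list and the table of text advertisements are removed, while the paragraph with a single inline link is kept because links make up only a small share of its text. The sample omits the XHTML namespace for brevity; a namespaced document would need the tag names qualified accordingly.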