Discovering informative content blocks from Web documents
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
DOM-based content extraction of HTML documents
WWW '03 Proceedings of the 12th international conference on World Wide Web
Efficient Browsing of Web Search Results on Mobile Devices Based on Block Importance Model
PERCOM '05 Proceedings of the Third IEEE International Conference on Pervasive Computing and Communications
A Semantic-web based framework for developing applications to improve accessibility in the WWW
W4A '06 Proceedings of the 2006 international cross-disciplinary workshop on Web accessibility (W4A): Building the mobile web: rediscovering accessibility?
Combining content extraction heuristics: the CombinE system
Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services
CETR: content extraction via tag ratios
Proceedings of the 19th international conference on World wide web
DOM based content extraction via text density
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Using main content extraction to improve performance of Vietnamese web page classification
Proceedings of the Second Symposium on Information and Communication Technology
Hybrid model of content extraction
Journal of Computer and System Sciences
Automatic Extraction of Blog Post from Diverse Blog Pages
WI-IAT '12 Proceedings of the The 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Hi-index | 0.00 |
This short paper gives an overview of the principles behind an algorithm that separates the core-content of a web document from hyperlinked-clutter such as text advertisements and long links of syndicated references to other resources.Its advantage over other approaches is its ability to identify both loosely as well as tightly defined "table-like" or "list-like" structures of hyperlinks (from nested tables to simple, bullet-pointed lists) by operating at various levels within the DOM tree.The resulting data can then be used to extract the core-content from a web document for semantic analysis or other information retrieval purposes as well as to aid in the process of "clipping" a web document to its bare essentials for use with hardware-limited devices such as PDAs and cell phones.