Combining content extraction heuristics: the CombinE system

Authors:
Thomas Gottron
Affiliations:
Johannes Gutenberg-Universität Mainz, Mainz, Germany
Venue:
Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services
Year:
2008

Abstract

The main text content of an HTML document on the WWW is typically surrounded by additional content, such as navigation menus, advertisements, link lists or design elements. Content Extraction (CE) is the task of identifying and extracting this main content. Ongoing research has produced several CE heuristics of varying quality. So far, however, only the Crunch framework combines several heuristics to improve its overall CE performance, and many new algorithms have been formulated since Crunch was introduced. The CombinE system is designed to test, evaluate and optimise combinations of CE heuristics. Its aim is to develop CE systems that yield better and more reliable extracts of the main content of a web document.
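
To make the idea of combining CE heuristics concrete, the following is a minimal sketch and not the CombinE implementation, which the abstract does not describe. It assumes a document has already been split into text blocks and that each heuristic returns one keep/discard decision per block. The three toy heuristics, the function names and the voting rule are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence

# Assumption: a heuristic maps text blocks to one keep/discard flag per block.
Heuristic = Callable[[Sequence[str]], List[bool]]


def long_block(blocks: Sequence[str], min_words: int = 8) -> List[bool]:
    """Toy heuristic: keep blocks with at least min_words words."""
    return [len(b.split()) >= min_words for b in blocks]


def has_sentence_punctuation(blocks: Sequence[str]) -> List[bool]:
    """Toy heuristic: keep blocks containing sentence punctuation."""
    return [any(ch in b for ch in ".!?") for b in blocks]


def no_menu_separator(blocks: Sequence[str]) -> List[bool]:
    """Toy heuristic: discard blocks containing a '|' menu separator."""
    return ["|" not in b for b in blocks]


def combine_by_vote(
    blocks: Sequence[str],
    heuristics: Sequence[Heuristic],
    min_votes: int,
) -> List[str]:
    """Keep every block that at least min_votes heuristics want to keep."""
    votes = [0] * len(blocks)
    for heuristic in heuristics:
        for i, keep in enumerate(heuristic(blocks)):
            votes[i] += int(keep)
    return [b for b, v in zip(blocks, votes) if v >= min_votes]


# Hypothetical input: four blocks from an imagined web page.
blocks = [
    "Home | News | Contact",
    "Content extraction identifies the main text of a web document "
    "and discards the rest.",
    "Advertisement",
    "Buy now!",
]
heuristics = [long_block, has_sentence_punctuation, no_menu_separator]

strict = combine_by_vote(blocks, heuristics, min_votes=3)
lenient = combine_by_vote(blocks, heuristics, min_votes=2)
print(strict)
print(lenient)
```

In this sketch, requiring all three votes keeps only the long sentence, while lowering the threshold to two also lets the short "Buy now!" block through. Tuning such a threshold against labelled pages is one simple way a combination could be evaluated and optimised.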