Content Code Blurring: A New Approach to Content Extraction

Authors:
Thomas Gottron
Affiliations:
-
Venue:
DEXA '08 Proceedings of the 2008 19th International Conference on Database and Expert Systems Application
Year:
2008

Citing 0
Cited 9

Combining content extraction heuristics: the CombinE system
Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services

Estimating web site readability using content extraction
Proceedings of the 18th international conference on World wide web

CETR: content extraction via tag ratios
Proceedings of the 19th international conference on World wide web

Document word clouds: visualising web documents as tag clouds to aid users in relevance decisions
ECDL'09 Proceedings of the 13th European conference on Research and advanced technology for digital libraries

DOM based content extraction via text density
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

Hybrid model of content extraction
Journal of Computer and System Sciences

Editorial: Occupation inference through detection and classification of biographical activities
Data & Knowledge Engineering

TitleFinder: extracting the headline of news web pages based on cosine similarity and overlap scoring similarity
Proceedings of the twelfth international workshop on Web information and data management

Web news extraction via path ratios
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management

Abstract

Most HTML documents on the World Wide Web contain far more than the article or text which forms their main content. Navigation menus, functional and design elements or commercial banners are typical examples of additional contents. Content Extraction is the process of identifying the main content and/or removing the additional contents. We introduce content code blurring, a novel Content Extraction algorithm. As the main text content is typically a long, homogeneously formatted region in a web document, the aim is to identify exactly these regions in an iterative process. Comparing its performance with existing Content Extraction solutions, we show that for most documents content code blurring delivers the best results.
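
The abstract does not spell out the algorithm, so the following is only a rough sketch of one way the idea could be read, not the paper's method. It assumes a character-level vector that marks text as 1 and markup as 0, uses a plain moving average as a stand-in for whatever blurring filter the paper defines, and repeats the smoothing a fixed number of times. The function names (`content_code_vector`, `blur`, `extract_main_content`) and the parameters `radius`, `iterations` and `threshold` are all hypothetical.

```python
def content_code_vector(html):
    """Return 1.0 for each character of text and 0.0 for each character of markup."""
    vector = []
    in_tag = False
    for ch in html:
        if ch == "<":
            in_tag = True
        vector.append(0.0 if in_tag else 1.0)
        if ch == ">":
            in_tag = False
    return vector


def blur(vector, radius):
    """One smoothing pass: replace each value by the mean of its neighbourhood."""
    n = len(vector)
    blurred = []
    for i in range(n):
        window = vector[max(0, i - radius):min(n, i + radius + 1)]
        blurred.append(sum(window) / len(window))
    return blurred


def extract_main_content(html, radius=10, iterations=5, threshold=0.75):
    """Keep text characters that lie in regions still dominated by text after blurring.

    Assumption: a fixed iteration count and threshold; both are illustrative only.
    """
    ccv = content_code_vector(html)
    scores = ccv
    for _ in range(iterations):
        scores = blur(scores, radius)
    kept = [
        ch
        for ch, is_text, score in zip(html, ccv, scores)
        if is_text == 1.0 and score >= threshold
    ]
    return "".join(kept)


# Hypothetical page: short link texts surrounded by markup, one long text block.
page = (
    '<ul><li><a href="/a">Home</a></li><li><a href="/b">News</a></li></ul>'
    "<p>" + "This is the long main article text. " * 10 + "</p>"
    '<div><a href="/c">Ads</a></div>'
)
main_text = extract_main_content(page)
```

In this toy input the short link texts sit inside dense markup, so their blurred scores stay low, while the long paragraph keeps scores near 1 in its interior. Characters at the very edges of the paragraph may be clipped by the threshold, which is a limitation of this simplified sketch.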