Most HTML documents on the World Wide Web contain far more than the article or text that forms their main content. Navigation menus, functional and design elements, and commercial banners are typical examples of such additional content. Content Extraction is the process of identifying the main content and/or removing the additional content. We introduce content code blurring, a novel Content Extraction algorithm. Because the main text content is typically a long, homogeneously formatted region in a web document, the algorithm aims to identify exactly these regions in an iterative process. Comparing its performance with existing Content Extraction solutions, we show that for most documents content code blurring delivers the best results.
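
The idea of locating long, homogeneously formatted text regions iteratively can be illustrated with a small sketch. The abstract does not give the algorithm's details, so everything below is an assumption made for illustration: each character is coded as 1.0 (text) or 0.0 (markup), the code vector is smoothed by repeated moving-average passes, and text characters whose smoothed value stays above a threshold are kept. The function names and the parameters `radius`, `passes`, and `threshold` are hypothetical and not taken from the paper.

```python
import re


def content_code(html):
    """Return (chars, codes): one code per character, 1.0 for text, 0.0 for markup."""
    chars, codes = [], []
    # The capturing group makes re.split keep the tags as separate pieces.
    for piece in re.split(r"(<[^>]*>)", html):
        is_tag = piece.startswith("<") and piece.endswith(">")
        for ch in piece:
            chars.append(ch)
            codes.append(0.0 if is_tag else 1.0)
    return chars, codes


def blur(codes, radius):
    """One smoothing pass: replace each value with the mean of its window."""
    n = len(codes)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(codes[lo:hi]) / (hi - lo))
    return out


def extract_main_text(html, radius=20, passes=5, threshold=0.75):
    """Keep text characters that lie in regions dominated by text after blurring.

    Assumption: a fixed number of moving-average passes stands in for the
    iterative process mentioned in the abstract.
    """
    chars, codes = content_code(html)
    blurred = codes
    for _ in range(passes):
        blurred = blur(blurred, radius)
    kept = [
        ch
        for ch, code, value in zip(chars, codes, blurred)
        if code == 1.0 and value >= threshold
    ]
    return "".join(kept)
```

In this sketch, short link labels surrounded by markup are smoothed toward 0 and dropped, while a long run of text stays close to 1 and is kept. A side effect of the simple threshold is that a few characters at the edges of the main text are trimmed as well; a more careful implementation would need to handle region boundaries explicitly.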