The main text content of an HTML document on the WWW is typically surrounded by additional content, such as navigation menus, advertisements, link lists or design elements. Content Extraction (CE) is the task of identifying and extracting this main content. Ongoing research has produced several CE heuristics of varying quality. So far, however, only the Crunch framework combines several heuristics to improve its overall CE performance, and many new algorithms have been formulated since Crunch. The CombinE system is designed to test, evaluate and optimise combinations of CE heuristics. Its aim is to develop CE systems that yield better and more reliable extracts of the main content of a web document.
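To illustrate what combining CE heuristics can look like, the following minimal sketch lets several heuristics vote on each text block of a page and keeps the blocks that reach a vote threshold. It is only an assumed example of one possible combination scheme (voting), not the method used by CombinE or Crunch. The three toy heuristics, the function `combine_by_vote` and the `min_votes` parameter are hypothetical names introduced here for illustration.

```python
from typing import Callable, List, Set

# A heuristic takes the text blocks of a page and returns the indices
# of the blocks it considers main content. (Assumed interface.)
Heuristic = Callable[[List[str]], Set[int]]


def long_block_heuristic(blocks: List[str]) -> Set[int]:
    """Hypothetical heuristic: keep blocks with at least 10 words."""
    return {i for i, b in enumerate(blocks) if len(b.split()) >= 10}


def punctuation_heuristic(blocks: List[str]) -> Set[int]:
    """Hypothetical heuristic: keep blocks containing '.', '!' or '?'."""
    return {i for i, b in enumerate(blocks) if any(ch in b for ch in ".!?")}


def no_separator_heuristic(blocks: List[str]) -> Set[int]:
    """Hypothetical heuristic: drop blocks that look like link lists ('|')."""
    return {i for i, b in enumerate(blocks) if "|" not in b}


def combine_by_vote(
    heuristics: List[Heuristic], blocks: List[str], min_votes: int
) -> List[str]:
    """Keep every block selected by at least `min_votes` heuristics."""
    votes = [0] * len(blocks)
    for heuristic in heuristics:
        for i in heuristic(blocks):
            votes[i] += 1
    return [b for b, v in zip(blocks, votes) if v >= min_votes]


HEURISTICS = [long_block_heuristic, punctuation_heuristic, no_separator_heuristic]

blocks = [
    "Home | News | Sport | Contact",
    "Content extraction identifies the main text of a web document "
    "and discards the surrounding clutter.",
    "Buy now",
    "Copyright 2009",
]

# Require agreement of at least two of the three heuristics.
extracted = combine_by_vote(HEURISTICS, blocks, min_votes=2)
print(extracted)
```

With a threshold of two votes, only the article sentence is kept in this toy input. Lowering `min_votes` trades precision for recall, which is the kind of parameter a system for evaluating combinations would need to tune.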