Data model and query evaluation in global information systems
Journal of Intelligent Information Systems - Special issue: networked information discovery and retrieval
Ariadne: a system for constructing mediators for Internet sources
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
A Web-based information system that reasons with structured collections of text
AGENTS '98 Proceedings of the second international conference on Autonomous agents
Wrapper induction: efficiency and expressiveness
Artificial Intelligence - Special issue on Intelligent internet systems
Template detection via data mining and its applications
Proceedings of the 11th international conference on World Wide Web
Visualizing web site comparisons
Proceedings of the 11th international conference on World Wide Web
Hierarchical Wrapper Induction for Semistructured Information Sources
Autonomous Agents and Multi-Agent Systems
Wrapper Generation via Grammar Induction
ECML '00 Proceedings of the 11th European Conference on Machine Learning
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
Discovering informative content blocks from Web documents
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Eliminating noisy information in Web pages for data mining
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Automatic extraction of informative blocks from webpages
Proceedings of the 2005 ACM symposium on Applied computing
Web Contents Extracting for Web-Based Learning
ICWL '08 Proceedings of the 7th international conference on Advances in Web Based Learning
Combining content extraction heuristics: the CombinE system
Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services
Bridging the gap: from multi document Template Detection to single document Content Extraction
EuroIMSA '08 Proceedings of the IASTED International Conference on Internet and Multimedia Systems and Applications
CETR: content extraction via tag ratios
Proceedings of the 19th international conference on World wide web
Finding and using the content texts of HTML pages
AIRS'08 Proceedings of the 4th Asia information retrieval conference on Information retrieval technology
DOM based content extraction via text density
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Knowledge discovery in web-directories: finding term-relations to build a business ontology
EC-Web'05 Proceedings of the 6th international conference on E-Commerce and Web Technologies
Hybrid model of content extraction
Journal of Computer and System Sciences
Intelligent information processing systems, such as digital libraries and search engines, index web pages according to their informative content. However, web pages also contain a good deal of non-informative content, e.g., navigation sidebars, advertisements, and copyright notices. It is therefore important to separate the informative “primary content blocks” from these non-informative blocks. In this paper, two algorithms, FeatureExtractor and K-FeatureExtractor, are proposed to identify the “primary content blocks” based on their features. Neither algorithm requires supervised learning, yet both identify the “primary content blocks” with high precision and recall. On several thousand web pages obtained from 15 different websites, our algorithms significantly outperform the entropy-based algorithm proposed by Lin and Ho [14] in both precision and run time.