A hybrid approach for extracting informative content from web pages

Authors:
Erdinç Uzun;Hayri Volkan Agun;Tarık Yerlikaya
Affiliations:
Namik Kemal University, Corlu Engineering Faculty, Computer Engineering Department, Çorlu, Tekirdağ, Turkey;Trakya University, Engineering and Architecture Faculty, Computer Engineering Department, Edirne, Turkey;Trakya University, Engineering and Architecture Faculty, Computer Engineering Department, Edirne, Turkey
Venue:
Information Processing and Management: an International Journal
Year:
2013

Abstract

Eliminating noisy information and extracting informative content have become important issues for web mining, search, and accessibility. This extraction process can employ automatic techniques or hand-crafted rules. Automatic extraction techniques rely on various machine learning methods, but applying them increases the time complexity of the extraction process. Conversely, extraction through hand-crafted rules is efficient because it uses only string manipulation functions, but preparing these rules is difficult and cumbersome for users. In this paper, we present a hybrid approach consisting of two steps that can invoke each other. The first step discovers informative content using Decision Tree Learning and creates rules from the results of this learning method. The second step extracts informative content using the rules obtained from the first step. If the second step does not return an extraction result, the first step is invoked. In our experiments, the first step achieves an accuracy of 95.76% in extracting the informative content. Moreover, 71.92% of the rules can be used in the extraction process, and rule-based extraction is approximately 240 times faster than the first step.
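
The control flow of the two steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the rule format (the literal opening tag of the content block), the per-site rule cache, and all names (apply_rule, HybridExtractor, toy_learner) are assumptions made for illustration, and toy_learner is a simple text-length heuristic standing in for the Decision Tree Learning step described above.

```python
import re
from typing import Callable, Dict, Optional

# Assumption: a rule is the literal opening tag of the content block,
# e.g. '<div id="story">'. The paper's actual rule format may differ.
Rule = str
Learner = Callable[[str], Optional[Rule]]


def apply_rule(html: str, rule: Rule) -> Optional[str]:
    """Step 2: cut out the block opened by `rule` with string operations."""
    start = html.find(rule)
    if start == -1:
        return None  # the rule does not match this page
    tag = rule[1:].split()[0].rstrip(">")
    depth = 0
    # Count opening and closing tags of the same name to find the matching close.
    for m in re.finditer(r"<(/?)%s\b[^>]*>" % re.escape(tag), html[start:]):
        depth += -1 if m.group(1) else 1
        if depth == 0:
            return html[start:start + m.end()]
    return None  # unbalanced markup


def toy_learner(html: str) -> Optional[Rule]:
    """Stand-in for step 1 (NOT a decision tree): pick the innermost <div>
    with the most text and return its opening tag as the rule."""
    best, best_len = None, 0
    for m in re.finditer(r"<div\b[^>]*>", html):
        block = apply_rule(html[m.start():], m.group(0))
        if block is None or block.count("<div") > 1:
            continue  # skip wrappers that contain other divs
        text_len = len(re.sub(r"<[^>]+>", "", block))
        if text_len > best_len:
            best, best_len = m.group(0), text_len
    return best


class HybridExtractor:
    """Try the cached rule first; invoke the learner only when it fails."""

    def __init__(self, learner: Learner) -> None:
        self.learner = learner
        self.rules: Dict[str, Rule] = {}  # hypothetical cache: one rule per site
        self.learner_calls = 0

    def extract(self, site: str, html: str) -> Optional[str]:
        rule = self.rules.get(site)
        if rule is not None:
            content = apply_rule(html, rule)
            if content is not None:
                return content  # fast path: rule-based extraction succeeded
        # No rule yet, or the rule returned nothing: fall back to step 1.
        self.learner_calls += 1
        rule = self.learner(html)
        if rule is None:
            return None
        self.rules[site] = rule
        return apply_rule(html, rule)
```

In this sketch the learner runs once per site layout; later pages from the same site are handled by the cached rule until it stops matching, which mirrors the fallback described in the abstract.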