C4.5: programs for machine learning
C4.5: programs for machine learning
Machine Learning - Special issue on learning with probabilistic representations
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Function-based object model towards website adaptation
Proceedings of the 10th international conference on World Wide Web
Template detection via data mining and its applications
Proceedings of the 11th international conference on World Wide Web
Discovering informative content blocks from Web documents
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Improving pseudo-relevance feedback in web information retrieval using web page segmentation
WWW '03 Proceedings of the 12th international conference on World Wide Web
Detecting web page structure for adaptive viewing on small form factor devices
WWW '03 Proceedings of the 12th international conference on World Wide Web
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources
ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Eliminating noisy information in Web pages for data mining
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Extracting unstructured data from template generated web documents
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Learning block importance models for web pages
Proceedings of the 13th international conference on World Wide Web
Output-Sensitive Algorithms for Computing Nearest-Neighbour Decision Boundaries
Discrete & Computational Geometry
Automatic Identification of Informative Sections of Web Pages
IEEE Transactions on Knowledge and Data Engineering
Proceedings of the 15th international conference on World Wide Web
A fast and robust method for web page template detection and removal
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Web page title extraction and its application
Information Processing and Management: an International Journal
Page-level template detection via isotonic smoothing
Proceedings of the 16th international conference on World Wide Web
Searching strategies for the Bulgarian language
Information Retrieval
Searching strategies for the Hungarian language
Information Processing and Management: an International Journal
Adaptive web-page content identification
Proceedings of the 9th annual ACM international workshop on Web information and data management
Information retrieval on Turkish texts
Journal of the American Society for Information Science and Technology
A graph-theoretic approach to webpage segmentation
Proceedings of the 17th international conference on World Wide Web
A densitometric approach to web page segmentation
Proceedings of the 17th ACM conference on Information and knowledge management
A densitometric analysis of web template content
Proceedings of the 18th international conference on World wide web
Web page cleaning for web mining through feature weighting
IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence
Boilerplate detection using shallow text features
Proceedings of the third ACM international conference on Web search and data mining
Extracting content structure for web pages based on visual representation
APWeb'03 Proceedings of the 5th Asia-Pacific web conference on Web technologies and applications
Comparing Bayesian network classifiers
UAI'99 Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence
Leveraging spatial join for robust tuple extraction from web pages
Information Sciences: an International Journal
Hi-index | 0.00 |
Eliminating noisy information and extracting informative content have become important issues for web mining, search and accessibility. This extraction process can employ automatic techniques and hand-crafted rules. Automatic extraction techniques focus on various machine learning methods, but implementing these techniques increases time complexity of the extraction process. Conversely, extraction through hand-crafted rules is an efficient technique that uses string manipulation functions, but preparing these rules is difficult and cumbersome for users. In this paper, we present a hybrid approach that contains two steps that can invoke each other. The first step discovers informative content using Decision Tree Learning as an appropriate machine learning method and creates rules from the results of this learning method. The second step extracts informative content using rules obtained from the first step. However, if the second step does not return an extraction result, the first step gets invoked. In our experiments, the first step achieves high accuracy with 95.76% in extraction of the informative content. Moreover, 71.92% of the rules can be used in the extraction process, and it is approximately 240 times faster than the first step.