An approach to improving quality of crawlers using Naïve bayes for classifier and hyperlink filter

  • Authors:
  • Huu-Thien-Tan Nguyen;Duy-Khanh Le

  • Affiliations:
  • Yokogawa Electric International Pte. Ltd. Singapore, Singapore;National University of Singapore, Singapore

  • Venue:
  • ICCCI'12 Proceedings of the 4th international conference on Computational Collective Intelligence: technologies and applications - Volume Part I
  • Year:
  • 2012

Abstract

Most search engines today rely on keywords supplied by users. However, keywords may not adequately represent the main topic of a web page. When searching for a topic, users express it as a set of keywords, and keyword-based search engines return pages that contain those keywords even when the pages are not about the topic. This limits the effectiveness of these engines, as they may return undesirable results. In this paper, we present an approach to improving the quality of search engines by focusing on web pages related to specific topics. Our system has three main components: a crawler for gathering web pages, a classifier for assigning web pages to topics, and a hyperlink filter (or distiller) for filtering hyperlinks. We propose Naïve Bayes algorithms for the classifier and the distiller to enhance the accuracy of the system. We also implement the system and evaluate its efficiency by gathering web pages on two topics: Artificial Intelligence and Motorcycle. The results show that our crawler is more efficient than crawlers that search by keywords.
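The abstract does not give the details of the classifier or the distiller, so the following is only a minimal sketch of how a Naïve Bayes model might serve both roles. It assumes a multinomial model over word counts with add-one (Laplace) smoothing, and it assumes the distiller scores a hyperlink by its anchor text; neither choice is confirmed by the source. The names `NaiveBayesTopicClassifier`, `filter_links`, the labels `"ai"` and `"other"`, the toy training sentences, and the `example.org` URLs are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTopicClassifier:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.doc_counts = Counter()               # training documents per label
        self.word_counts = defaultdict(Counter)   # word frequencies per label
        self.vocabulary = set()                   # all words seen in training

    def train(self, text, label):
        self.doc_counts[label] += 1
        for word in tokenize(text):
            self.word_counts[label][word] += 1
            self.vocabulary.add(word)

    def log_scores(self, text):
        """Return log P(label) + sum of log P(word | label) for each label."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = tokenize(text)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)
            for word in words:
                count = self.word_counts[label][word]
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


def filter_links(classifier, links, topic):
    """Hypothetical distiller: keep URLs whose anchor text is classified as topic."""
    return [url for url, anchor in links if classifier.classify(anchor) == topic]


# Toy training data, invented for illustration.
clf = NaiveBayesTopicClassifier()
clf.train("neural network learning algorithm agent", "ai")
clf.train("machine learning search planning reasoning", "ai")
clf.train("engine exhaust helmet motorcycle tyres", "other")
clf.train("football score league match", "other")

links = [
    ("http://example.org/a", "machine learning tutorial"),
    ("http://example.org/b", "motorcycle exhaust shop"),
]
kept = filter_links(clf, links, "ai")
```

In a focused crawler built along these lines, only the URLs in `kept` would be added to the crawl frontier, so pages reached through off-topic links are never fetched. A real system would need far more training data and would probably use richer link features than anchor text alone.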