Most search engines today rely on keywords provided by users. However, keywords may not adequately represent the main topic of a web page. When searching for a topic, users express the desired topic as keywords, and keyword-based search engines return pages that contain those keywords even when the pages are not about the topic. This limits the effectiveness of these engines, since they may return irrelevant results. In this paper, we present an approach to improving the quality of search engines by focusing on web pages related to specific topics. Our system has three main components: a crawler that gathers web pages, a classifier that assigns web pages to topics, and a hyperlink filter (or distiller) that filters hyperlinks. We propose Naïve Bayes algorithms for the classifier and the distiller to improve the accuracy of the system. We also implement the system and evaluate its efficiency by gathering web pages on two topics: Artificial Intelligence and Motorcycle. The results show that our crawler is more efficient than crawlers that search by keywords.
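To make the classifier and distiller roles concrete, the following is a minimal sketch of a multinomial Naïve Bayes topic classifier with Laplace smoothing, reused to filter hyperlinks. It is not the implementation evaluated in the paper. The class and function names (`NaiveBayesTopicClassifier`, `filter_links`), the toy training texts, the example URLs, and the choice of anchor text as the distiller's input are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTopicClassifier:
    """Hypothetical multinomial Naive Bayes classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_scores(self, text):
        """Return log P(label) + sum of log P(token | label) for each label."""
        # Tokens never seen in training carry no evidence, so they are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # Laplace smoothing
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


def filter_links(classifier, links, target):
    """Keep URLs whose anchor text is classified as the target topic.

    Assumption: the distiller scores anchor text; the abstract does not say
    which features the hyperlink filter actually uses.
    """
    return [url for url, anchor in links if classifier.classify(anchor) == target]


# Toy training data (assumed), one document per topic.
clf = NaiveBayesTopicClassifier()
clf.train("neural networks machine learning search planning agents", "ai")
clf.train("motorcycle engine helmet racing tires", "motorcycle")

# Hypothetical outgoing links found on a crawled page: (url, anchor text).
links = [
    ("http://example.org/a", "machine learning tutorial"),
    ("http://example.org/b", "racing helmet reviews"),
]
kept = filter_links(clf, links, "ai")
```

In a focused crawler along these lines, only the URLs in `kept` would be added to the crawl frontier, so the crawler spends its fetches on links that are likely to lead to on-topic pages.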