Learnable topic-specific web crawler

Authors:
A. Rungsawang;N. Angkawattanawit
Affiliations:
Department of Computer Engineering, Massive Information & Knowledge Engineering, Kasetsart University, 50 Pahol-Yothin Rd, Jatujak, Bangkok 10900, Thailand;Department of Computer Engineering, Massive Information & Knowledge Engineering, Kasetsart University, 50 Pahol-Yothin Rd, Jatujak, Bangkok 10900, Thailand
Venue:
Journal of Network and Computer Applications - Special issue on computational intelligence on the internet
Year:
2005


Abstract

A topic-specific web crawler collects web pages relevant to topics of interest from the Internet. Much previous research has focused on web page crawling algorithms. The main purpose of those algorithms is to gather as many relevant web pages as possible, and most of them detail only the approach for the first crawl. However, important questions have gone unaddressed, such as how the crawler performs during subsequent crawling attempts and whether it can learn from experience to crawl more relevant web pages incrementally. In this paper, we present an algorithm that covers both the first and the consecutive crawls. To make the next crawl more efficient, we use information from previous crawling attempts to build knowledge bases: starting URLs, topic keywords, and URL prediction. These knowledge bases form the experience of the learnable topic-specific web crawler and are used to produce better results in the next crawl. Preliminary evaluation shows that the proposed web crawler can learn from experience to better collect the web pages of interest during the early period of consecutive crawling attempts.
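The abstract names three knowledge bases (starting URLs, topic keywords, URL prediction) but does not say how they are built. The following is a minimal illustrative sketch, not the authors' method: it assumes a hypothetical crawl log of (url, page_text, relevance) records, picks the highest-scoring pages as starting URLs, takes the most frequent terms in relevant pages as topic keywords, and uses the last observed relevance of each URL as a crude stand-in for URL prediction. The function names, the threshold, and the scoring are all assumptions made for illustration.

```python
from collections import Counter
import re


def build_knowledge_bases(crawl_log, relevance_threshold=0.5,
                          n_seeds=2, n_keywords=2):
    """Derive three simple knowledge bases from an earlier crawl.

    crawl_log is a hypothetical list of (url, page_text, relevance) tuples;
    the record layout and the threshold are assumptions, not from the paper.
    """
    relevant = [rec for rec in crawl_log if rec[2] >= relevance_threshold]

    # Starting URLs: the highest-scoring relevant pages seen so far.
    by_score = sorted(relevant, key=lambda rec: rec[2], reverse=True)
    seeds = [url for url, _, _ in by_score[:n_seeds]]

    # Topic keywords: the most frequent terms in relevant pages
    # (no stop-word filtering, to keep the sketch short).
    counts = Counter()
    for _, text, _ in relevant:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    keywords = [word for word, _ in counts.most_common(n_keywords)]

    # URL prediction (stand-in): remember the relevance observed per URL.
    url_scores = {url: rel for url, _, rel in crawl_log}
    return seeds, keywords, url_scores


def order_frontier(frontier, url_scores, default=0.0):
    """Visit URLs with higher remembered relevance first; unseen URLs get default."""
    return sorted(frontier, key=lambda u: url_scores.get(u, default), reverse=True)
```

In a second crawl, the returned seeds would initialise the frontier and `order_frontier` would decide which queued URL to fetch next; a real system would replace the remembered scores with a learned predictor.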