Architecture for a parallel focused crawler for clickstream analysis

Authors:
Ali Selamat;Fatemeh Ahmadi-Abkenari
Affiliations:
UTM Knowledge Economy Research Alliance & Software Engineering Department, Faculty of Computer Science & Information Systems, Universiti Teknologi Malaysia, Johor, Malaysia;UTM Knowledge Economy Research Alliance & Software Engineering Department, Faculty of Computer Science & Information Systems, Universiti Teknologi Malaysia, Johor, Malaysia
Venue:
ACIIDS'11 Proceedings of the Third international conference on Intelligent information and database systems - Volume Part I
Year:
2011



Abstract

The tremendous growth of the Web poses many challenges for all-purpose single-process crawlers, including irrelevant answers among search results and coverage and scaling issues arising from the enormous size of the World Wide Web. Meanwhile, more advanced and convincing algorithms are in demand to yield more precise and relevant search results in a reasonable amount of time. Link-based Web page importance metrics are not a complete solution for identifying the best answer set, and employing them within a multi-process crawler imposes considerable communication overhead on the overall system. A link-independent Web page importance metric is therefore required to govern the priority rule within the queue of fetched URLs. The aim of this paper is to propose a modest weighted architecture for a focused structured parallel crawler in which credit is assigned to the discovered URLs using a combined metric based on clickstream analysis and on the text similarity of Web pages to the specified mapped topic(s).
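
The idea of a link-independent priority rule can be illustrated with a minimal sketch. The abstract does not give the actual formula, so everything below is an assumption made for illustration: the weighted sum, the weight `alpha`, the click normalization by `max_clicks`, the use of cosine similarity over raw term frequencies, and the names `combined_score` and `Frontier` are all hypothetical and are not taken from the paper.

```python
import heapq
import math
from collections import Counter


def cosine_similarity(text, topic):
    """Cosine similarity between the term-frequency vectors of two texts."""
    a, b = Counter(text.lower().split()), Counter(topic.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def combined_score(clicks, max_clicks, text, topic, alpha=0.5):
    """Assumed combination: weighted sum of click share and text similarity.

    alpha is a hypothetical weight; the paper's real weighting is not
    stated in the abstract.
    """
    click_share = clicks / max_clicks if max_clicks else 0.0
    return alpha * click_share + (1 - alpha) * cosine_similarity(text, topic)


class Frontier:
    """Queue of discovered URLs that pops the highest-scoring URL first."""

    def __init__(self):
        self._heap = []

    def push(self, url, score):
        # heapq is a min-heap, so the score is negated to pop the maximum.
        heapq.heappush(self._heap, (-score, url))

    def pop(self):
        return heapq.heappop(self._heap)[1]
```

Under these assumptions, a heavily clicked page whose text is unrelated to the topic can still rank below a less clicked page that matches the topic closely, and no link graph has to be exchanged between crawler processes to compute either score.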