Using HMM to learn user browsing patterns for focused web crawling

Authors:
Hongyu Liu;Jeannette Janssen;Evangelos Milios
Affiliations:
Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada;Department of Mathematics and Statistics, Dalhousie University, Halifax, NS, Canada;Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada
Venue:
Data & Knowledge Engineering - Special issue: WIDM 2004
Year:
2006


Abstract

A focused crawler is designed to traverse the Web to gather documents on a specific topic. It can be used to build domain-specific Web search portals and online personalized search tools. To estimate the relevance of a newly seen URL, it must use information gleaned from previously crawled page sequences.

In this paper, we present a new approach for predicting the links that lead to relevant pages, based on a Hidden Markov Model (HMM). The system consists of three stages: user data collection, user modelling via sequential pattern learning, and focused crawling. In particular, we first collect the Web pages visited during a user browsing session. These pages are clustered, and the link structure among pages from different clusters is then used to learn page sequences that are likely to lead to target pages. The learning is performed using an HMM. During crawling, the priority of each link to follow is based on a learned estimate of how likely the page is to lead to a target page. We compare the performance of this approach with Context-Graph crawling and Best-First crawling; in our experiments, it performs better than both.
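
The crawling stage described in the abstract can be illustrated with a minimal sketch. The abstract does not give the model's states, parameters, or scoring formula, so everything below is an assumption made for illustration: hidden states stand for the number of hops to a target page (0, 1, 2 or more), observations are cluster ids of visited pages, and the matrices TRANSITION, EMISSION and INITIAL are hypothetical toy values rather than learned ones. The sketch runs the standard HMM forward step along each path and ranks links by the estimated probability that the next page is in the target state.

```python
import heapq

# Hypothetical toy parameters (NOT from the paper).
# Hidden state i = assumed number of hops to a target page (0, 1, 2+).
# Observation k = cluster id of a visited page (0, 1, 2).
TRANSITION = [[0.6, 0.3, 0.1],
              [0.5, 0.3, 0.2],
              [0.1, 0.4, 0.5]]
EMISSION = [[0.8, 0.1, 0.1],
            [0.2, 0.6, 0.2],
            [0.1, 0.2, 0.7]]
INITIAL = [1 / 3, 1 / 3, 1 / 3]


def forward_step(belief, transition, emission, obs):
    """One HMM forward step: predict, weight by emission, normalize."""
    n = len(belief)
    predicted = [sum(belief[i] * transition[i][j] for i in range(n))
                 for j in range(n)]
    weighted = [predicted[j] * emission[j][obs] for j in range(n)]
    total = sum(weighted)
    # Fall back to the prediction if the observation has zero likelihood.
    return [w / total for w in weighted] if total > 0 else predicted


def link_priority(belief, transition):
    """Estimated probability that the next hidden state is the target state 0."""
    return sum(belief[i] * transition[i][0] for i in range(len(belief)))


def rank_links(paths):
    """Order URLs by priority, highest first.

    paths maps each candidate URL to the list of cluster ids observed
    along the crawled path that led to it.
    """
    queue = []
    for url, clusters in paths.items():
        belief = INITIAL
        for obs in clusters:
            belief = forward_step(belief, TRANSITION, EMISSION, obs)
        # heapq is a min-heap, so push the negated priority.
        heapq.heappush(queue, (-link_priority(belief, TRANSITION), url))
    return [heapq.heappop(queue)[1] for _ in range(len(queue))]
```

For example, `rank_links({"a": [2, 2], "b": [1, 0]})` places "b" ahead of "a", because under these toy parameters a path ending in cluster 0 makes the target state more probable. In a real system the parameters would be estimated from the clustered user browsing sessions rather than fixed by hand.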