Probabilistic models for focused web crawling

  • Authors:
  • Hongyu Liu; Evangelos Milios; Jeannette Janssen

  • Affiliations:
  • Dalhousie University (all authors)

  • Venue:
  • Proceedings of the 6th Annual ACM International Workshop on Web Information and Data Management (WIDM)
  • Year:
  • 2004


Abstract

A focused crawler must use information gleaned from previously crawled page sequences to estimate the relevance of a newly seen URL. Good performance therefore depends on powerful modelling of context as well as of the current observations. Probabilistic models, such as Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs), can potentially capture both formatting and context. In this paper, we present the use of HMMs for focused web crawling and compare it with the Best-First strategy. Furthermore, we discuss how CRFs can overcome the difficulties of HMMs and support the use of many arbitrary, overlapping features. Finally, we describe the design of a system applying CRFs to focused web crawling, which is currently being implemented.
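
To make the HMM-based scoring idea concrete, here is a minimal sketch, not the paper's actual model: hidden states represent how many hops a page is from a relevant target, the forward algorithm turns the sequence of pages leading to a URL into a relevance score, and a best-first frontier pops the highest-scoring URL. The three-state topology, the transition and emission matrices, and the helpers relevance_score and enqueue are all hypothetical placeholders chosen for illustration.

```python
import heapq
import numpy as np

# Hypothetical HMM. Hidden states = hops from a relevant target page
# (0 = target itself). All numbers below are illustrative placeholders,
# not values from the paper.
T = np.array([[0.7, 0.2, 0.1],   # P(next state | current state)
              [0.5, 0.3, 0.2],
              [0.1, 0.4, 0.5]])
E = np.array([[0.8, 0.2],        # P(observation | state);
              [0.4, 0.6],        # observation 0 = "on-topic page",
              [0.1, 0.9]])       # observation 1 = "off-topic page"
prior = np.array([0.2, 0.3, 0.5])

def relevance_score(observations):
    """Forward algorithm: posterior probability that the crawl path
    currently sits at a target page (state 0), given the observed
    on/off-topic labels of the pages leading to this URL."""
    alpha = prior * E[:, observations[0]]
    for obs in observations[1:]:
        alpha = (alpha @ T) * E[:, obs]
    alpha /= alpha.sum()
    return alpha[0]

# Best-first frontier: a max-heap (negated scores) over candidate URLs.
frontier = []

def enqueue(url, observations):
    heapq.heappush(frontier, (-relevance_score(observations), url))

enqueue("http://example.org/a", [1, 0, 0])  # path drifting toward on-topic pages
enqueue("http://example.org/b", [0, 1, 1])  # path drifting off-topic
print(heapq.heappop(frontier))              # /a (score ~0.79) is crawled first
```

The contrast with a plain Best-First strategy is in the scoring: Best-First ranks a URL by the current page alone, whereas the HMM score depends on the whole observed path, so a URL reached through a promising sequence of pages outranks one with a similar-looking parent but a poor history.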