A focused crawler must use information gleaned from previously crawled page sequences to estimate the relevance of a newly seen URL. Good performance therefore depends on powerful modelling of context as well as of the current observations. Probabilistic models such as Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs) can potentially capture both formatting and context. In this paper, we present the use of HMMs for focused web crawling and compare it with the Best-First strategy. Furthermore, we discuss the concept of using CRFs to overcome the difficulties with HMMs and to support the use of many arbitrary, overlapping features. Finally, we describe the design of a system, currently being implemented, that applies CRFs to focused web crawling.
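For readers unfamiliar with the Best-First baseline mentioned above, the following is a minimal sketch of one common formulation, not the implementation evaluated in the paper. It assumes that each unvisited URL inherits the relevance score of the page that linked to it, and it replaces real page fetching and text scoring with two hypothetical in-memory dictionaries, `graph` and `scores`. The function name `best_first_crawl` and its parameters are likewise illustrative.

```python
import heapq


def best_first_crawl(graph, scores, seeds, budget):
    """Visit up to `budget` pages, expanding the highest-priority URL first.

    graph:  dict mapping URL -> list of outlinked URLs (stand-in for fetching)
    scores: dict mapping URL -> relevance of that page's content in [0, 1]
            (stand-in for a text classifier or similarity measure)
    """
    # heapq is a min-heap, so priorities are negated. The counter breaks
    # ties in insertion order.
    frontier = []
    counter = 0
    for url in seeds:
        heapq.heappush(frontier, (-1.0, counter, url))
        counter += 1

    queued = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, _, url = heapq.heappop(frontier)
        visited.append(url)
        # Assumption: a link's priority is the relevance of its parent page.
        parent_score = scores.get(url, 0.0)
        for link in graph.get(url, []):
            if link not in queued:
                queued.add(link)
                heapq.heappush(frontier, (-parent_score, counter, link))
                counter += 1
    return visited
```

The priority here depends only on the immediate parent page. A sequence model such as an HMM or CRF would instead derive the priority from the path of pages leading to the link, which is the contextual information the abstract argues a focused crawler needs.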