Exploiting Multiple Features with MEMMs for Focused Web Crawling

Authors:
Hongyu Liu;Evangelos Milios;Larry Korba
Affiliations:
National Research Council Institute for Information Technology, Canada Faculty of Computer Science, Dalhousie University, Canada;National Research Council Institute for Information Technology, Canada Faculty of Computer Science, Dalhousie University, Canada;National Research Council Institute for Information Technology, Canada Faculty of Computer Science, Dalhousie University, Canada
Venue:
NLDB '08 Proceedings of the 13th international conference on Natural Language and Information Systems: Applications of Natural Language to Information Systems
Year:
2008

Abstract

Focused web crawling traverses the Web to collect documents on a specific topic. This is not an easy task, since a focused crawler must identify the next most promising link to follow based on the topic and on the content and links of previously crawled pages. In this paper, we present a framework based on Maximum Entropy Markov Models (MEMMs) for an enhanced focused web crawler. The framework takes advantage of richer representations of multiple features extracted from Web pages, such as anchor text and the keywords embedded in the link URL, to represent useful context. The key idea of our approach is to treat focused web crawling as a sequential task and to use a combination of content analysis and link structure to capture sequential patterns leading to targets. The experimental results show that the MEMM-based focused crawler is, in general, very competitive against Best-First crawling on Web data in terms of two metrics: Precision and Maximum Average Similarity.
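
As a rough illustration of how an MEMM can combine several link features, the sketch below computes the standard MEMM per-state distribution P(s' | s, o), proportional to exp(sum_i lambda_i f_i(o, s')), over a small set of states. It is not the authors' implementation: the state labels, the two binary features (anchor-text match and URL-keyword match), the topic terms, and all weight values are assumptions chosen only for the example. A real model would learn the weights from training data.

```python
import math
import re

# Hypothetical state labels (assumption): rough distance of a page from a target.
STATES = ["target", "near", "far"]


def link_features(anchor_text, url, topic_terms):
    """Binary features from anchor text and URL keywords (illustrative only)."""
    anchor_tokens = set(re.findall(r"[a-z0-9]+", anchor_text.lower()))
    url_tokens = set(re.findall(r"[a-z0-9]+", url.lower()))
    return {
        "anchor_match": 1.0 if anchor_tokens & topic_terms else 0.0,
        "url_match": 1.0 if url_tokens & topic_terms else 0.0,
    }


# Made-up weights keyed by (previous state, feature name, next state).
# These values are assumptions; missing keys default to 0.0.
WEIGHTS = {
    ("near", "anchor_match", "target"): 2.0,
    ("near", "url_match", "target"): 1.0,
    ("near", "anchor_match", "far"): -1.0,
}


def transition_probs(prev_state, features, weights=WEIGHTS):
    """Return P(next_state | prev_state, observation) as a normalized softmax."""
    scores = {
        s: sum(weights.get((prev_state, name, s), 0.0) * value
               for name, value in features.items())
        for s in STATES
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {s: math.exp(v - top) for s, v in scores.items()}
    z = sum(exps.values())
    return {s: e / z for s, e in exps.items()}


if __name__ == "__main__":
    topic = {"crawler", "crawling"}  # hypothetical topic terms
    feats = link_features("Focused crawling survey",
                          "http://example.org/crawling/index.html", topic)
    print(transition_probs("near", feats))
```

A crawler built along these lines could rank outgoing links by the probability of reaching the "target" state, whereas a Best-First crawler ranks links by a single similarity score.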