A constrained crawling approach and its application to a specialised search engine

Authors: Mehdi Adda
Affiliations: Department of Computer Science, Engineering and Mathematics, University of Quebec at Rimouski, 300, allée des Ursulines, C.P. 3300, succ. A, Rimouski, Quebec, G5L 3A1, Canada
Venue: International Journal of Information and Communication Technology
Year: 2011

Abstract

In this paper, we present an approach to crawling and parsing websites based on their logical structure rather than on random exploration. In this approach, we use a set of constraints to identify web pages and their components. To enforce these constraints, we present a set of primitives that rely on predicate verification. Our model is flexible enough to reflect the tree-like logical structure of websites, which avoids the need for complex information analysis and content classification techniques. Furthermore, because the model is implemented as a domain-specific language (DSL), describing crawling tasks is straightforward. Using this DSL, we developed and deployed a prototype dynamic web application with full-text search capabilities that periodically crawls, parses, and analyses the content of selected online newspapers. A set of experiments and comparisons highlights the effectiveness of the proposed crawling approach.
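
The idea of constraining a crawl with predicates over a tree-like site structure can be illustrated with a minimal sketch. It is not the paper's DSL: the names `PageRule`, `url_matches`, and `constrained_crawl`, the URL-pattern predicates, and the in-memory toy site are all assumptions made for illustration. Each rule is a node in the logical tree, and a link is followed only when one of the current node's child rules accepts it.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple

# A predicate decides whether a URL belongs to a given logical page type.
Predicate = Callable[[str], bool]


def url_matches(pattern: str) -> Predicate:
    """Build a predicate that is true when the whole URL matches the regex."""
    compiled = re.compile(pattern)
    return lambda url: compiled.fullmatch(url) is not None


@dataclass
class PageRule:
    """One node of the logical tree: a page type plus its allowed child types."""
    name: str
    accepts: Predicate
    children: List["PageRule"] = field(default_factory=list)


def constrained_crawl(
    start_url: str,
    root: PageRule,
    links_of: Callable[[str], Iterable[str]],
) -> List[Tuple[str, str]]:
    """Depth-first crawl that follows a link only if a child rule accepts it.

    Returns (page type, URL) pairs in visiting order.
    """
    visited: List[Tuple[str, str]] = []
    seen = set()

    def visit(url: str, rule: PageRule) -> None:
        if url in seen:
            return
        seen.add(url)
        visited.append((rule.name, url))
        for link in links_of(url):
            # The first child rule whose predicate holds claims the link.
            for child in rule.children:
                if child.accepts(link):
                    visit(link, child)
                    break

    if root.accepts(start_url):
        visit(start_url, root)
    return visited


# Hypothetical toy site (URL -> outgoing links), standing in for real HTTP fetches.
SITE = {
    "/": ["/news", "/about", "/ads/1"],
    "/news": ["/news/101", "/news/102", "/"],
    "/news/101": ["/news", "/ads/2"],
    "/news/102": [],
}

# Logical structure: home -> section -> article.
article = PageRule("article", url_matches(r"/news/\d+"))
section = PageRule("section", url_matches(r"/news"), [article])
home = PageRule("home", url_matches(r"/"), [section])

result = constrained_crawl("/", home, lambda url: SITE.get(url, []))
for page_type, url in result:
    print(page_type, url)
```

In this toy run, the `/about` and `/ads/...` links are never followed because no rule accepts them, so the crawl stays on the home, section, and article pages without any content classification step.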