In this paper, we present an approach to crawling and parsing websites based on their logical structure rather than on random exploration. In this approach, we use a set of constraints to identify web pages and their components. To enforce these constraints, we present a set of primitives that rely on predicate verification. Our model is flexible enough to reflect the tree-like logical structures of websites, and it therefore avoids the need for complex information analysis and content classification techniques. Furthermore, because the model is implemented as a domain-specific language (DSL), describing crawling tasks is straightforward. Using this DSL, we developed and deployed a prototype of a dynamic web application with full-text search capabilities that periodically crawls, parses, and analyses the content of selected online newspapers. A set of experiments and comparisons highlights the effectiveness of the proposed crawling approach.
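To make the idea of predicate-based constraints over a tree-like site structure concrete, the following is a minimal sketch in Python. It is not the DSL described in the paper, whose syntax and primitives are not given here. The names `Rule`, `crawl`, `SITE`, and the `css_class` field are all hypothetical, and the site is an in-memory dictionary rather than fetched pages. Each rule carries a predicate, and a link is followed only when the target page satisfies the predicate of a child rule.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical in-memory site: URL -> page record. A real crawler would
# fetch and parse these pages; "css_class" stands in for markup features.
SITE: Dict[str, dict] = {
    "/": {"css_class": "front", "links": ["/politics", "/sports", "/about"]},
    "/politics": {"css_class": "section", "links": ["/politics/a1", "/"]},
    "/sports": {"css_class": "section", "links": ["/sports/a2", "/login"]},
    "/politics/a1": {"css_class": "article", "links": []},
    "/sports/a2": {"css_class": "article", "links": []},
    "/about": {"css_class": "static", "links": []},
    "/login": {"css_class": "static", "links": []},
}

Predicate = Callable[[str, dict], bool]


@dataclass
class Rule:
    """One node of the logical site tree: a name, a predicate, child rules."""
    name: str
    matches: Predicate
    children: List["Rule"] = field(default_factory=list)


def crawl(url: str, rule: Rule, site: Dict[str, dict],
          found: Optional[List[Tuple[str, str]]] = None) -> List[Tuple[str, str]]:
    """Visit url only if it satisfies rule, then try child rules on its links."""
    if found is None:
        found = []
    page = site.get(url)
    if page is None or not rule.matches(url, page):
        return found  # constraint not satisfied: do not explore this branch
    found.append((rule.name, url))
    for link in page["links"]:
        for child in rule.children:
            crawl(link, child, site, found)
    return found


# Assumed three-level structure: front page -> sections -> articles.
article = Rule("article", lambda url, page: page["css_class"] == "article")
section = Rule("section", lambda url, page: page["css_class"] == "section", [article])
front = Rule("front", lambda url, page: url == "/", [section])

result = crawl("/", front, SITE)
for name, url in result:
    print(name, url)
```

In this sketch the rule tree bounds the exploration: `/about` and `/login` are never visited as content, and the back-link from `/politics` to `/` is ignored because the front page does not satisfy the `article` predicate. No content classifier is involved, which is the property the abstract attributes to the structure-driven approach.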