This work addresses issues in the design and implementation of focused crawlers. It proposes several variants of state-of-the-art crawlers that estimate the relevance of web pages to a given topic from web page content and link information. Particular emphasis is given to crawlers capable of learning not only the content of relevant pages (as classic crawlers do) but also the paths leading to relevant pages. A novel learning crawler, inspired by a previously proposed Hidden Markov Model (HMM) crawler, is described as well. All crawlers are built on the same baseline implementation and differ only in their priority assignment function, which provides an unbiased evaluation framework for a comparative analysis of their performance. All crawlers achieve their maximum performance when a combination of web page content and (link) anchor text is used for assigning download priorities to web pages. Furthermore, the new HMM crawler improves on the performance of the original HMM crawler and also outperforms classic focused crawlers in searching for specialized topics.
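To make the idea of a shared baseline with a swappable priority assignment function concrete, the following is a minimal sketch and not the implementation evaluated in this work. It assumes a plain term-frequency vector space model with cosine similarity, and an equal weighting of page content and anchor text. The names `tf_vector`, `cosine`, `link_priority`, `Frontier`, the weight `w_content`, and the example URLs are all hypothetical and chosen only for illustration.

```python
import heapq
import math
import re
from collections import Counter


def tf_vector(text):
    """Lowercased bag-of-words term-frequency vector (illustrative assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def link_priority(topic, parent_text, anchor_text, w_content=0.5):
    """Hypothetical priority function: weighted mix of the similarity of the
    linking page's content and of the link's anchor text to the topic."""
    t = tf_vector(topic)
    content_score = cosine(t, tf_vector(parent_text))
    anchor_score = cosine(t, tf_vector(anchor_text))
    return w_content * content_score + (1.0 - w_content) * anchor_score


class Frontier:
    """Shared crawl frontier: a max-priority queue of unvisited URLs.
    Only the function that computes the priority would change per crawler."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, priority):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated priority.
            heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority


# Usage: two links found on the same page, with different anchor text.
topic = "focused web crawler"
page = "A survey of focused crawler designs for the web"
frontier = Frontier()
frontier.push("http://example.org/a", link_priority(topic, page, "focused crawler evaluation"))
frontier.push("http://example.org/b", link_priority(topic, page, "contact us"))
```

In this sketch both links inherit the same content score from the page they were found on, so the anchor text alone decides which one is downloaded first. A path-learning crawler such as the HMM variants would replace `link_priority` with a model-based estimate while keeping the frontier unchanged.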