This paper proposes an approach for finding a single web page in a large web site or a cloud of web pages. We formalize this problem and map it to the exact match on rare item searches (EMRIS). The EMRIS has received little attention in the literature, but many closely related problems exist. This paper presents a state-of-the-art survey of related problems in the fields of information retrieval, web page classification and directed search. As a solution to the EMRIS, this paper presents an innovative algorithm called the lost sheep. The lost sheep is specifically designed to work in web sites consisting of links, link texts and web pages. It works as a pre-classifier on link texts to decide whether a web page is a candidate for further evaluation. This paper also defines sound metrics to evaluate the EMRIS. The lost sheep outperforms all comparable algorithms both in maximizing accuracy and in minimizing the number of downloaded pages.
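
The idea of pre-classifying link texts before downloading a page can be illustrated with a small sketch. This is not the lost sheep algorithm itself, whose details the abstract does not give; it is a plain best-first search under stated assumptions. The in-memory `site` dictionary stands in for real HTTP downloads, and the names `link_text_score`, `find_page`, `threshold` and `budget`, as well as the term-overlap scoring rule, are all hypothetical. The function reports both whether the target was found and how many pages were downloaded, the two quantities the abstract evaluates.

```python
import heapq


def link_text_score(link_text, target_terms):
    """Fraction of target terms found in the link text (hypothetical rule)."""
    words = set(link_text.lower().split())
    if not target_terms:
        return 0.0
    return len(words & target_terms) / len(target_terms)


def find_page(site, start, target_terms, is_target, threshold=0.0, budget=50):
    """Best-first search for one page; returns (url or None, pages downloaded)."""
    terms = {t.lower() for t in target_terms}
    frontier = [(0.0, start)]  # min-heap of (negated score, url)
    seen = {start}
    downloads = 0
    while frontier and downloads < budget:
        _, url = heapq.heappop(frontier)
        page = site[url]  # stands in for downloading the page
        downloads += 1
        if is_target(page):
            return url, downloads
        for text, href in page["links"]:
            if href in seen or href not in site:
                continue
            score = link_text_score(text, terms)
            # Pre-classification: only the link text is inspected here,
            # so rejected links never cost a download.
            if score >= threshold:
                seen.add(href)
                heapq.heappush(frontier, (-score, href))
    return None, downloads


# Toy site (assumed data): each page has a body and (link text, href) pairs.
site = {
    "/": {"body": "home", "links": [("About us", "/about"),
                                    ("Contact information", "/contact"),
                                    ("News", "/news")]},
    "/about": {"body": "about", "links": [("Staff", "/staff")]},
    "/contact": {"body": "contact phone email", "links": []},
    "/news": {"body": "news", "links": []},
    "/staff": {"body": "staff", "links": []},
}

result = find_page(site, "/", ["contact"], lambda p: "email" in p["body"])
print(result)
```

In this toy setting a higher `threshold` prunes more links and saves downloads, but it can also cut off the only path to the target, which is the accuracy versus download-cost trade-off the abstract refers to.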