A solution to the exact match on rare item searches: introducing the lost sheep algorithm

Authors:
Morten Goodwin
Affiliations:
Tingtun AS, Lillesand, Norway
Venue:
Proceedings of the International Conference on Web Intelligence, Mining and Semantics
Year:
2011

Abstract

This paper proposes an approach for finding a single web page in a large web site or a cloud of web pages. We formalize this problem and map it to the exact match on rare item searches (EMRIS). EMRIS has received little attention in the literature, but many closely related problems exist. This paper presents a state-of-the-art survey of related problems in the fields of information retrieval, web page classification and directed search. As a solution to EMRIS, this paper presents an innovative algorithm called the lost sheep. The lost sheep is specifically designed to work in web sites consisting of links, link texts and web pages. It works as a pre-classifier on link texts to decide whether a web page is a candidate for further evaluation. This paper also defines sound metrics to evaluate EMRIS. The lost sheep outperforms all comparable algorithms both in maximizing accuracy and in minimizing the number of downloaded pages.
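
The idea of using link texts as a pre-classifier can be illustrated with a minimal sketch. This is not the lost sheep algorithm itself, whose details the abstract does not give; it is a plain best-first search that ranks unvisited pages by how well the link text pointing to them matches a set of query keywords, and counts how many pages are downloaded before the target is found. The names `link_text_score`, `find_page`, and `site`, the keyword-overlap scoring rule, and the toy site graph are all assumptions made for illustration.

```python
import heapq


def link_text_score(link_text, keywords):
    """Hypothetical scoring rule: fraction of keywords found in the link text."""
    if not keywords:
        return 0.0
    words = set(link_text.lower().split())
    return len(words & keywords) / len(keywords)


def find_page(site, start, keywords, is_target, max_downloads=100):
    """Best-first search guided by link texts.

    site maps a page id to a list of (link_text, target_page_id) pairs.
    Returns (found_page_or_None, number_of_downloaded_pages).
    """
    keywords = {k.lower() for k in keywords}
    # heapq is a min-heap, so scores are negated; the counter keeps
    # insertion order among links with equal scores.
    frontier = [(0.0, 0, start)]
    counter = 1
    seen = {start}
    downloads = 0
    while frontier and downloads < max_downloads:
        _, _, page = heapq.heappop(frontier)
        downloads += 1  # a real crawler would fetch the page here
        if is_target(page):
            return page, downloads
        for link_text, target in site.get(page, []):
            if target not in seen:
                seen.add(target)
                score = link_text_score(link_text, keywords)
                heapq.heappush(frontier, (-score, counter, target))
                counter += 1
    return None, downloads


# Toy site graph (an assumption, not data from the paper).
site = {
    "home": [
        ("About us", "about"),
        ("Products", "products"),
        ("Contact information", "contact"),
    ],
    "about": [("Our history", "history")],
    "products": [("Catalogue", "catalogue")],
    "contact": [],
}

result = find_page(site, "home", ["contact"], lambda p: p == "contact")
print(result)  # ('contact', 2)
```

In this toy graph the target is reached after two downloads, whereas a breadth-first crawl in link order would download four pages first. The two quantities the abstract names as evaluation criteria, whether the right page is found and how many pages are downloaded, correspond to the two values returned by `find_page`.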