Ranking bias in deep web size estimation using capture recapture method

Authors:
Jianguo Lu
Affiliations:
School of Computer Science, University of Windsor, 401 Sunset Avenue, Windsor, Ontario, Canada
Venue:
Data & Knowledge Engineering
Year:
2010



Abstract

Many deep web data sources are ranked data sources: they rank the matched documents and return at most the top k results, even when more than k documents match the query. When estimating the size of such a ranked deep web data source, a ranking bias is well known: traditional methods tend to underestimate the size when queries overflow (match more documents than the return limit). Numerous estimation methods have been proposed to overcome the ranking bias, for example by avoiding overflowing queries during sampling or by adjusting the initial estimate with a fixed function. We observe that the overflow rate has a direct impact on the accuracy of the estimate. Under certain conditions, the actual size is close to the estimate obtained by the unranked model multiplied by the overflow rate. Based on this result, this paper proposes a method that allows overflowing queries in the sampling process.
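
The relationship stated in the abstract (actual size close to the unranked estimate multiplied by the overflow rate) can be illustrated with a minimal Python sketch. It is not the paper's method. Two assumptions are made purely for illustration: the unranked model is stood in for by the classic two-sample Lincoln-Petersen capture-recapture estimator, and the overflow rate is taken to be the total number of matched documents divided by the total number of returned documents. The abstract does not define either, and all function names here are hypothetical.

```python
def lincoln_petersen(n1, n2, overlap):
    """Two-sample capture-recapture estimate: N is about n1 * n2 / overlap.

    Used here only as a stand-in for the "unranked model"; the abstract
    does not say which estimator the paper uses.
    """
    if overlap <= 0:
        raise ValueError("overlap must be positive to form an estimate")
    return n1 * n2 / overlap


def overflow_rate(matched, returned):
    """ASSUMED definition: total matches divided by total results returned.

    matched[i]  -- number of documents matching query i
    returned[i] -- number of documents actually returned for query i
    """
    total_returned = sum(returned)
    if total_returned == 0:
        raise ValueError("no documents were returned")
    return sum(matched) / total_returned


def corrected_estimate(unranked_estimate, rate):
    """Scale the unranked estimate by the overflow rate, as the abstract describes."""
    return unranked_estimate * rate


# Made-up numbers: two samples of 200 and 300 documents sharing 60.
unranked = lincoln_petersen(200, 300, 60)

# Three queries against a source with return limit k = 100 (made up).
matched = [150, 80, 400]
returned = [min(m, 100) for m in matched]

rate = overflow_rate(matched, returned)
print(corrected_estimate(unranked, rate))
```

When no query overflows, the assumed overflow rate is 1 and the corrected estimate equals the unranked one, which matches the abstract's point that the bias arises only from overflowing queries.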