A technique for measuring the relative size and overlap of public Web search engines
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Probe, count, and classify: categorizing hidden web databases
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Query-based sampling of text databases
ACM Transactions on Information Systems (TOIS)
Discovering the representative of a search engine
Proceedings of the eleventh international conference on Information and knowledge management
Proceedings of the 27th International Conference on Very Large Data Bases
Sampling-Based Estimation of the Number of Distinct Values of an Attribute
VLDB '95 Proceedings of the 21th International Conference on Very Large Data Bases
Relevant document distribution estimation method for resource selection
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Probe, Cluster, and Discover: Focused Extraction of QA-Pagelets from the Deep Web
ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Data & Knowledge Engineering
Downloading textual hidden web content through keyword queries
Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
Query Selection Techniques for Efficient Crawling of Structured Web Sources
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Random sampling from a search engine's index
Proceedings of the 15th international conference on World Wide Web
Capturing collection size for distributed non-cooperative retrieval
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Estimating corpus size via queries
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Efficient search engine measurements
Proceedings of the 16th international conference on World Wide Web
Adopting Wildlife Experiments for Web Evolution Estimations: The Role of an AI Web Page Classifier
WI '06 Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence
Evaluating sampling methods for uncooperative collections
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Estimating collection size with logistic regression
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Selectivity estimation of range queries based on data density approximation via cosine series
Data & Knowledge Engineering
Fishing for phishes: applying capture-recapture methods to estimate phishing populations
Proceedings of the anti-phishing working groups 2nd annual eCrime researchers summit
Generalising multiple capture-recapture to non-uniform sample sizes
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Random sampling from a search engine's index
Journal of the ACM (JACM)
Capture-recapture in software unit testing: a case study
Proceedings of the Second ACM-IEEE international symposium on Empirical software engineering and measurement
Proceedings of the VLDB Endowment
Efficient estimation of the size of text deep web data source
Proceedings of the 17th ACM conference on Information and knowledge management
An Approach to Deep Web Crawling by Sampling
WI-IAT '08 Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
Estimating deep web data source size by capture---recapture method
Information Retrieval
Improving the evaluation of web search systems
ECIR'03 Proceedings of the 25th European conference on IR research
Can we correctly estimate the total number of pages in Google for a specific language?
CICLing'03 Proceedings of the 4th international conference on Computational linguistics and intelligent text processing
Sampling hidden objects using nearest-neighbor oracles
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Expertise ranking using activity and contextual link measures
Data & Knowledge Engineering
Sampling online social networks by random walk
Proceedings of the First ACM International Workshop on Hot Topics on Interdisciplinary Social Networks Research
Size estimation of non-cooperative data collections
Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services
Selecting queries from sample to crawl deep web data sources
Web Intelligence and Agent Systems
Hi-index | 0.00 |
Many deep web data sources are ranked data sources, i.e., they rank the matched documents and return at most the top k number of results even though there are more than k documents matching the query. While estimating the size of such ranked deep web data source, it is well known that there is a ranking bias-the traditional methods tend to underestimate the size when queries overflow (match more documents than the return limit). Numerous estimation methods have been proposed to overcome the ranking bias, such as by avoiding overflowing queries during the sampling process, or by adjusting the initial estimation using a fixed function. We observe that the overflow rate has a direct impact on the accuracy of the estimation. Under certain conditions, the actual size is close to the estimation obtained by unranked model multiplied by the overflow rate. Based on this result, this paper proposes a method that allows overflowing queries in the sampling process.