Size estimation of non-cooperative data collections

Authors:
Mohammadreza Khelghati;Djoerd Hiemstra;Maurice van Keulen
Affiliations:
University of Twente, Netherlands;University of Twente, Netherlands;University of Twente, Netherlands
Venue:
Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services
Year:
2012

Citing 16
Cited 0

A technique for measuring the relative size and overlap of public Web search engines

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Introduction to Monte Carlo methods

Learning in graphical models
Query-based sampling of text databases

ACM Transactions on Information Systems (TOIS)
Discovering the representative of a search engine

Proceedings of the tenth international conference on Information and knowledge management
Sampling search-engine results

WWW '05 Proceedings of the 14th international conference on World Wide Web
The indexable web is more than 11.5 billion pages

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Random sampling from a search engine's index

Proceedings of the 15th international conference on World Wide Web
Capturing collection size for distributed non-cooperative retrieval

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Estimating corpus size via queries

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Efficient search engine measurements

Proceedings of the 16th international conference on World Wide Web
Estimating collection size with logistic regression

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Generalising multiple capture-recapture to non-uniform sample sizes

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Estimating deep web data source size by capture-recapture method

Information Retrieval
Unbiased estimation of size and other aggregates over hidden web databases

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Ranking bias in deep web size estimation using capture recapture method

Data & Knowledge Engineering
Efficient Search Engine Measurements

ACM Transactions on the Web (TWEB)

Abstract

As the amount of data in deep web sources (hidden from general search engines behind web forms) grows, accessing this data has gained more attention. Algorithms that crawl or sample these sources rely on knowledge of the data source size to decide accurately when to stop, since crawling and sampling can be very costly in some cases [14]. Interest in the sizes of data sources is further increased by competition among businesses on the Web, for which data coverage is critical. Size information is also helpful for the quality assessment of search engines [7], for search engine selection in federated search, and for resource/collection selection in distributed search [19]. In addition, it can provide useful statistics for public-sector bodies such as governments. In all of these scenarios, when a collection is non-cooperative and does not publish its information, its size has to be estimated [17]. In this paper, the approaches suggested in the literature for this purpose are categorized and reviewed. The most recent approaches are implemented and compared in a real environment. Finally, four methods based on modifications of the available techniques are introduced and evaluated. With one of these modifications, the estimations from other approaches could be improved by 35 to 65 percent.