Query-based sampling of text databases

Authors:
Jamie Callan;Margaret Connell
Affiliations:
Carnegie Mellon Univ.;Univ. of Massachusetts
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2001


Abstract

The proliferation of searchable text databases on corporate networks and the Internet causes a database selection problem for many people. Algorithms such as gGLOSS and CORI can automatically select which text databases to search for a given information need, but only if given a set of resource descriptions that accurately represent the contents of each database. The existing techniques for acquiring resource descriptions have significant limitations when used in wide-area networks controlled by many parties. This paper presents query-based sampling, a new technique for acquiring accurate resource descriptions. Query-based sampling does not require the cooperation of resource providers, nor does it require that resource providers use a particular search engine or representation technique. An extensive set of experimental results demonstrates that accurate resource descriptions are created, that computation and communication costs are reasonable, and that the resource descriptions do in fact enable accurate automatic database selection.
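
The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of how a query-based sampler might work, assuming the database exposes nothing but a search interface. The names `query_based_sample`, `make_toy_search`, `search`, `max_docs`, and `docs_per_query` are hypothetical, as are the default values and the choice to draw the next query term at random from the vocabulary learned so far.

```python
import random
from collections import Counter


def query_based_sample(search, seed_term, max_docs=300, docs_per_query=4, rng=None):
    """Build a rough resource description (term -> document frequency).

    `search(term, n)` is a hypothetical interface returning up to n
    (doc_id, text) pairs; it stands in for any uncooperative search engine.
    """
    rng = rng or random.Random(0)
    seen = set()      # ids of documents already sampled
    df = Counter()    # learned document frequencies
    tried = set()     # query terms already issued
    term = seed_term
    while term is not None and len(seen) < max_docs:
        tried.add(term)
        for doc_id, text in search(term, docs_per_query):
            if len(seen) >= max_docs:
                break
            if doc_id in seen:
                continue
            seen.add(doc_id)
            # Count each term once per document.
            df.update(set(text.lower().split()))
        # Assumption: the next query term is drawn from the learned vocabulary.
        candidates = sorted(t for t in df if t not in tried)
        term = rng.choice(candidates) if candidates else None
    return df, seen


def make_toy_search(docs):
    """Toy stand-in for a remote database: exact term match, first n hits."""
    def search(term, n):
        hits = [(i, d) for i, d in enumerate(docs) if term in d.lower().split()]
        return hits[:n]
    return search
```

For instance, calling `query_based_sample(make_toy_search(docs), "apple")` on a small list of strings returns the learned document frequencies and the set of sampled document ids. Documents that share no vocabulary with anything reachable from the seed term are never sampled, which illustrates why the choice of query terms matters.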