When one sample is not enough: improving text database selection using shrinkage

Authors:
Panagiotis G. Ipeirotis; Luis Gravano
Affiliations:
Columbia University; Columbia University
Venue:
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Year:
2004

Abstract

Database selection is an important step when searching over large numbers of distributed text databases. The database selection task relies on statistical summaries of the database contents, which are not typically exported by databases. Previous research has developed algorithms for constructing an approximate content summary of a text database from a small document sample extracted via querying. Unfortunately, Zipf's law practically guarantees that content summaries built this way for any relatively large database will fail to cover many low-frequency words. Incomplete content summaries might negatively affect the database selection process, especially for short queries with infrequent words. To improve the coverage of approximate content summaries, we build on the observation that topically similar databases tend to have related vocabularies. Therefore, the approximate content summaries of topically related databases can complement each other and increase their coverage. Specifically, we exploit a (given or derived) hierarchical categorization of the databases and adapt the notion of "shrinkage" (a form of smoothing that has been used successfully for document classification) to the content summary construction task. A thorough evaluation over 315 real web databases as well as over TREC data suggests that the shrinkage-based content summaries are substantially more complete than their "unshrunk" counterparts. We also describe how to modify existing database selection algorithms to adaptively decide, at run time, whether to apply shrinkage for a query. Our experiments, which rely on TREC data sets, queries, and the associated "relevance judgments," show that our shrinkage-based approach significantly improves state-of-the-art database selection algorithms, and also outperforms a recently proposed hierarchical strategy that exploits database classification as well.
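The shrinkage idea described in the abstract can be illustrated with a minimal sketch. It assumes that a content summary maps each word to the fraction of sampled documents containing it, and that the shrunk summary is a weighted mixture of the database's own summary and the summaries of its ancestor categories. The function names, the toy documents, the category summary, and the fixed mixture weights are all hypothetical; the abstract does not say how the weights are chosen, so this is an illustration of the general technique rather than the authors' implementation.

```python
from collections import Counter


def word_probs(docs):
    """Estimate p(w) as the fraction of sampled documents that contain w."""
    n = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each word at most once per document.
        doc_freq.update(set(doc.lower().split()))
    return {w: c / n for w, c in doc_freq.items()}


def shrink(db_summary, ancestor_summaries, weights):
    """Mix a database summary with the summaries of its ancestor categories.

    weights[0] applies to the database itself and weights[1:] to the
    ancestors, in the same order as ancestor_summaries. The weights are
    assumed to be given and must sum to 1.
    """
    if len(weights) != len(ancestor_summaries) + 1:
        raise ValueError("need one weight per summary")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    summaries = [db_summary, *ancestor_summaries]
    vocabulary = set().union(*summaries)
    return {
        w: sum(lam * s.get(w, 0.0) for lam, s in zip(weights, summaries))
        for w in vocabulary
    }


# Hypothetical two-document sample from a medical database.
db_summary = word_probs(["heart surgery", "heart disease"])

# Hypothetical summary of the parent category (for example, "Health").
health_summary = {"heart": 0.4, "hypertension": 0.1}

# Fixed, made-up mixture weights: 0.8 for the database, 0.2 for the category.
shrunk = shrink(db_summary, [health_summary], [0.8, 0.2])
```

In this toy run the word "hypertension" never appears in the sample, so the unshrunk summary gives it no probability at all, while the shrunk summary assigns it a small nonzero estimate borrowed from the category. This is the coverage gain for low-frequency words that the abstract refers to.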