Database selection is an important step when searching over large numbers of distributed text databases. The database selection task relies on statistical summaries of the database contents, which are not typically exported by databases. Previous research has developed algorithms for constructing an approximate content summary of a text database from a small document sample extracted via querying. Unfortunately, Zipf's law practically guarantees that content summaries built this way for any relatively large database will fail to cover many low-frequency words. Incomplete content summaries might negatively affect the database selection process, especially for short queries with infrequent words. To improve the coverage of approximate content summaries, we build on the observation that topically similar databases tend to have related vocabularies. Therefore, the approximate content summaries of topically related databases can complement each other and increase their coverage. Specifically, we exploit a (given or derived) hierarchical categorization of the databases and adapt the notion of "shrinkage" (a form of smoothing that has been used successfully for document classification) to the content summary construction task. A thorough evaluation over 315 real web databases as well as over TREC data suggests that the shrinkage-based content summaries are substantially more complete than their "unshrunk" counterparts. We also describe how to modify existing database selection algorithms to decide adaptively, at run time, whether to apply shrinkage for a query. Our experiments, which rely on TREC data sets, queries, and the associated "relevance judgments," show that our shrinkage-based approach significantly improves state-of-the-art database selection algorithms, and also outperforms a recently proposed hierarchical strategy that exploits database classification as well.
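To make the idea of shrinkage concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes that a content summary is a word-probability distribution and that a shrunk summary is a weighted mixture of the database's sample-based summary with the summaries of its ancestor categories. The function names, the toy word counts, and the fixed mixture weights are all hypothetical; the abstract does not say how the weights are chosen.

```python
def to_distribution(word_counts):
    """Normalize raw word counts into a probability distribution."""
    total = sum(word_counts.values())
    return {word: count / total for word, count in word_counts.items()}


def shrink(db_summary, category_summaries, weights):
    """Mix a database summary with the summaries of its ancestor categories.

    weights[0] applies to the database itself and weights[1:] to the
    categories, in the same order as category_summaries.
    The weights are assumed to be given and must sum to 1.
    """
    if len(weights) != len(category_summaries) + 1:
        raise ValueError("need one weight per summary")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")

    shrunk = {}
    for weight, summary in zip(weights, [db_summary, *category_summaries]):
        for word, prob in summary.items():
            shrunk[word] = shrunk.get(word, 0.0) + weight * prob
    return shrunk


# Hypothetical toy data: a small sample from a medical database misses
# the low-frequency word "hypertension", which its categories do cover.
sample_summary = to_distribution({"heart": 6, "blood": 4})
health_category = to_distribution({"heart": 2, "blood": 2, "hypertension": 1})
root_category = to_distribution(
    {"heart": 1, "blood": 1, "hypertension": 1, "football": 2}
)

shrunk_summary = shrink(
    sample_summary,
    [health_category, root_category],
    weights=[0.6, 0.3, 0.1],  # assumed fixed weights, for illustration only
)

for word in sorted(shrunk_summary):
    print(word, round(shrunk_summary[word], 3))
```

In this toy setting the shrunk summary gives "hypertension" a nonzero probability even though the sample never saw the word, which is the coverage gain the abstract describes. Unrelated words such as "football" also receive a small probability, which suggests why it can be useful to decide per query whether to apply shrinkage.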