C4.5: programs for machine learning
C4.5: programs for machine learning
Interfaces for distributed systems of information servers
Journal of the American Society for Information Science
NetSerf: using semantic knowledge to find Internet information archives
SIGIR '95 Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval
Searching distributed collections with inference networks
SIGIR '95 Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval
Learning collection fusion strategies
SIGIR '95 Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval
Content routing: a scalable architecture for network-based information discovery
Content routing: a scalable architecture for network-based information discovery
STARTS: Stanford proposal for Internet meta-searching
SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
Learning to Understand Information on the Internet: AnExample-Based Approach
Journal of Intelligent Information Systems - Special issue: next generation information technologies and systems
Experiences with selecting search engines using metasearch
ACM Transactions on Information Systems (TOIS)
A probabilistic model for distributed information retrieval
Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval
Statistical methods for speech recognition
Statistical methods for speech recognition
Effective retrieval with distributed collections
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Evaluating database selection techniques: a testbed and experiment
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Methods for information server selection
ACM Transactions on Information Systems (TOIS)
Automatic discovery of language models for text databases
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Comparing the performance of database selection algorithms
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
A probabilistic solution to the selection and fusion problem in distributed information retrieval
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Cluster-based language models for distributed retrieval
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
A decision-theoretic approach to database selection in networked IR
ACM Transactions on Information Systems (TOIS)
Efficient and effective metasearch for a large number of text databases
Proceedings of the eighth international conference on Information and knowledge management
GlOSS: text-source discovery over the Internet
ACM Transactions on Database Systems (TODS)
Server selection on the World Wide Web
DL '00 Proceedings of the fifth ACM conference on Digital libraries
Snowball: extracting relations from large plain-text collections
DL '00 Proceedings of the fifth ACM conference on Digital libraries
The impact of database selection on distributed searching
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Query routing for Web search engines: architectures and experiments
Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
Collection selection and results merging with topically organized U.S. patents and TREC data
Proceedings of the ninth international conference on Information and knowledge management
Text Database Discovery on the Web: Neural Net Based Approach
Journal of Intelligent Information Systems
Efficient and effective metasearch for text databases incorporating linkages among documents
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Query-based sampling of text databases
ACM Transactions on Information Systems (TOIS)
A study of smoothing methods for language models applied to Ad Hoc information retrieval
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Mining the web to create minority language corpora
Proceedings of the tenth international conference on Information and knowledge management
Extracting query modifications from nonlinear SVMs
Proceedings of the 11th international conference on World Wide Web
Two-stage language models for information retrieval
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Introduction to Modern Information Retrieval
Introduction to Modern Information Retrieval
A language modeling framework for resource selection and results merging
Proceedings of the eleventh international conference on Information and knowledge management
QProber: A system for automatic classification of hidden-Web databases
ACM Transactions on Information Systems (TOIS)
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features
ECML '98 Proceedings of the 10th European Conference on Machine Learning
Improving Text Classification by Shrinkage in a Hierarchy of Classes
ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning
Determining Text Databases to Search in the Internet
VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Proceedings of the 27th International Conference on Very Large Data Bases
Server Ranking for Distributed Text Retrieval Systems on the Internet
Proceedings of the Fifth International Conference on Database Systems for Advanced Applications (DASFAA)
Detection of Heterogeneities in a Multiple Text Database Environment
COOPIS '99 Proceedings of the Fourth IECIS International Conference on Cooperative Information Systems
Obtaining Language Models of Web Collections Using Query-Based Sampling Techniques
HICSS '02 Proceedings of the 35th Annual Hawaii International Conference on System Sciences (HICSS'02)-Volume 3 - Volume 3
Relevant document distribution estimation method for resource selection
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Pharos: a scalable distributed architecture for locating heterogeneous information sources
Pharos: a scalable distributed architecture for locating heterogeneous information sources
Language Modeling for Information Retrieval
Language Modeling for Information Retrieval
Comparing the performance of collection selection algorithms
ACM Transactions on Information Systems (TOIS)
Pattern Classification (2nd Edition)
Pattern Classification (2nd Edition)
A Probabilistic Approach to Metasearching with Adaptive Probing
ICDE '04 Proceedings of the 20th International Conference on Data Engineering
A study of smoothing methods for language models applied to information retrieval
ACM Transactions on Information Systems (TOIS)
When one sample is not enough: improving text database selection using shrinkage
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Unified utility maximization framework for resource selection
Proceedings of the thirteenth ACM international conference on Information and knowledge management
Server selection methods in hybrid portal search
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Modeling search engine effectiveness for federated search
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
USITS'97 Proceedings of the USENIX Symposium on Internet Technologies and Systems on USENIX Symposium on Internet Technologies and Systems
Distributed search over the hidden web: hierarchical database sampling and selection
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Central-rank-based collection selection in uncooperative distributed information retrieval
ECIR'07 Proceedings of the 29th European conference on IR research
Learning trees and rules with set-valued features
AAAI'96 Proceedings of the thirteenth national conference on Artificial intelligence - Volume 1
Approximate content summary for database selection in deep web data integration
WAIM'10 Proceedings of the 2010 international conference on Web-age information management
Foundations and Trends in Information Retrieval
Measuring similarity of chinese web databases based on category hierarchy
APWeb'11 Proceedings of the 13th Asia-Pacific web conference on Web technologies and applications
A multi-collection latent topic model for federated search
Information Retrieval
Information Systems
Hi-index | 0.00 |
Many valuable text databases on the web have noncrawlable contents that are “hidden” behind search interfaces. Metasearchers are helpful tools for searching over multiple such “hidden-web” text databases at once through a unified query interface. An important step in the metasearching process is database selection, or determining which databases are the most relevant for a given user query. The state-of-the-art database selection techniques rely on statistical summaries of the database contents, generally including the database vocabulary and associated word frequencies. Unfortunately, hidden-web text databases typically do not export such summaries, so previous research has developed algorithms for constructing approximate content summaries from document samples extracted from the databases via querying. We present a novel “focused-probing” sampling algorithm that detects the topics covered in a database and adaptively extracts documents that are representative of the topic coverage of the database. Our algorithm is the first to construct content summaries that include the frequencies of the words in the database. Unfortunately, Zipf's law practically guarantees that for any relatively large database, content summaries built from moderately sized document samples will fail to cover many low-frequency words; in turn, incomplete content summaries might negatively affect the database selection process, especially for short queries with infrequent words. To enhance the sparse document samples and improve the database selection decisions, we exploit the fact that topically similar databases tend to have similar vocabularies, so samples extracted from databases with a similar topical focus can complement each other. We have developed two database selection algorithms that exploit this observation. The first algorithm proceeds hierarchically and selects the best categories for a query, and then sends the query to the appropriate databases in the chosen categories. The second algorithm uses “shrinkage,” a statistical technique for improving parameter estimation in the face of sparse data, to enhance the database content summaries with category-specific words. We describe how to modify existing database selection algorithms to adaptively decide (at runtime) whether shrinkage is beneficial for a query. A thorough evaluation over a variety of databases, including 315 real web databases as well as TREC data, suggests that the proposed sampling methods generate high-quality content summaries and that the database selection algorithms produce significantly more relevant database selection decisions and overall search results than existing algorithms.