Two-stage statistical language models for text database selection

Authors:
Hui Yang;Minjie Zhang
Affiliations:
School of Information Technology and Computer Science, University of Wollongong, Wollongong, Australia 2500;School of Information Technology and Computer Science, University of Wollongong, Wollongong, Australia 2500
Venue:
Information Retrieval
Year:
2006

Abstract

As the number and diversity of distributed Web databases on the Internet grow exponentially, it becomes difficult for users to know which databases are appropriate to search. Given database language models that describe the content of each database, database selection services can help users locate the databases relevant to their information needs. In this paper, we propose a database selection approach based on statistical language modeling. The basic idea behind the approach is that, for databases that are categorized into a topic hierarchy, individual language models are estimated at different search stages, and the databases are then ranked by their similarity to the query according to the estimated language models. Two-stage smoothed language models are presented to circumvent the inaccuracy caused by word sparseness. Experimental results demonstrate that this language modeling approach is competitive with current state-of-the-art database selection approaches.
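The abstract does not give the estimation formulas, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a two-level topic hierarchy, unigram language models built from raw term counts, and Jelinek-Mercer (linear interpolation) smoothing with a hypothetical weight `lam`. Stage one picks the category whose model best explains the query; stage two ranks the databases in that category, smoothing each database model with its category model. All names (`mle`, `log_likelihood`, `select_databases`, the toy `hierarchy`) are illustrative.

```python
import math
from collections import Counter


def mle(counts):
    """Maximum-likelihood unigram model from raw term counts."""
    total = sum(counts.values()) or 1
    return {word: n / total for word, n in counts.items()}


def log_likelihood(query, model, background, lam=0.5, floor=1e-9):
    """Query log-likelihood under `model`, interpolated with `background`.

    Assumption: Jelinek-Mercer smoothing; `floor` keeps the probability
    positive for words that the background model has never seen.
    """
    score = 0.0
    for word in query:
        p = lam * model.get(word, 0.0) + (1 - lam) * max(background.get(word, 0.0), floor)
        score += math.log(p)
    return score


def select_databases(query, hierarchy, lam=0.5):
    """Return (best category, databases in it ranked by query likelihood)."""
    # Pool the term counts of each category's databases, then of all categories.
    category_counts = {cat: sum(dbs.values(), Counter()) for cat, dbs in hierarchy.items()}
    global_model = mle(sum(category_counts.values(), Counter()))
    category_models = {cat: mle(counts) for cat, counts in category_counts.items()}

    # Stage 1: category models smoothed with the global model.
    best = max(
        category_models,
        key=lambda cat: log_likelihood(query, category_models[cat], global_model, lam),
    )

    # Stage 2: database models smoothed with the chosen category model.
    db_scores = {
        name: log_likelihood(query, mle(counts), category_models[best], lam)
        for name, counts in hierarchy[best].items()
    }
    return best, sorted(db_scores, key=db_scores.get, reverse=True)


# Toy data, invented purely for illustration.
hierarchy = {
    "sports": {
        "football_db": Counter({"football": 8, "league": 2}),
        "tennis_db": Counter({"tennis": 7, "league": 3}),
    },
    "health": {
        "diet_db": Counter({"diet": 6, "football": 1, "clinic": 3}),
    },
}
query = ["football", "league"]
print(select_databases(query, hierarchy))
```

Smoothing matters here because a database whose sample never contains a query word would otherwise get zero probability; interpolating with the parent model gives such words a small but nonzero weight.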