Automatic discovery of language models for text databases

Authors:
Jamie Callan;Margaret Connell;Aiqun Du
Affiliations:
Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts, Amherst, Massachusetts;Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts, Amherst, Massachusetts;Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts, Amherst, Massachusetts
Venue:
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Year:
1999

Abstract

The proliferation of text databases within large organizations and on the Internet makes it difficult for a person to know which databases to search. Given language models that describe the contents of each database, a database selection algorithm such as GlOSS can provide assistance by automatically selecting appropriate databases for an information need. Current practice is that each database provides its language model upon request, but this cooperative approach has important limitations. This paper demonstrates that cooperation is not required. Instead, the database selection service can construct its own language models by sampling database contents via the normal process of running queries and retrieving documents. Although random sampling is not possible, it can be approximated with carefully selected queries. This sampling approach avoids the limitations that characterize the cooperative approach, and also enables additional capabilities. Experimental results demonstrate that accurate language models can be learned from a relatively small number of queries and documents.
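
The sampling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `search(term, n)` interface, the whitespace tokenizer, the rule of drawing the next query term at random from the vocabulary learned so far, and the parameter values are all assumptions made for illustration. The only thing taken from the abstract is the overall loop: run a query, retrieve a few documents, and update term statistics from them.

```python
import random
from collections import Counter


def make_toy_search(docs):
    """Hypothetical stand-in for a remote database's ordinary search interface."""
    def search(term, n):
        # Return up to n (doc_id, text) pairs whose text contains the term.
        hits = [(i, d) for i, d in enumerate(docs) if term in d.lower().split()]
        return hits[:n]
    return search


def sample_language_model(search, seed_term, max_docs=300, docs_per_query=4, rng=None):
    """Build term statistics for a database using only queries and retrieved documents.

    max_docs and docs_per_query are illustrative defaults, not values from the abstract.
    """
    rng = rng or random.Random(0)
    term_freq = Counter()  # occurrences of each term in the sampled documents
    doc_freq = Counter()   # number of sampled documents containing each term
    seen = set()           # ids of documents already sampled
    tried = set()          # query terms already issued
    query = seed_term
    while query is not None and len(seen) < max_docs:
        tried.add(query)
        for doc_id, text in search(query, docs_per_query):
            if len(seen) >= max_docs:
                break
            if doc_id in seen:
                continue  # count each document only once
            seen.add(doc_id)
            tokens = text.lower().split()
            term_freq.update(tokens)
            doc_freq.update(set(tokens))
        # Assumed selection rule: pick the next query term at random
        # from learned terms that have not been used as queries yet.
        candidates = sorted(t for t in term_freq if t not in tried)
        query = rng.choice(candidates) if candidates else None
    return term_freq, doc_freq, len(seen)


if __name__ == "__main__":
    docs = ["apple banana", "banana cherry", "cherry date", "date apple"]
    tf, df, n = sample_language_model(make_toy_search(docs), "apple", max_docs=10)
    print(n, dict(tf))
```

In this sketch the selection service never asks the database for its index or vocabulary; everything it learns comes from documents returned by ordinary queries, which is the non-cooperative setting the abstract describes.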