Finding the Most Similar Documents across Multiple Text Databases

Authors:
Clement Yu; King-Lup Liu; Wensheng Wu; Weiyi Meng; Naphtali Rishe
Venue:
ADL '99 Proceedings of the IEEE Forum on Research and Technology Advances in Digital Libraries
Year:
1999

Abstract

In this paper, we present a methodology for finding the n most similar documents across multiple text databases, for any given query and any positive integer n. The methodology consists of two steps. First, the databases are ranked. Next, documents are retrieved from the databases in that ranked order, following a specific retrieval procedure. If the databases containing the n most similar documents for a given query are ranked ahead of the other databases, the methodology guarantees that the n most similar documents for the query are retrieved. A statistical method is provided to identify the databases that are each estimated to contain at least one of the n most similar documents. Several strategies are then presented for retrieving documents from the identified databases. Experimental results illustrate the relative performance of the different strategies.
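
The two-step idea described in the abstract (rank the databases, then retrieve from them in that order) can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not state how databases are ranked or how retrieval stops, so the per-database estimate of the best similarity, the in-memory document lists, the stopping rule, and every name below (`top_n_across_databases`, `estimated_best`, the sample data) are assumptions made only for illustration.

```python
import heapq
from typing import Dict, List, Tuple


def top_n_across_databases(
    databases: Dict[str, List[Tuple[str, float]]],
    estimated_best: Dict[str, float],
    n: int,
) -> Tuple[List[Tuple[str, float]], List[str]]:
    """Return (top-n documents, names of the databases actually searched).

    databases:      hypothetical stand-in for remote collections; maps a
                    database name to (doc_id, similarity-to-query) pairs.
    estimated_best: assumed estimate, per database, of the similarity of
                    its most similar document to the query.
    """
    # Step 1: rank databases by the assumed estimate, best first.
    ranked = sorted(databases, key=lambda name: estimated_best[name], reverse=True)

    best: List[Tuple[float, str]] = []  # min-heap holding at most n entries
    searched: List[str] = []

    # Step 2: visit databases in ranked order.
    for name in ranked:
        # Assumed stopping rule: stop once n documents are held and the next
        # database is not expected to beat the weakest of them.
        if len(best) == n and best[0][0] >= estimated_best[name]:
            break
        searched.append(name)
        for doc_id, sim in databases[name]:
            if len(best) < n:
                heapq.heappush(best, (sim, doc_id))
            elif sim > best[0][0]:
                heapq.heapreplace(best, (sim, doc_id))

    top = [(doc_id, sim) for sim, doc_id in sorted(best, reverse=True)]
    return top, searched


# Made-up sample data for illustration only.
sample_databases = {
    "A": [("a1", 0.9), ("a2", 0.4)],
    "B": [("b1", 0.8), ("b2", 0.7)],
    "C": [("c1", 0.3)],
}
sample_estimates = {"A": 0.9, "B": 0.8, "C": 0.3}

top_docs, searched_dbs = top_n_across_databases(sample_databases, sample_estimates, 3)
print(top_docs)
print(searched_dbs)
```

In this sample, the estimates happen to be exact, so database C is never searched. With inaccurate estimates, the sketch could stop too early or search more databases than needed, which is the situation the abstract's guarantee condition addresses.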