Agreement based source selection for the multi-topic deep web integration

Authors:
Manishkumar Jha;Raju Balakrishnan;Subbarao Kambhampati
Affiliations:
Arizona State University, Tempe AZ;Arizona State University, Tempe AZ;Arizona State University, Tempe AZ
Venue:
Proceedings of the 17th International Conference on Management of Data
Year:
2011

Citing 24
Cited 0

Searching distributed collections with inference networks

SIGIR '95 Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval
Analyses of multiple evidence combination

Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval
The anatomy of a large-scale hypertextual Web search engine

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Authoritative sources in a hyperlinked environment

Journal of the ACM (JACM)
Query-based sampling of text databases

ACM Transactions on Information Systems (TOIS)
Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search

IEEE Transactions on Knowledge and Data Engineering
Relevant document distribution estimation method for resource selection

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval
Extracting structured data from Web pages

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Keyword Searching and Browsing in Databases using BANKS

ICDE '02 Proceedings of the 18th International Conference on Data Engineering
A Frequency-based Approach for Mining Coverage Statistics in Data Integration

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
When one sample is not enough: improving text database selection using shrinkage

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Improving collection selection with overlap awareness in P2P search engines

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Federated text retrieval from uncooperative overlapped collections

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Combating web spam with trustrank

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Truth Discovery with Multiple Conflicting Information Providers on the Web

IEEE Transactions on Knowledge and Data Engineering
Searching the deep web

Communications of the ACM
Exploiting web search engines to search structured databases

Proceedings of the 18th international conference on World wide web
Integrating conflicting data: the role of source dependence

Proceedings of the VLDB Endowment
Tracking the random surfer: empirically measured teleportation parameters in PageRank

Proceedings of the 19th international conference on World wide web
SourceRank: relevance and trust assessment for deep web sources based on inter-source agreement

Proceedings of the 19th international conference on World wide web
Global detection of complex copying relationships between sources

Proceedings of the VLDB Endowment
Factal: integrating deep web based on trust and relevance

Proceedings of the 20th international conference companion on World wide web
SourceRank: relevance and trust assessment for deep web sources based on inter-source agreement

Proceedings of the 20th international conference on World wide web
Heterogeneous network-based trust analysis: a survey

ACM SIGKDD Explorations Newsletter


Abstract

One immediate challenge in searching deep web databases is source selection, i.e., selecting the most relevant web databases for answering a given query. For open collections like the deep web, source selection must be sensitive to the trustworthiness and importance of sources. Recent advances solve these problems for single-topic deep web search by adapting an agreement-based approach (cf. SourceRank [10]). In this paper we introduce a source selection method, sensitive to trust and importance, for multi-topic deep web search. We compute multiple quality scores for each source, each tailored to a different topic and based on topic-specific crawl data. At query time, we classify the query to determine its probability of membership in each topic. These fractional memberships are used as weights on the topic-specific quality scores to select sources for the query. Extensive experiments on more than a thousand sources across multiple topics show 18-85% improvements in result quality over Google Product Search and other existing methods.
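
The query-time step described in the abstract can be illustrated with a minimal sketch. It assumes the topic-specific quality scores and the query's topic probabilities are already available, and it combines them with a plain weighted sum. The abstract does not give the exact formula, so the weighted sum, the function names (`query_score`, `select_sources`), the topic labels, and the numbers below are all hypothetical choices for illustration, not the authors' implementation.

```python
from typing import Dict, List

# Hypothetical types: topic -> score, and source -> (topic -> score).
TopicScores = Dict[str, float]


def query_score(topic_probs: TopicScores, source_scores: TopicScores) -> float:
    """Weight a source's topic-specific quality scores by the query's
    topic-membership probabilities (assumed here to be a weighted sum)."""
    return sum(p * source_scores.get(topic, 0.0)
               for topic, p in topic_probs.items())


def select_sources(topic_probs: TopicScores,
                   quality: Dict[str, TopicScores],
                   k: int) -> List[str]:
    """Return the k sources with the highest query-specific score.
    Ties are broken by source name so the result is deterministic."""
    ranked = sorted(
        quality,
        key=lambda src: (-query_score(topic_probs, quality[src]), src),
    )
    return ranked[:k]


# Illustrative (made-up) topic-specific quality scores for three sources.
quality = {
    "src_a": {"books": 0.9, "movies": 0.1},
    "src_b": {"books": 0.2, "movies": 0.8},
    "src_c": {"books": 0.5, "movies": 0.5},
}

# Illustrative (made-up) classifier output for one query.
topic_probs = {"books": 0.75, "movies": 0.25}

top_two = select_sources(topic_probs, quality, k=2)
```

With these made-up numbers the combined scores are roughly 0.70 for `src_a`, 0.50 for `src_c`, and 0.35 for `src_b`, so `top_two` is `["src_a", "src_c"]`. A query classified mostly as "movies" would instead favor `src_b`, which is the effect the fractional topic memberships are meant to have.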