Exploiting correlated keywords to improve approximate information filtering

Authors:
Christian Zimmer;Christos Tryfonopoulos;Gerhard Weikum
Affiliations:
Max-Planck Institut for Informatics, Saarbrücken, Germany;Max-Planck Institut for Informatics, Saarbrücken, Germany;Max-Planck Institut for Informatics, Saarbrücken, Germany
Venue:
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2008

Abstract

Information filtering, also referred to as publish/subscribe, complements one-time searching since users are able to subscribe to information sources and be notified whenever new documents of interest are published. In approximate information filtering, only selected information sources that are likely to publish documents relevant to the user's interests in the future are monitored. To achieve this functionality, a subscriber exploits statistical metadata to identify promising publishers and indexes its continuous query only at those publishers. The statistics are maintained in a directory, usually on a per-keyword basis, thus disregarding possible correlations among keywords. With such coarse information, poor publisher selection may lead to poor filtering performance and thus to the loss of interesting documents. Based on this observation, this work extends query routing techniques from the domain of distributed information retrieval in peer-to-peer (P2P) networks and provides new algorithms for exploiting the correlation among keywords in a filtering setting. We develop and evaluate two algorithms based on single-key and multi-key statistics and utilize two different synopses (Hash Sketches and KMV synopses) to compactly represent publishers. Our experimental evaluation using two real-life corpora with web and blog data demonstrates the filtering effectiveness of both approaches and highlights the different tradeoffs.
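
To make the idea of compact per-keyword synopses more concrete, the following is a minimal sketch of a generic KMV (k minimum values) synopsis in Python. It is not the algorithm from the paper: the function names (`kmv`, `estimate_distinct`, `estimate_intersection`), the SHA-1 based hash, and the toy document identifiers are all assumptions made for illustration. The sketch shows how two single-keyword synopses for one publisher might be combined to estimate how many documents contain both keywords, which can differ noticeably from an estimate that treats the keywords as independent.

```python
import hashlib


def h(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1) (illustrative hash choice)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv(items, k: int) -> list:
    """Build a KMV synopsis: the k smallest distinct hash values, sorted ascending."""
    return sorted({h(x) for x in items})[:k]


def estimate_distinct(syn: list, k: int) -> float:
    """Estimate the number of distinct items summarized by a synopsis."""
    if len(syn) < k:
        # Fewer than k values stored: the synopsis holds every item, so the count is exact.
        return float(len(syn))
    return (k - 1) / syn[k - 1]


def estimate_intersection(syn_a: list, syn_b: list, k: int) -> float:
    """Estimate |A ∩ B| from two synopses that were built with the same k."""
    combined = sorted(set(syn_a) | set(syn_b))[:k]
    common = set(syn_a) & set(syn_b)
    if len(combined) < k:
        # Both synopses hold their full sets, so the overlap is exact.
        return float(len(common))
    # Fraction of the k smallest union values that occur in both synopses,
    # scaled by the distinct-value estimate for the union.
    shared = sum(1 for v in combined if v in common)
    return (shared / k) * (k - 1) / combined[k - 1]


def demo() -> None:
    # Hypothetical publisher with 10,000 documents and two overlapping keywords.
    k = 1024
    docs_a = [f"doc{i}" for i in range(0, 6000)]       # documents containing keyword a
    docs_b = [f"doc{i}" for i in range(4000, 10000)]   # documents containing keyword b
    syn_a, syn_b = kmv(docs_a, k), kmv(docs_b, k)

    both = estimate_intersection(syn_a, syn_b, k)
    # Independence assumption: df(a) * df(b) / N, ignoring keyword correlation.
    independent = len(docs_a) * len(docs_b) / 10000
    print(f"synopsis-based estimate of docs with both keywords: {both:.0f}")
    print(f"independence-based estimate: {independent:.0f}")
    print("true overlap: 2000")


if __name__ == "__main__":
    demo()
```

In this toy setup the true overlap is 2,000 documents, while the independence assumption yields 3,600; the synopsis-based estimate is approximate and its accuracy depends on `k`. Hash sketches, the other synopsis type named in the abstract, would need a different combination rule and are not covered here.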