Data sharing and information retrieval in wide-area distributed systems

  • Authors:
  • Chunqiang Tang; Sandhya Dwarkadas

  • Venue:
  • Data sharing and information retrieval in wide-area distributed systems
  • Year:
  • 2004

Abstract

This dissertation addresses two problems related to the management of data and information in wide-area distributed systems: distributed shared state and peer-to-peer information retrieval.

Distributed applications typically resort to ad hoc protocols built on top of remote invocation (e.g., Sun RPC or Java RMI) to maintain the coherence and consistency of shared state, that is, information needed at more than one site. We instead propose to automate the management of shared state. As a complement to remote invocation rather than a replacement for it, our InterWeave system provides a unified programming environment that supports shared-memory programming, remote invocation, relaxed coherence models, and transactions in a single application. InterWeave is the first system that automates the typesafe sharing of structured data in its internal form across heterogeneous platforms and multiple languages. Our evaluations show that InterWeave introduces minimal overhead while reducing bandwidth consumption and improving performance in important cases.

The second problem we study is peer-to-peer information retrieval (P2P IR). P2P systems have attracted tremendous interest in recent years, but full-text search of information stored in P2P systems remains particularly challenging. We address this challenge with an interdisciplinary approach, making innovations in multiple fields (networks, systems, IR, and databases) when designing the components of our systems. Two properties underlie our solutions: document clustering (i.e., indices stored on a node share similar features) and complete local indexing (i.e., if a node is involved in hosting the index for a document, it always stores the complete index for that document). Document clustering helps limit a search to only the nodes hosting relevant indices. Complete local indexing allows each node to rank the documents in its indices without consulting other nodes.

We propose two independent systems, eSearch and pSearch, that have these properties and are built on top of distributed hash tables (DHTs). Both systems take advantage of the semantic information provided by modern IR algorithms: eSearch uses keywords to cluster documents, while pSearch uses concepts derived from latent semantic indexing (LSI). Both are efficient and achieve retrieval quality comparable to the centralized baselines.
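
The two properties named in the abstract, keyword-based document clustering and complete local indexing, can be illustrated with a toy sketch. This is not the eSearch implementation: the ring size `NUM_NODES`, the hash-based `node_for` lookup (a stand-in for a real DHT), the `top_k` cutoff, and the raw term-count scoring are all assumptions made for illustration.

```python
import hashlib
from collections import Counter

NUM_NODES = 8  # hypothetical number of nodes in the overlay


def node_for(term: str) -> int:
    """Map a term to a node id with a stable hash (stand-in for a DHT lookup)."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_NODES


class Node:
    def __init__(self):
        # doc_id -> term counts for the WHOLE document (complete local index)
        self.docs = {}

    def store(self, doc_id, term_counts):
        self.docs[doc_id] = term_counts

    def rank(self, query_terms):
        """Score local documents using only locally stored data."""
        scores = {}
        for doc_id, counts in self.docs.items():
            score = sum(counts[t] for t in query_terms)
            if score > 0:
                scores[doc_id] = score
        return scores


def publish(nodes, doc_id, text, top_k=2):
    """Store the complete index of a document on the nodes that are
    responsible for its top_k most frequent terms (keyword clustering)."""
    counts = Counter(text.lower().split())
    for term, _ in counts.most_common(top_k):
        nodes[node_for(term)].store(doc_id, counts)


def search(nodes, query):
    """Contact only the nodes responsible for the query terms and merge
    their locally computed rankings."""
    terms = query.lower().split()
    merged = {}
    for node_id in {node_for(t) for t in terms}:
        merged.update(nodes[node_id].rank(terms))
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))


nodes = [Node() for _ in range(NUM_NODES)]
publish(nodes, "d1", "dht dht dht routing routing overlay")
publish(nodes, "d2", "index index index ranking ranking dht")
print(search(nodes, "dht routing"))
```

Because every hosting node holds the full term counts for a document, a document's score is the same whichever node computes it, so merging results needs no cross-node communication. In this toy version, a query on a term outside a document's `top_k` terms may miss that document; that trade-off belongs to the sketch and is not a claim about the systems described above.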