Studying the clustering paradox and scalability of search in highly distributed environments

Authors:
Weimao Ke; Javed Mostafa
Affiliations:
Drexel University, Philadelphia, PA; University of North Carolina at Chapel Hill, Chapel Hill, NC
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2013

Abstract

With the ubiquitous production, distribution, and consumption of information, today's digital environments such as the Web are increasingly large and decentralized. Central control over information collections and systems in these environments is rarely possible. Searching these information spaces raises problems beyond the traditional boundaries of information retrieval (IR) research. This article addresses one important aspect of the scalability challenges facing information retrieval models and investigates a decentralized, organic view of information systems for search in large-scale networks. Drawing on observations from earlier studies, we conduct a series of experiments on decentralized search in large-scale networked information spaces. Results show that the way distributed systems interconnect is crucial to retrieval performance and to the scalability of search. In particular, across various experimental settings and retrieval tasks, we find a consistent phenomenon, the Clustering Paradox, in which the level of network clustering (semantic overlay) imposes a scalability limit. Scalable search is well supported by a specific, balanced level of network clustering that emerges from local system interconnectivity. Departure from that level, toward either stronger or weaker clustering, degrades search performance, and the degradation is dramatic in large-scale networks.