Scalability of the Nutch search engine

Authors:
José E. Moreira;Maged M. Michael;Dilma Da Silva;Doron Shiloach;Parijat Dube;Li Zhang
Affiliations:
IBM T. J. Watson Research Center, Yorktown Heights, NY;IBM T. J. Watson Research Center, Yorktown Heights, NY;IBM T. J. Watson Research Center, Yorktown Heights, NY;IBM T. J. Watson Research Center, Yorktown Heights, NY;IBM T. J. Watson Research Center, Yorktown Heights, NY;IBM T. J. Watson Research Center, Yorktown Heights, NY
Venue:
Proceedings of the 21st annual international conference on Supercomputing
Year:
2007

References (13):
Mean value technique for closed fork-join networks. SIGMETRICS '99 Proceedings of the 1999 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
GPFS: A Shared-Disk File System for Large Computing Clusters. FAST '02 Proceedings of the Conference on File and Storage Technologies
Workload Service Requirements Analysis: A Queueing Network Optimization Approach. MASCOTS '02 Proceedings of the 10th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems
Building Nutch: Open Source Search. Queue - Search Engines
How to build a WebFountain: An architecture for very large-scale text analytics. IBM Systems Journal
Lucene in Action (In Action series)
A search engine for natural language applications. WWW '05 Proceedings of the 14th international conference on World Wide Web
BladeCenter system overview. IBM Journal of Research and Development - IBM BladeCenter systems
BladeCenter midplane and media interface card. IBM Journal of Research and Development - IBM BladeCenter systems
BladeCenter processor blades, I/O expansion adapters, and units. IBM Journal of Research and Development - IBM BladeCenter systems
POWER5 System microarchitecture. IBM Journal of Research and Development - POWER5 and packaging
Characterization of simultaneous multithreading (SMT) efficiency in POWER5. IBM Journal of Research and Development - POWER5 and packaging
MapReduce: simplified data processing on large clusters. OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6

Cited by (7):
Towards the Design of a Scalable Email Archiving and Discovery Solution. ADBIS '08 Proceedings of the 12th East European conference on Advances in Databases and Information Systems
Metadata domain-knowledge driven search engine in "HyperManyMedia" E-learning resources. CSTST '08 Proceedings of the 5th international conference on Soft computing as transdisciplinary science and technology
Metadata as seeds for building an ontology driven information retrieval system. International Journal of Hybrid Intelligent Systems
Expresses-an-opinion-about: using corpus statistics in an information extraction approach to opinion mining. COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
An innovative approach to the development of E-government search services. EGOVIS'11 Proceedings of the Second international conference on Electronic government and the information systems perspective
Accelerating text mining workloads in a MapReduce-based distributed GPU environment. Journal of Parallel and Distributed Computing
ODYS: an approach to building a massively-parallel search engine using a DB-IR tightly-integrated parallel DBMS for higher-level functionality. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Abstract

Nutch is an open-source search engine that is gaining popularity in the commercial world. The Nutch architecture lends itself to a wide range of parallelization techniques. Multiple backend servers can be used both to partition the corpus of search data, thus increasing the rate of queries serviced, and to increase the size of the search data while preserving the service rate. Alternatively, multiple search engines can operate in parallel, further increasing the query rate. In this paper, we analyze the performance and scalability of various configurations of Nutch. The configurations were implemented as part of the Commercial Scale Out project at IBM Research and were used to investigate the applicability of scale-out architectures in commercial environments. We conclude that Nutch is highly scalable, although the configurations differ in their performance characteristics.