We propose a set of open-source software modules that perform structured perceptron training, prediction, and evaluation within the Hadoop framework. Apache Hadoop is a freely available environment for running distributed applications on a computer cluster, and the software is designed within its MapReduce paradigm. By distributing the computation, the proposed software substantially reduces execution times while handling very large datasets. The distributed perceptron training algorithm preserves the convergence properties of the serial perceptron and therefore guarantees the same accuracy. The presented modules can be run as stand-alone software or easily extended and integrated into larger systems. Their execution on specific NLP tasks can be demonstrated and tested via an interactive web interface that lets the user inspect the status and structure of the cluster and interact with the MapReduce jobs.
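To make the training scheme concrete, the following is a minimal, hypothetical sketch of distributed perceptron training by iterative parameter mixing, the strategy whose convergence guarantees are usually invoked in this setting. It simulates the map step (one local perceptron epoch per data shard) and the reduce step (averaging the per-shard weights) in-process rather than on an actual Hadoop cluster; all function names and data are illustrative and not taken from the paper's modules, and a binary perceptron stands in for the structured one.

```python
def perceptron_epoch(weights, shard):
    """Map step: one perceptron epoch over a single data shard (binary case)."""
    w = list(weights)
    for x, y in shard:
        # Predict with the current weights; update on mistakes only.
        score = sum(wi * xi for wi, xi in zip(w, x))
        pred = 1 if score >= 0 else -1
        if pred != y:
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def mix(weight_vectors):
    """Reduce step: average the per-shard weight vectors."""
    n = len(weight_vectors)
    return [sum(ws) / n for ws in zip(*weight_vectors)]

def distributed_train(shards, dim, epochs=10):
    """Iterative parameter mixing: alternate local epochs and weight mixing."""
    weights = [0.0] * dim
    for _ in range(epochs):
        # "Map": each shard trains independently, starting from the mixed weights.
        local = [perceptron_epoch(weights, s) for s in shards]
        # "Reduce": mix (average) the locally updated weights.
        weights = mix(local)
    return weights
```

On a real cluster, each shard would be a mapper's input split and `mix` would run in a single reducer; repeating the map/reduce round per epoch is what preserves the serial perceptron's convergence behavior.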