zymake: a computational workflow system for machine learning and natural language processing

Authors: Eric Breck
Affiliations: Cornell University, Ithaca, NY
Venue: SETQA-NLP '08 Software Engineering, Testing, and Quality Assurance for Natural Language Processing
Year: 2008

Abstract

Experiments in natural language processing and machine learning typically involve running a complicated network of programs to create, process, and evaluate data. Researchers often write one or more UNIX shell scripts to "glue" together these various pieces, but such scripts are suboptimal for several reasons. Without significant additional work, a script does not handle recovering from failures, it requires keeping track of complicated filenames, and it does not support running processes in parallel. In this paper, we present zymake as a solution to all these problems. zymake scripts look like shell scripts, but have semantics similar to makefiles. Using zymake improves repeatability and scalability of running experiments, and provides a clean, simple interface for assembling components. A zymake script also serves as documentation for the complete workflow. We present a zymake script for a published set of NLP experiments, and demonstrate that it is superior to alternative solutions, including shell scripts and makefiles, while being far simpler to use than scientific grid computing systems.
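The phrase "semantics similar to makefiles" can be illustrated with a minimal sketch. The code below is not zymake syntax and not the paper's implementation; it only shows the general makefile idea, assumed here to mean that a step runs when its output file is missing or older than one of its inputs. The names `needs_rebuild` and `run_step` are hypothetical.

```python
import os


def needs_rebuild(target, sources):
    """Return True if target is missing or older than any source file."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(s) > target_mtime for s in sources)


def run_step(target, sources, action):
    """Run action(target, sources) only when target is out of date.

    Returns True if the action ran, False if the step was skipped.
    This is an illustrative helper, not part of zymake.
    """
    if needs_rebuild(target, sources):
        action(target, sources)
        return True
    return False
```

Under this assumption, rerunning a workflow after a failure repeats only the steps whose outputs are missing or stale, which is the recovery behavior the abstract says a plain shell script lacks without extra work.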