CS Graduate Seminar
October 17, 2003
Speaker: Dr. Karen Villaverde
Title: Toward Automatic Management of Embarrassingly Parallel Applications
Abstract:
Large-scale applications that require executing very large numbers of
tasks are only feasible through parallelism. In this talk we present a
system that automatically handles large numbers of experiments and data
in the context of machine learning. The system controls all experiments,
including resubmission of failed jobs, and relies on available resource
managers to spawn jobs across pools of machines. The results show that
we can manage a very large number of experiments, using a reasonable
amount of idle CPU cycles, with very little user intervention. The
Condor system used in this project will also be reviewed.
Condor is a free system created by the University of Wisconsin-Madison
that allows automatic submission of jobs to a group of machines. Jobs
run on the idle CPU cycles of those machines, i.e., when the machines
are not otherwise in use, and Condor automatically migrates jobs to
other machines with idle CPU cycles when the machines currently in use
become active again.