CS Graduate Seminar
October 17, 2003

Speaker: Dr. Karen Villaverde

Title: Toward Automatic Management of Embarrassingly Parallel Applications

Abstract:
Large-scale applications that require executing very large numbers of tasks are feasible only through parallelism. In this talk we present a system that automatically handles large numbers of experiments and data in the context of machine learning. The system controls all experiments, including resubmission of failed jobs, and relies on available resource managers to spawn jobs across pools of machines. The results show that we can manage a very large number of experiments, using a reasonable amount of idle CPU cycles, with very little user intervention. The Condor system used in this project will also be reviewed. Condor is a free system developed at the University of Wisconsin-Madison that allows automatic submission of jobs to a group of machines. Jobs run on the idle CPU cycles of those machines, i.e., when the machines are not otherwise in use, and Condor automatically migrates a job to another idle machine when the machine it is running on becomes active again.