Particle Physics Linux Condor Batch Farm

For support, email the PP Unix admin team. This creates a ticket that is seen by the whole team.

The Basics

The new batch farm runs CentOS 7 with approximately one petabyte of network-attached storage and is accessible through the "interactive" login node pplxint10. Please follow the log-in instructions described here.

At present, the CentOS 7 test farm consists of one interactive node and two CentOS 7 worker nodes. HTCondor is the job scheduler for the new batch farm.

All worker nodes are configured with 4GB of RAM per CPU core and approximately 50GB of local scratch disk per CPU core.

Please note that interactive login to the worker nodes is disabled for all users.

HTCondor Quick Start Guide

To submit jobs to the HTCondor batch system, log in to the Particle Physics “interactive” node pplxint10 and give HTCondor a submit file containing commands that tell it how to run the jobs. The batch system locates a worker node within the pool that can run each job, and the output is returned to the interactive node pplxint10.

Submitting a Job

A submit file is required. It sets the options for the HTCondor batch queue and names the executable to run. For example, the submit file myjob.submit below runs hello.py in the batch queue.

hello.py example file:

#!/usr/bin/python
# Python 2 script: prints the name of the node it runs on.
import platform

host = platform.node()
print "Hello World - ", host
print "finished"

Make the script executable first by running:

chmod +x hello.py

Submit File

An example myjob.submit file:

executable = hello.py
universe = vanilla
output = output/results.output.$(Process)
error = error/results.error.$(Process)
log = log/results.log.$(Process)
notification = never
should_transfer_files = Yes
when_to_transfer_output = ON_EXIT
queue 1

Where:

  • executable: The script or command that HTCondor runs.
  • output: where the STDOUT of the command or script should be written. This can be a relative or absolute path. Please note that HTCondor will not create the directory “output”, and the job will fail with an error if the directory does not exist.
  • error: where the STDERR of the command or script should be written. The same rules apply as for output.
  • log: This is the output of HTCondor's logs for the job. It shows the submission times, the execution host and times, and, on termination, the job statistics. Please note that it is not a log of the executable (hello.py in the example above).
  • queue: This schedules the job. It becomes more important (along with the $(Process) interpolation) when queue is given an integer value to schedule multiple jobs.
  • The commands should_transfer_files = Yes and when_to_transfer_output = ON_EXIT
    tell HTCondor to explicitly send the files, including the executable, to the worker node where the job executes, and to transfer the output back when the job exits.
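Since HTCondor does not create the output, error and log directories, it can help to create them before submitting, and to know which files a multi-job queue statement will produce. The following is a minimal Python 3 sketch, not part of the farm's tooling. The helper names expected_files and prepare_job_dirs are hypothetical, and the directory and file names are assumptions taken from the example submit file.

```python
import os


def expected_files(n_jobs, pattern="output/results.output.{process}"):
    """List the file names HTCondor would write for `queue n_jobs`."""
    # $(Process) counts from 0 to n_jobs - 1 within one cluster.
    return [pattern.format(process=i) for i in range(n_jobs)]


def prepare_job_dirs(base=".", names=("output", "error", "log")):
    """Create the directories named in the submit file if they are missing."""
    created = []
    for name in names:
        path = os.path.join(base, name)
        # exist_ok=True makes the call safe to repeat before every submission.
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created
```

Calling prepare_job_dirs() in the directory that holds myjob.submit creates output, error and log there. expected_files(3) lists the STDOUT files that "queue 3" would write with the example's output setting.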

Submitting the job

A job is added to the HTCondor queue using condor_submit in order to be executed.

On pplxint10, simply run the command:

$ condor_submit myjob.submit
Submitting job(s).
1 job(s) submitted to cluster 70.
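When submission is scripted, the cluster number is usually needed later for monitoring or removal. The sketch below, in Python 3.7 or later, runs condor_submit and reads the cluster number from its confirmation message. The function names are hypothetical, and the parsing assumes the message has the wording shown in the example above.

```python
import re
import subprocess


def parse_cluster_id(submit_output):
    """Extract the cluster number from condor_submit's confirmation text."""
    match = re.search(r"submitted to cluster (\d+)", submit_output)
    if match is None:
        raise ValueError("no cluster id found in condor_submit output")
    return int(match.group(1))


def submit(submit_file):
    """Run condor_submit on submit_file and return the new cluster id."""
    result = subprocess.run(
        ["condor_submit", submit_file],
        check=True,           # raise CalledProcessError if submission fails
        capture_output=True,  # collect stdout so it can be parsed
        text=True,
    )
    return parse_cluster_id(result.stdout)
```

On pplxint10, submit("myjob.submit") would return the cluster number as an integer, which can then be combined with the process number (for example 70.0) for later commands.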

Monitoring the job

The condor_q command prints a listing of all the jobs currently in the queue. For example, a short time after submitting the “myjob.submit” job from pplxint10, the output appears as:

$ condor_q
ID     OWNER   SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
70.0   davda   2/13 10:49   0+00:00:03  R   0    97.7  myjob.submit
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

The queue might contain many jobs.

To see only your own jobs, pass the “-submitter” option with your Unix login to the condor_q command. The example below shows only davda’s jobs:

$ condor_q -submitter davda
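The summary line printed by condor_q can also be read by a script, for instance to check how many of your jobs are still running. The following Python 3.7 or later sketch is only an illustration. The function names are hypothetical, and it assumes the summary line contains the text "jobs;" followed by "count state" pairs as in the example output above. Other HTCondor versions may format the listing differently.

```python
import re
import subprocess


def parse_queue_summary(line):
    """Turn a condor_q summary line into a dict mapping state to count."""
    return {label: int(number) for number, label in re.findall(r"(\d+) (\w+)", line)}


def find_summary(condor_q_output):
    """Locate and parse the first line that looks like the job summary."""
    for line in condor_q_output.splitlines():
        if "jobs;" in line:
            return parse_queue_summary(line)
    raise ValueError("no summary line found in condor_q output")


def queue_summary_for(user):
    """Run condor_q for one submitter and return the parsed summary."""
    result = subprocess.run(
        ["condor_q", "-submitter", user],
        check=True, capture_output=True, text=True,
    )
    return find_summary(result.stdout)
```

With the example listing, find_summary would report one job in total and one in the running state.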

Rather than monitoring the job by running condor_q repeatedly, use the condor_wait command with the job's log file:

$ condor_wait -status log/results.log.0
70.0.0 submitted
70.0.0 executing on host <163.1.136.221:9618?addrs=163.1.136.221-9618+[--1]-9618&noUDP&sock=1232_a0c4_3>
70.0.0 completed
All jobs done.
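A script that needs to block until a job finishes can wrap condor_wait in the same way. This Python 3 sketch is an assumption-laden illustration rather than a supported tool: the function names are hypothetical, and it relies on condor_wait's -wait option (a time limit in seconds) and on its exit status being 0 once the jobs in the log have completed.

```python
import subprocess


def build_wait_command(log_file, timeout_seconds=None):
    """Build the condor_wait argument list for one job log file."""
    command = ["condor_wait"]
    if timeout_seconds is not None:
        # -wait limits how long condor_wait blocks, in seconds.
        command += ["-wait", str(timeout_seconds)]
    command.append(log_file)
    return command


def wait_for_job(log_file, timeout_seconds=None):
    """Block until the jobs in log_file finish; return True on completion."""
    # A non-zero exit status means the wait timed out or the log was unreadable.
    return subprocess.run(build_wait_command(log_file, timeout_seconds)).returncode == 0
```

Pass the path given by the log setting in the submit file. Supplying timeout_seconds prevents the script from blocking indefinitely if a job is held.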

Removing a job

Successfully submitted jobs will occasionally need to be removed from the queue. Use the condor_rm command specifying the job identifier as a command line argument. For example, remove job number 70.0 from the queue with

$ condor_rm 70.0
