Particle Physics Linux Condor Batch Farm
For support, email the PP Unix admin team; this creates a ticket that is seen by the whole team.
The Basics
The batch farm runs Enterprise Linux 9 (EL9), has approximately 1.8 PB of network-attached storage, and is accessible through the "interactive" login nodes pplxint12 and pplxint13. Please follow the log-in instructions described here.
The EL9 cluster consists of two interactive nodes and a number of EL9 worker nodes. HTCondor is the job scheduler for the batch farm.
All worker nodes are configured with 4 GB of RAM and approximately 50 GB of local scratch disk per CPU core.
Please note that interactive login to the worker nodes is disabled for all users.
HTCondor Quick Start Guide
To submit jobs to the HTCondor batch system, log in to either of the Particle Physics "interactive" nodes, pplxint12 or pplxint13, and create a submit file containing commands that tell HTCondor how to run your jobs. The batch system locates a worker node in the pool that can run each job, and the output is returned to the interactive nodes.
Submitting a Job
A submit file is required; it sets options for the HTCondor batch queue and names the executable to run. For example, the submit file myjob.submit below runs hello.py in the batch queue.
hello.py example file:
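A minimal sketch (the exact contents do not matter; any executable script will do):

    #!/usr/bin/env python3
    # Minimal example job: print a greeting and the host it runs on
    import socket
    print("Hello from", socket.gethostname())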
Make the script executable first by running:
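    chmod +x hello.py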
Submit File
An example myjob.submit file:
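A sketch along the lines described below (the "output" directory and the hello.log file name are example choices):

    # myjob.submit - run hello.py once on a worker node
    executable = hello.py
    getenv     = true
    output     = output/hello.out
    error      = output/hello.err
    log        = hello.log
    queue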
Where:
- executable: The script or command that HTCondor runs.
- getenv: if set to true, the job inherits the user's login environment.
- output: where the STDOUT of the command or script is written. This can be a relative or absolute path. Please note that the directory "output" is not created automatically; the job will fail if the directory does not exist.
- error: where the STDERR of the command or script is written. The same rules apply as for output.
- log: the file where HTCondor writes its own log for the job. It records the submission time, the execution host and times, and, on termination, the job statistics. Note that this is not the log of the executable itself (hello.py in the example above).
- queue: schedules the job. It becomes more important (along with variable interpolation) when it is given an integer argument to schedule multiple jobs; see the sketch after this list.
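For example, a sketch based on the submit file above: giving queue an integer schedules that many independent jobs, and $(Process) is interpolated into the file names so that each job writes its own output.

    # Schedule five copies of the job; $(Process) takes the values 0 to 4
    executable = hello.py
    getenv     = true
    output     = output/hello.$(Process).out
    error      = output/hello.$(Process).err
    log        = hello.log
    queue 5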
Submitting the job
A job is added to the HTCondor queue, and hence executed, using condor_submit.
On pplxint12, simply run the command:
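    condor_submit myjob.submit

condor_submit replies with the cluster number assigned to the job; this number identifies the job in the commands below.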
Memory and CPU estimates
The Condor batch system allocates one CPU core and 4 GB of memory per job by default. If your job requires more CPU cores or more memory, request them in the job submit file. For example, if your job needs 3 CPU cores and 8 GB of memory, add the following to the submit file:
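    # Request 3 CPU cores and 8 GB of memory for this job
    request_cpus   = 3
    request_memory = 8 GB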
Monitoring the job
The condor_q command prints a listing of all the jobs currently in the queue. For example, a short time after submitting myjob.submit from pplxint12, run:
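    condor_q

The exact layout depends on the HTCondor version, but the listing shows each job's owner, cluster and process IDs, submission time and status (for example idle, running or held).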
The queue might contain many jobs.
To see only your own jobs, pass the "-submitter" option with your Unix login to condor_q. The example below shows only davda's jobs:
Rather than monitoring the job by repeatedly running condor_q, use the condor_wait command:
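    condor_q -submitter davda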
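condor_wait watches the job's log file (hello.log in the submit file sketch above) and returns once the jobs recorded in it have finished:

    condor_wait hello.log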
Removing a job
Occasionally a successfully submitted job will need to be removed from the queue. Use the condor_rm command, specifying the job identifier as a command-line argument. For example, remove job number 70.0 from the queue with:
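    condor_rm 70.0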
Data transfer jobs
To submit data transfer jobs to the normal queue, see the example below.
The commands to transfer the files should be placed in a script, and the script must be executable. For example (the paths, hostnames and file names below are illustrative):
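A sketch of a submit file for such a job (the names transfer.submit and transfer.sh are illustrative):

    # transfer.submit - run the transfer script on a worker node
    executable = transfer.sh
    getenv     = true
    output     = output/transfer.out
    error      = output/transfer.err
    log        = transfer.log
    queue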
For internal:
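    #!/bin/bash
    # Copy a dataset between two areas of the departmental storage
    # (source and destination paths are examples only)
    rsync -av /data/myexperiment/input/ /data/myexperiment/copy/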
For external:
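    #!/bin/bash
    # Copy results to a machine outside the department over ssh
    # (username, hostname and paths are examples only)
    rsync -av results/ username@remote.example.org:/path/to/destination/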
More often, such jobs are used to transfer data to or from the Grid. To submit a job, prepare a script that transfers your data correctly, and add the following line after the "#!/bin/bash" line in your script:
The above command tells the Grid tools where to look for your Grid proxy credentials. The location of your Grid proxy credentials must be accessible to both the interactive machines and the worker nodes.
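    # Point the Grid tools at your proxy file; the path below is only an
    # example, use any location visible from both the interactive machines
    # and the worker nodes
    export X509_USER_PROXY=${HOME}/myproxy.pem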
Before you submit the job, you need to initialize your Grid proxy into the file indicated by the X509_USER_PROXY environment variable. The proxy initialization command varies from experiment to experiment.
To submit the job script, execute the following commands on pplxint12 or pplxint13:
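    # Initialise the Grid proxy; the exact command varies by experiment,
    # voms-proxy-init is shown only as a common example
    voms-proxy-init -voms <your VO>
    # Submit the transfer job (transfer.submit is the sketch above)
    condor_submit transfer.submit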
For very long jobs, you may need to refresh your Grid proxy periodically. The proxy normally lasts about 12 hours.
Useful links
- Instituto de Astrofisica de Canarias, How to use Condor: http://www.iac.es/sieinvens/siepedia/pmwiki.php?n=HOWTOs.Condor
- HTCondor Python Binding tutorial: https://htcondor-python.readthedocs.io/en/latest/htcondor_intro.html