Particle Physics Linux Batch Farm

For support, email the PP Unix admin team; this creates a ticket which is seen by the whole team.

The Basics

The main batch farm is a high-throughput cluster running Scientific Linux version 6 (SL6) with approximately 1 PB of network-attached storage, and is accessible through the "interactive" login nodes pplxint8 and pplxint9. The log-in instructions are here.

In total, the SL6 batch farm consists of several hundred CPU cores, on which the majority of non-grid analysis work is expected to run. It uses the Torque queuing system. If your jobs run on the Scientific Linux 6 batch system, you may also be able to use the particle physics grid; talk to us at pp_unix_adminATphysics.ox.ac.uk.

You may still run SL5 jobs on the larger SL6 batch system by submitting the job from pplxint8 or 9 and asking for it to run inside an SL5 chroot (see below). There is also an older, significantly smaller Scientific Linux 5 batch system, which is accessible from pplxint5 and 6. Many SL5 programs also still run on the grid.

The nodes are configured with 4 GB of RAM and approximately 50 GB of local scratch disk per CPU core.

Interactive login to the worker nodes is disabled.

Getting Started

The basic commands are qsub and qstat; their syntax can be checked from the man pages, e.g.

man qstat

Submitting a job

Your job needs to be started by a small script that first sets any environment variables and then runs the program. For example, a script file myjob:

####################FILE: myjob##################

#!/bin/sh
sleep 30
echo Job Done

####################FILE: myjob##################

could be submitted with the command qsub myjob. Make the script executable first by running

chmod +x myjob

Checking a job

Progress can be checked by typing

qstat -ans

Killing / Pausing a Job

To delete a job, use qdel nnn, where nnn is the number returned by the qstat command. The job may either be running or still in the queue.

If you have several jobs queuing and you feel it would be fair to let some other people's jobs get in before yours, you can hold jobs in the queue with the command qhold nnn and then release them later with the command qrls nnn. You can also submit low-priority jobs using qsub -q lowpriority [jobname.sh]. This may be the only queue accessible to non-particle physicists using the system.
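
For example, assuming 12345 is the job number reported by qsub or qstat:

qhold 12345                   # hold the queued job
qrls 12345                    # release it again later
qsub -q lowpriority myjob     # or submit at low priority from the start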

Other q commands may be useful, such as qalter, which changes the requirements of an already submitted but still queuing job.
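
For example, to raise the time request of a queued job to 24 hours (12345 is a placeholder job number):

qalter -l cput=24:00:00 -l walltime=24:00:00 12345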

Other useful commands

  • qstat -Q -f prints out details of all the queues.
  • qstat -q shows the total number of jobs running on all queues.
  • qstat -B pplxtorque03 gives a summary of the status of the cluster.
  • qstat -ans gives details of jobs running on the SL5 cluster.
  • pbsnodes -a gives a list of all the worker nodes and the jobs running on each. Nodes not running jobs should be listed as free.

Scheduling on the Cluster

The cluster uses the MAUI scheduler, which has a fair-share mechanism that favours users who have not run recently. In principle, if the system is busy with user A's jobs and user B submits jobs, B's jobs should run as soon as a node becomes free.

The default queue (the input queue) routes jobs to the veryshort, short or normal queues depending on the requested CPU time. The input queue assumes 168 hours if you do not specify a time requirement, which sends your jobs to the normal queue. CPU time limits at the time of writing are:

  • veryshort: up to 2 hours
  • short: up to 12 hours
  • normal: between 12 hours and 168 hours (i.e. a week). You can ask for even more time using the instructions below.

These may change if these defaults start to look less sensible. To get the current limits, use the command:


pplxint8# qstat -q

The queues also have different priorities, so jobs in the shorter queues will be scheduled before longer jobs when CPUs become available.

We also run several additional queues:

  • datatransfer:- Configured so that data can be transferred to/from the grid efficiently and without disrupting the day-to-day work of people using the interactive machines.
  • testing:- Use if the queues are busy and you have one or two quick test jobs to run (12 hours or less).
  • lowpriority:- Use if you can afford to wait for your job output, as this will allow more urgent jobs from other people to run.
  • jainormal and vsim:- These queues can be used by people in the JAI group. The jobs will be directed to nodes designed to process multi-core jobs efficiently. The vsim queue is limited to 64 cores, as that is the number of vsim licenses we have.

The way to use these queues is covered later on this page.

Job Output

At the end of each job, the torque system prints out a number of statistics about what your job used. When submitting tens of jobs, it can be useful to submit a representative test job first and pay attention to these statistics:

pplxint6%> tail my_job.sh.o123456
*********************************************************************
*
* Job Terminated at Mon Jun 25 10:12:01 BST 2012
*
* Job Used
*
* cput=00:00:00,mem=15000kb,vmem=193068kb,walltime=00:05:24
*
*********************************************************************

Time estimates

You can specify how much time your job requires either in the job script or on the command line. To ask for 100 hours on the command line use:

qsub -l cput=100:00:00 -l walltime=100:00:00 myscript

You can also include the line:

#PBS -l cput=11:50:00
#PBS -l walltime=11:50:00

in the script to ask for 11 hours 50 minutes.

Providing an estimate for the amount of CPU time that a job is likely to take may help your job start sooner. We prioritize shorter jobs over longer ones. Make your estimate conservative, as jobs which exceed the allowed length of the queue are terminated. If no time estimate is given, it is assumed that the job will last for one week.

Memory estimates

In addition to the time estimates above, it is advisable to give other resource estimates. We provision 4 GB of RAM per job. If your job requires more than 4 GB, you may request that your job uses multiple job "slots". Failure to do this may cause other jobs running on the same machine as yours, and/or your own jobs, to fail. Analysis of events using standard experimental software is unlikely to require additional "slots"; however, some MC simulation and the final-stage analysis in ROOT often can. Booking many histograms or allocating large arrays, vectors or maps tends to cause this.

To request up to 12 GB ram (3 slots for one job):

qsub -lnodes=1:ppn=3 twelve_gig_job.sh

The important parameter to get right is the ppn=3 value, not pmem as you might logically expect.
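
The same request can also be placed inside the job script as a PBS directive, for example:

#PBS -l nodes=1:ppn=3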

Reducing the load on the batch at peak times

We tend to find that the batch system is quite busy during weekdays but less so in the evenings and at weekends. As a good citizen, you might like to delay the execution of your job until the evening or the weekend. To do this, specify the earliest start time for your job,

e.g. to run it in the evening:

qsub -a 1800 jobscript.sh

To run a job any time after 18:00 this coming Friday:

qsub -a $(date +%Y%m%d -d "next Friday 18:00") jobscript.sh

Details on the additional queues

Low priority queue

If you have jobs which take a long time, or you can afford to wait for the output, please use the "lowpriority" queue. This will allow people with more urgent workloads to get faster results. Your jobs will run when the system is quieter.

To submit "background" jobs to the lowpriority queue, the queue name should be specified directly on the command line like so. Timing estimates can be included to improve the packing of jobs onto the cluster:

qsub -q lowpriority -lcput=1:00:00 -lwalltime=1:00:00 my_hourlong_job.sh

Testing queue

To run one or two quick test jobs of up to two hours each, please use the "testing" queue. If you want to run some intensive code for more than 5 minutes, this queue is preferred to running code interactively.
To use the "testing" queue, the queue name should be specified directly on the command line like so:

qsub -q testing myjob.sh

Data transfer queue

To submit jobs to the datatransfer queue, the queue should be specified directly on the command line with the parameters "-q datatransfer":

qsub -q datatransfer myjob.sh

The datatransfer queue is to be used for long, high-bandwidth copy operations. It should not be used to run CPU-intensive jobs. The transfer can be internal, e.g.

cp ${HOME}/WZ_Jan-Dec_2012 /data/atlas/big_datasets;

external:

scp -i ~/data_transfer_key myname@lxplus.cern.ch:bigfile /data/myexperiment/bigfile

Finally, the usual use case for this queue is to stage files in from, or out to, the grid. This case is more complicated. To submit a grid transfer job, prepare a script which is capable of transferring your data correctly. Add the following line to the top of the script:

export X509_USER_PROXY=${HOME}/.gridProxy

This points the grid tools to look for your grid proxy credentials in a location accessible both to the interactive machines and to the cluster nodes. Before you submit the job, you need to initialize your grid proxy into the file pointed to by the X509_USER_PROXY environment variable. The proxy initialization command varies from experiment to experiment. To submit the job script, you should therefore execute the following commands on pplxint5/6:

export X509_USER_PROXY=~/.gridProxy
voms-proxy-init vo.southgrid.ac.uk   (or lhcb-proxy-init, or atlas-proxy-init, or otherwise)
qsub -q datatransfer grid_transfer_job.sh

For very long jobs, you may need to refresh your grid proxy periodically; the proxy normally lasts about 12 hours. To refresh it, repeat:

export X509_USER_PROXY=~/.gridProxy

voms-proxy-init vo.southgrid.ac.uk (or lhcb-proxy-init, or atlas-proxy-init or otherwise)
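
As an illustration only, a minimal grid_transfer_job.sh might look like the sketch below; gfal-copy is just one possible transfer tool, and the source URL and destination path are placeholders, so substitute whatever your experiment actually uses:

####################FILE: grid_transfer_job.sh##################

#!/bin/sh
# Point the grid tools at the proxy created on the interactive machine (see above)
export X509_USER_PROXY=${HOME}/.gridProxy
# Placeholder transfer: copy one file from grid storage to the local data area
gfal-copy srm://some-storage-element.example.ac.uk/path/to/bigfile /data/myexperiment/bigfile
echo Job Done

####################FILE: grid_transfer_job.sh##################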

jainormal, jaipriority

These queues control priority access to the John Adams Institute (JAI) machines. Only people in the jai or l4a groups can run on the jai queues.
Submission to the jaipriority queue must additionally be approved by the JAI PIC code manager, currently Nicolas Bourgeois. Jobs submitted to these queues will be directed to nodes designed for efficient multi-core and multi-node processing.

How to run multi-core and multi-node jobs is covered in the "parallel processing" section.

Running SL5 jobs on SL6 batch machines

You can run code compiled under SL5 on our SL6 nodes. This is possible because we have also made a separate SL5 'chroot' environment available on the same CPU boxes. We aim to keep this working until SL5 goes out of support some time in 2017.

To use this feature, and take advantage of the many more cores available on the new batch system, specify that you want this mode when you submit the job, like this:

qsub -D/chroots/el5 yourjobscript

This submission command must be run on the SL6 interactive machines pplxint8 or 9, and it accepts all the same time and queue selection arguments as noted above. The advantage of doing this is that the SL6 batch system is much larger and may process large numbers of jobs faster than the SL5 system. Note that this will stop working some time in 2017, when SL5 finally goes out of support.
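
For example (yourjobscript is a placeholder name), to run an SL5 job in the testing queue with a one-hour limit:

qsub -D/chroots/el5 -q testing -l cput=1:00:00 -l walltime=1:00:00 yourjobscript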

Scratch Disks

If your job is likely to be I/O intensive, it may be better to copy data sets onto a local scratch area and work on them from there, rather than working directly over the network on the data disks.

Each worker node has a local scratch disk of several hundred GB, which is mapped onto the environment variable $TMPDIR for each job. This can be used while the job runs, but all contents are deleted when the job completes, so results stored here must be copied to the home or data disks at the end of the job.

For example, a job script could start by copying the program and data to the scratch directory:

################FILE: large_input_data_job ######################

#!/bin/sh
#PBS -l nodes=1
sleep 15
cd $TMPDIR
cp /userdisk/gronbech/myprogram .
cp /data/zeus/gronbech/mydata .
./myprogram
cp ./myresults /data/zeus/gronbech
echo Job Done

####################################################################

After completion, two files will be left in your login directory, large_input_data_job.oXX and large_input_data_job.eXX (where XX is the PBS job number), which contain the standard output and standard error from the job.

Diagnosing Issues

Problems getting the output back can be caused by your .login/.cshrc files trying to write to the screen; this breaks the rcp/scp operation used to copy your results back. You can avoid this by adding the following construct to your ~/.login file:

if (! $?PBS_ENVIRONMENT) then
    echo Starting .login file
    source /etc/group.login
    source cdf_local.csh
endif

This skips your normal set-up when running as a PBS job. An alternative is to check for an interactive shell by looking at the prompt and exiting if there isn't one; the following line could be added to your .cshrc file, for example:

if (! ($?prompt)) exit

You may also have issues if your home directory quota is full. Please email us to have it extended.

Parallel processing

The batch system is most efficient for serial workloads. However, the batch system integrates with several multi-core and multi-node processing libraries and utilities.

If your code uses multiple cores then, as a general rule, you need to tell the batch system how many cores to use. This is in addition to any changes you need to make in your own code to set the number of cores.

Anything from 1 to 16 cores can be used if the work is restricted to a single node.

To ensure that the batch system can allocate the required number of CPUs, use the 'ppn' (processors per node) parameter. This can either be in your job script or as an option to qsub:

Job description

#PBS -l nodes=1:ppn=16

As an option to qsub

qsub -l nodes=1:ppn=16

This will generally be accompanied by a corresponding change to your code to specify the required number of threads. Some common examples are below.

A very popular way to parallelize code is to use one of the *mpi utilities or OpenMP. In this case, all work will run on a single machine.

OpenMP

The preferred method is to set the number of threads with an environment variable in your job script. This way, you can easily use the PBS setting for ppn:

export OMP_NUM_THREADS=$PBS_NUM_PPN      (in bash)
setenv OMP_NUM_THREADS $PBS_NUM_PPN      (in (t)csh)

Alternatively, set the number of threads (e.g. 16) in C/C++ code with omp_set_num_threads(16).
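
Putting the pieces together, a minimal OpenMP job script might look like the following sketch (my_openmp_program is a placeholder for your own executable):

####################FILE: openmp_job.sh##################

#!/bin/sh
#PBS -l nodes=1:ppn=16
# Match the OpenMP thread count to the number of job slots allocated by the batch system
export OMP_NUM_THREADS=$PBS_NUM_PPN
# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR
./my_openmp_program
echo Job Done

####################FILE: openmp_job.sh##################

Submitting this with qsub openmp_job.sh keeps the ppn request and OMP_NUM_THREADS in step automatically.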

Open MPI

With MPI jobs, you typically will specify a node file and a number of processes to run in parallel.

mpiexec -f ${PBS_NODEFILE} -n ${PBS_NUM_PPN} script

Extra speed boosts from hyperthreading

Hyperthreading may be utilized to gain an extra 20-30% speed boost for highly parallel jobs. The settings for this depend strongly on the architecture we use, and on the fact that we effectively define the available "job slots" by the number of physical CPUs on the machine. The effect of hyperthreading is to reduce the effective speed of every thread running on the machine by about 40%, including threads belonging to jobs you do not own, although twice as many threads can be used in total. If you want to use hyperthreading, please request exclusive use of the nodes by requesting a node with 16 job slots. For efficiency reasons, please also ensure your code runs with at least 16 threads; the ideal number is 32 (see the OpenMP and Open MPI examples above):

qsub -lnodes=1:ppn=16 yourjob
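
Inside the job script, the corresponding OpenMP settings might look like this sketch (my_openmp_program is a placeholder; 32 threads assumes a 16-slot node with hyperthreading enabled):

#PBS -l nodes=1:ppn=16
# Run twice as many threads as physical cores to exploit hyperthreading
export OMP_NUM_THREADS=32
./my_openmp_program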

Multi-node execution

With the exception of JAI users, multi-node execution, i.e. utilizing more than 16 cores, should be approached with caution. Message exchange between nodes is not generally efficient, so closely 'coupled' parallel threads run across multiple nodes may be inefficient. Please discuss your requirements with us first.

For JAI workloads, up to 64 cores can be used on a single machine and multiple nodes can be used up to a total of, currently, 128 cores. The jainormal queue must be used for this.

To run parallel jobs over multiple nodes you will need to request multiple nodes in your batch scripts and to use a compatible launcher and message-passing interface. Currently a specially compiled "openmpi" toolset is available and integrated with the batch system. It is available as a module only (the system version does not integrate with the batch system).

In your job script, use module load openmpi/1.8.1
You can then run parallel jobs with message exchange using the mpirun command.
There is an example job script to get you started in /network/software/el6/examples/openmpi/batch
Read the README file first.
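
As a rough sketch only (the queue, node counts and my_mpi_program are placeholders; the example script and README mentioned above are the definitive reference), a multi-node job script might look like this:

####################FILE: multinode_mpi_job.sh##################

#!/bin/bash
#PBS -q jainormal
#PBS -l nodes=2:ppn=16
# Load the batch-integrated Open MPI build
module load openmpi/1.8.1
# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR
# Launch 32 MPI processes across the nodes allocated to this job
mpirun -np 32 --hostfile $PBS_NODEFILE ./my_mpi_program

####################FILE: multinode_mpi_job.sh##################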

Prioritization of jobs

The guiding principles for the batch system are:

  • People running a small number of short jobs are top priority.
  • Shorter jobs get priority over longer jobs.
    • This is partly due to the assumption that the impact of a one-day delay on someone running a week-long job is much less than that of a one-day delay on a two-hour job. We try to keep the start-time delays roughly in proportion to the specified length of the job; all delays increase in busy periods.
    • The second factor is that if the system is needed urgently for some out-of-band compute request or an urgent software update, we can clear out the system more quickly.
  • Large numbers of longer jobs may wait in the queue for a long time. For these 'bulk' workloads there are a number of other options outside of our cluster that would suit just as well.
  • In busy periods, weekends and evenings are often still relatively quiet, so you may find your longer jobs tend to run out of hours. If the system gets so busy that people with relatively small numbers of longer jobs can't get a look-in, we may look again at the priority for those with large numbers of shorter jobs.

There are several technical features of the batch system that we use to keep these priorities roughly in the above order; the relative weights and sizes are tuned based on changing usage patterns.

  • Lighter users have priority over heavy ones (automatic).
  • Shorter jobs have priority over longer ones.
  • Reservations of slots only for short/very short jobs and extra reservation for very short.
  • Limits on the total number of normal jobs people can run. Slightly larger limits on short/very short.
  • Special requests can be made to go outside of this system if there is a good reason.

Categories: Linux | PP | Particle | Unix | ppunix