Hydra, Theoretical Physics' compute cluster


Hydra: Some of the compute nodes

Hydra is a Linux-based, 10 TFLOP Beowulf cluster consisting of 113 dual-CPU, multi-core, Intel-based machines for people in Theory who need large amounts of CPU time. We have a total of 1,560 CPU cores on which to run programs, either MPI-based or normal "serial" programs. You can connect via SSH or RDP.
There are various scientific libraries/bits of software installed but you can always ask for whatever you want on there.
You can get an account if you're in the Theory department and your supervisor approves, so get in touch with them, or contact Jonathan Patterson (jop@astro.ox.ac.uk) directly for more info or to set up an account.


HARDWARE

The machine lives at the Begbroke Science Park and the OS used is CentOS (very similar to RedHat).
Each node is connected by standard gigabit ethernet for normal network tasks like NFS (over which your home directories are connected), and for MPI messages if you're running a parallel MPI program.
If you want to run a "normal" (i.e. non-MPI, non-OpenMP) program lots of times with different parameters (a lot of you do exactly that), then I have a program which can make that pretty easy - "multirun". Please email me for more info. It's easier than converting your current "normal" program to MPI.
We've got 30 TB of disk storage kept in a RAID 6 array, which is backed up off-site every day. We also have a distributed filesystem (30 TB) for people who need faster access to their files.

GPUs

Attached to this cluster is also the machine "tesla", which has an NVIDIA Tesla C2050 card for GPU work, and gpu1 and gpu2, which each have a GeForce GTX 1080 Ti (11 GB RAM). You can use these by connecting straight to them and running manually, e.g. "ssh tesla" or "ssh gpu1". CUDA is installed in /usr/local/cuda. There's a 66 GB area for temporary files on an SSD at /scratch. Let me know if you're missing something you need on these machines...
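As a quick sketch (myKernel.cu is just a placeholder file name), compiling and running a CUDA program on one of the GPU machines looks like this:
ssh gpu1
/usr/local/cuda/bin/nvcc -O2 -o myKernel myKernel.cu    # compile with the CUDA compiler
./myKernel                                              # runs on that node's GPU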

WHERE TO KEEP FILES

There are two storage areas on Hydra: your home directory, which is backed up, and /mnt/extraspace, the distributed filesystem, which is not backed up. /mnt/extraspace should be used when you have to write large amounts of data from multiple programs at the same time, as it should be less prone to getting overloaded than the home directories. Please don't keep important files in it though: it is not backed up (did I mention that already?!) and is slightly 'experimental', so it should not be trusted with your most important files. It's an 'overspill' area, or for when you have to write a lot, fast.
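For example (assuming a per-user directory under /mnt/extraspace - the layout and file names here are just an illustration), a common pattern is to write big output there and copy only the results worth keeping back to your backed-up home directory:
mkdir -p /mnt/extraspace/$USER                    # working area on the fast, un-backed-up filesystem
./myBigRun > /mnt/extraspace/$USER/run1.out       # heavy output goes to extraspace
cp /mnt/extraspace/$USER/run1_summary.dat ~/      # keep the important results in your home directory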

COMPILING

We've got two sets of compilers - the GNU compilers (gcc, gfortran, etc.) and the Intel compilers. Typing gcc, gfortran, etc. gets you version 4.4.7 - the one that comes as standard with our operating system (CentOS Linux). If you'd like a newer version, there are a few different versions available - "module avail" will give you the latest list.
There are also the Intel compilers, with various versions available. These can generate programs that run up to twice as fast in some cases, so they're worth trying ("module load intel-compilers", then icc, ifort, etc.).
If you're compiling an MPI program, the following commands will call the compiler, linking in the necessary MPI libraries:
mpicc, mpicxx, mpif90. These will use whichever compiler you have selected with "module load".
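For example (the source file names are just placeholders), a serial compile and an MPI compile might look like:
gcc -O2 -o myProgram myProgram.c                  # default GNU compiler
module load intel-compilers
ifort -O2 -o myProgram myProgram.f90              # Intel Fortran compiler
mpicc -O2 -o myMpiProgram myMpiProgram.c          # MPI wrapper around whichever compiler is loaded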
Feel free to email me (jop@astro.ox.ac.uk) if you have any problems compiling - I'm always helping people find the right libraries and get things compiled ok. A little struggling with getting things to compile yourself is character-building and may teach you something, but please feel free to get me to help when you get stuck - you don't want to be wasting time trying to compile things when there's Science to be done, right?

LIBRARIES


There are lots of libraries/programs available. We use the "Environment Modules" software to manage which ones you want to be available, so:
module avail - lists the software modules available
module list - shows which ones you have loaded
module load moduleName - loads a module
  e.g. "module load intel-compilers" selects the Intel compilers; "module load fftw/2.1.5" would load version 2.1.5 of the FFTW libraries, and just "module load fftw" would load the default version (3.3.4)
module unload moduleName - removes a module
module initadd/initrm/initlist - adds/removes/lists the modules loaded each time you log in, so if you use something all the time, add it with initadd
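A typical session, using the module names from the examples above, might look like this:
module avail                     # see what's installed
module load intel-compilers      # select the Intel compilers
module load fftw/2.1.5           # and a specific FFTW version
module list                      # check what's now loaded
module initadd intel-compilers   # load it automatically at every login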

The MKL libraries are available, for fast BLAS and LAPACK routines (module load intel-compilers). These libraries are in $MKLROOT, and there's a webpage here to help you choose which ones to link against.
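As a rough sketch with the Intel compilers, the simplest link line uses the -mkl flag (the link-advisor webpage above covers the more specific library combinations; myProgram.c is a placeholder name):
module load intel-compilers
icc -O2 -o myProgram myProgram.c -mkl      # pulls in MKL's BLAS/LAPACK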
Numerous other libraries are available - run "module avail" to list them and see if the library you want is there; if not, I can install it.
Some of the software we have available is: AIPS, ATLAS, CFITSIO, FFTW, HDF5, GSL, IDL, MATHEMATICA, MATLAB, OBIT, PYTHON
Feel free to email me (jop@astro.ox.ac.uk) if you have any problems compiling.

MPI libraries


MPI is "Message Passing Interface" - library functions for sending & receiving messages between processes. Typically, programs use these to exchange information about which process should be doing what to what - distributing the work.
So if your program uses this, you will need to have one of the MPI libraries selected - either "mpi" or "mpi/intel" if you're using the intel compilers, for example.
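Putting it together, a sketch of compiling and submitting an MPI program (myMpiProgram.c is a placeholder name; addqueue is described in the job queue section below) would be:
module load mpi                               # or mpi/intel with the Intel compilers
mpicc -O2 -o myMpiProgram myMpiProgram.c
addqueue -c "2 hours" -n 2x8 ./myMpiProgram   # 16 MPI processes spread over 2 nodes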

THE JOB QUEUE

We use the slurm queueing system, with some extra queueing/display software I've written which runs on top of it in an effort to make sure everyone gets a fair share of time on the cluster.
There are 3 queues you can submit jobs to - "short", "long" and "bigmem". You can select the queue with the "-q" option to addqueue, eg. "addqueue -q bigmem".
The short queue is for quick test runs and the long queue is for everything else. Anything running on the short queue for >2hrs will be stopped if another person's job is waiting to run on it.
The bigmem queue is for jobs that need few cores but lots of memory (roughly > 8 GB per process). Currently this has just one node in it - a 20-core node with 126 GB RAM. This queue exists because a job in the long queue which has asked for a large amount of memory can wait a very long time for a suitable machine to become free, and this can hold up other jobs behind it in the queue. So if you have big memory needs, please use this queue first.
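For example (myBigMemoryJob is a placeholder name; the -m memory flag is described below), a bigmem submission might look like:
addqueue -q bigmem -c "1 day" -m 32 ./myBigMemoryJob    # ask for 32 GB of RAM on the bigmem node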
The main ("long") queue is for everything else. It is interleaved so that people who have used it the least have more jobs at the top of the queue than people who have used it the most. That way, everyone should be able to get something running, and the smaller users should not be pushed out of the queue by the biggest users.
There's also another rule - there is a maximum number of CPU cores that any one person can use at a time. I vary this depending on how busy the cluster is.
It's all a little complicated, and has evolved over the years to try and keep things as close to "fair" as is practical.

MEMORY


Jobs are allocated 0.1 GB of RAM unless you ask for more with the -m flag to addqueue (see below). If you exceed the memory limit, you'll get a message in the job's output file to say so, though sometimes it will just say "Killed.". You can check how much memory it used with the q command.

SUBMITTING YOUR JOB TO THE QUEUE


You can use "q -tw 1" to list the compute nodes and how many cores / how much memory they have free right now, which tells you what kind of job it's possible to run straight away.
addqueue -c "runtimeEstimate or comment" ./myProgramName myParameters
You can just type "addqueue" to get the different options, but here are some common ones:
eg.
addqueue -c "2 days" -n 2x8 ./doSomeAnalysis
will submit your program "doSomeAnalysis" to the default queue, requesting 2 compute nodes with 8 cores on each (so 16 processes are started). You can also use the runtimeEstimate to add your own comments about the job - parameter numbers, maybe, to keep track of which job is which.
addqueue -c "1 week" -m 3 ./doSomeAnalysis
would run a single-process (non-MPI) job with 3GB RAM allocated to it
OR
addqueue -c "1 week" -n 31 ./doSomeAnalysis
would submit a 31-core job to the queue with no mention of how many cores to run on each compute node - so they'll be allocated wherever the queue sees fit. It only really matters how the cores are allocated if your job uses a lot of MPI communication, in which case you'll want as many of them as possible running on each compute node because if the MPI traffic has to go over the ethernet it will be quite slow.
Once you've submitted it you'll see the job number which was allocated to it.
OpenMP
If you are using OpenMP then, to make sure your program runs happily alongside others, you should use the -s option for addqueue, as below. This will set the OMP_NUM_THREADS environment variable so that OpenMP uses the correct number of threads.
addqueue -s -c "test" -n 1x8 myProgram myProgramArguments
which would tell OpenMP to use 8 cores.
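As a sketch (assuming a C source file myProgram.c that uses OpenMP), the full compile-and-submit would be:
gcc -fopenmp -O2 -o myProgram myProgram.c                      # build with OpenMP enabled
addqueue -s -c "test" -n 1x8 ./myProgram myProgramArguments    # OMP_NUM_THREADS will be set to 8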
If in doubt, feel free to ask me.

Mathematica, Matlab
addqueue -n 1x3 -l -s /usr/local/shared/mathematica/10.1/bin/math -script scriptname.m
would reserve 3 cores on the same compute node, and run only 1 instance (-s) of Mathematica at a reduced priority (-l), telling it to run your script scriptname.m.
With Matlab specifically, I've seen cases where using parfor, or letting Matlab use all the cores available, actually slows the code down, or at best runs it at the same speed. So it's worth trying the -singleCompThread option to Matlab. If it runs at the same speed as without it, then please use it; otherwise Matlab will use more cores than it needs to. Let me know if you'd like some help with this.
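For example, a single-core Matlab job might be submitted like this (the name/path of the matlab binary on Hydra may differ - ask me or check with "which matlab" - and myScript.m is a placeholder):
addqueue -n 1x1 -l -s matlab -nodisplay -nosplash -singleCompThread -r "run('myScript.m'); exit"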

CHECKING YOUR JOB

The command q will list your jobs in the queue.


If you have X-windows forwarding enabled, or are using an RDP connection to Hydra, then you will get a GUI showing your jobs.
There are various buttons to play with, and the main list of jobs shows all sorts of job info, including current usage stats - how many cores, how much memory, how much disk I/O, etc. each job is using right now.


You can then double-click on a job to show the job's output file, or right-click to get various options, e.g. show job statistics, cancel job, etc.
If you select "show job stats", you will get a window showing CPU, memory and disk I/O usage, and you can click on the icons there to graph these over time. You can use this to check that your job ran efficiently, had enough memory, etc.

If you don't have X-windows on, or you type "q -t", then you'll get a text-only version.
You can check the output (stdout, stderr, or "what would have been printed to the screen") of your job by typing:
q -s jobNumber
or, for a text-only version:
showoutput jobNumber
You can stop your job with scancel jobnumber at any time if you want, or via the q GUI interface.
You can also do scancel -u myUserName to cancel all your jobs.
If your job runs longer than you estimated, please update your estimated runtime with the comment command, e.g.
comment jobNumber "New comment"
Feel free to log in to the compute node running your job and see how it's doing. Have a look at which compute nodes it's running on and then (to see node 12, for example):

ssh comp12
top
exit

If your job sits in the queue waiting for a long time

You can use the command "q -tw jobNumber" to show information on what your job is asking for, which compute nodes match that spec, and how many of those are currently free. This is a good thing to run if your job hasn't started after many minutes, to see if you're asking for too many resources (and so would have to wait a really long time for the job to start), or to have a guess at when it may start running. You can also run it with a job number of 1, which will show the current state of the nodes - how many cores and how much memory they have free. You could use this to see what it's possible to run before you even submit a job.

So that's the basic info for Hydra. In practice, people have problems compiling, running, and things break sometimes, so please feel free to email me (jonathan.patterson@physics.ox.ac.uk) if you have any questions / think something is wrong.
