Running Computing Jobs: Why?

Why jobs must be run nice'd

CPU cycles are a limited resource, and all modern operating systems permit multiple processes (i.e. programs) to run simultaneously. The operating system manages this by dividing CPU cycles into time slices and allocating these to programs according to a system of priority.

The OS determines the priority at which each process runs, based on the resources that the process appears to be requesting. All processes are allocated time in this way, whether they belong to the system or to a particular user. A numerical job that is CPU-bound will always be requesting more CPU time, so by default it competes at full priority with every other process that wants CPU time, and the OS will generally give each of these processes an equal share.

This becomes problematic if very many numerical jobs are running on a machine, all at maximum priority, because the amount of time allocated to any one process becomes very small. If one of those processes is a desktop user's web browser, that browser will slow to a crawl. If it is the system process responsible for paging unused chunks of memory from physical RAM to disk, the whole system will slow to a crawl.

To avoid this, the OS provides a facility for users to modify the priority of their processes. Typically, a user can only modify the priority downwards. To raise the nice value (i.e. reduce the priority) of a task, prefix its command with the 'nice' command. By default this launches the task with a nice increment of 10, i.e. at a priority 10 points lower than it would otherwise have. The priority of processes that are already running can be lowered with the 'renice' command.
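As a minimal sketch of the two commands, assuming a POSIX-style shell and the common GNU or BusyBox behaviour in which 'nice' with no arguments prints the current nice value. The 'sleep' and the inner 'nice' are stand-ins for a real job, and 'job_pid' is just an illustrative variable name:

```shell
# Print the nice value of the current shell (0 unless it has been changed).
nice

# Prefix a command with 'nice' to start it with the nice value raised by 10.
# The inner 'nice' is a stand-in job that only reports the value it runs at;
# in practice it would be your own program and its arguments.
nice nice

# The increment can also be given explicitly; this is equivalent to the above.
nice -n 10 nice

# Lower the priority of a job that is already running.
sleep 30 &                     # stand-in for a long-running numerical job
job_pid=$!
renice -n 10 -p "$job_pid"     # raises its nice value (to 10 if it started at 0)
kill "$job_pid"                # clean up the stand-in job
```

Some 'renice' implementations treat the value after '-n' as an absolute nice value and others as an increment; for a job started at nice value 0 the result is the same either way.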

In Theoretical Physics, we ask all users to run their numerical tasks at nice increment 10. This is most easily accomplished by prefixing every job command with 'nice'. This not only prevents numerical jobs from crowding out desktop jobs or important system processes, it also prevents some numerical jobs from running at the expense of others, because they will all be running at equal priority.

Why you should not use more than half the system RAM on any machine*

Running out of system RAM is a bad idea: the computer will attempt to free up physical RAM by paging segments of memory to disk, and it will slow down dramatically as the disk interface effectively becomes the system memory bus.

Since the maximum number of jobs we allow on a machine is equivalent to a full CPU load from two users, the simple calculus is that any one user should use at most half the system RAM. You can check how much memory your processes are using, and how much the machine has, with the 'top' command.
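The same comparison can be scripted. The sketch below assumes a Linux machine (it reads /proc/meminfo) and a 'ps' that supports the 'rss' output column; the function names are made up for illustration. Summing resident set sizes counts shared pages once per process, so treat the result as an upper estimate.

```shell
# Print half of the machine's physical RAM in MiB (MemTotal is reported in kB).
half_ram_mib() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 2 / 1024 }' /proc/meminfo
}

# Print the summed resident memory (RSS, reported by ps in kB) of all your
# processes, converted to MiB.
my_rss_mib() {
    ps -u "$(id -un)" -o rss= | awk '{ kb += $1 } END { printf "%d\n", kb / 1024 }'
}

budget=$(half_ram_mib)
used=$(my_rss_mib)
echo "Your processes use about ${used} MiB; your share is ${budget} MiB."
if [ "$used" -gt "$budget" ]; then
    echo "Over half the system RAM: reduce usage or contact support." >&2
fi
```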

Please note that it is still entirely possible to adversely affect a system while remaining within this limit, e.g. if more than two users each decide to use half the system RAM. If you are running memory-hungry jobs, it is your responsibility to make sure that they behave properly and that the system has sufficient free RAM to run them. The 'top' command will help you.

* without contacting support first. If you have a particularly memory-hungry job, we will try to accommodate you, but you must speak to us first.

Why you should not run more than one job per core

To prevent large numbers of jobs from degrading the performance of a machine, it should not be running many more jobs than it has cores to process them. A single user is therefore forbidden from running more concurrent jobs on a particular machine than that machine has cores.
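One way to stay within this limit when launching a batch of runs is to let 'xargs' cap the number running at once. This is a sketch only: it assumes 'nproc' (GNU coreutils) and an 'xargs' with the '-P' option (GNU and BSD), the parameter values are invented, and 'echo' stands in for the real program. It does not account for jobs you already have running on the machine.

```shell
# Number of cores available on this machine.
cores=$(nproc)

# One hypothetical parameter value per job; xargs starts one command per value
# and keeps at most $cores of them running at the same time, each launched
# with a nice increment of 10.
printf '%s\n' 0.1 0.2 0.3 0.4 0.5 0.6 |
    xargs -n 1 -P "$cores" nice -n 10 echo "running job with parameter"
```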

Why you should not submit jobs to a machine already running 2 jobs per core

People expect a reasonable level of performance from the machines to which they submit their jobs. There is a compromise to be struck between this performance expectation, and the fact that our computing resources are limited and have to be shared.

We think that each job should be able to expect at least 50% of a core's CPU time on any machine to which it is submitted. Therefore, you may not submit jobs to a machine if doing so would cause the CPU usage of already-running jobs to drop below 50%, i.e. do not submit jobs to a machine that is already running 2 jobs per core.

You can use the 'top' command to see how much CPU time running jobs have on a particular machine.
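A rough pre-submission check can be scripted along these lines. It is a sketch under assumptions: it uses the 1-minute load average from /proc/loadavg (Linux) as a proxy for the number of running jobs, 'nproc' for the core count, and a hypothetical helper called 'may_submit'. The load average also counts processes waiting on disk, so 'top' remains the better guide.

```shell
# Succeed if LOAD + NEW_JOBS stays at or below 2 jobs per core.
# usage: may_submit LOAD CORES NEW_JOBS
may_submit() {
    awk -v load="$1" -v cores="$2" -v new="$3" \
        'BEGIN { exit !(load + new <= 2 * cores) }'
}

load=$(cut -d ' ' -f 1 /proc/loadavg)   # 1-minute load average
cores=$(nproc)

if may_submit "$load" "$cores" 1; then
    echo "Load $load on $cores cores: room for one more job."
else
    echo "Load $load on $cores cores: already at 2 jobs per core, try another machine."
fi
```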
