Running multiple MATLAB jobs at the same time

When you submit multiple MATLAB jobs in quick succession via a queueing system, and the jobs use the parpool function for parallel calculations, you may intermittently encounter errors such as:

“Failed to start pool. Error using save. Unable to write to MAT-file mnt/zfsusers/orra/.matlab/local_cluster_jobs/R2018a/ The file may be corrupt.”

“Failed to locate and destroy old interactive jobs.”

“The storage metadata file does not exist or is corrupt.”

Why does this happen? MATLAB tries to assign each job a unique job number and settings folder, but when several jobs launch close together this assignment can race: two jobs end up sharing the same folder, write to the same temporary .mat file, and corrupt it. To avoid this, give each job its own job storage folder explicitly. The easiest way is to name the folder after the job ID, which the scheduler stores in the environment variable $SLURM_JOB_ID. For example:

cl = parcluster();  % create a cluster object
slurmid = getenv('SLURM_JOB_ID');  % get the unique job ID as seen by Hydra
storage_folder = fullfile('/mnt/zfsusers/your_username/matlabtemp', slurmid);
mkdir(storage_folder);  % make a temporary folder for this session
cl.JobStorageLocation = storage_folder;  % and tell the cluster to use it
parpool(cl);  % create the parallel pool for this cluster
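Once the pool is running, it can be used as normal. A minimal sketch of what might follow (the parfor body here is purely illustrative, and the cleanup step at the end is a suggested addition rather than part of the recipe above):

```matlab
% assumes cl, storage_folder and the pool created in the snippet above
results = zeros(1, 8);
parfor i = 1:8
    results(i) = i^2;  % any per-worker computation goes here
end

% once the job is finished, shut down the pool and remove the
% per-job storage folder so stale metadata does not accumulate
delete(gcp('nocreate'));
rmdir(storage_folder, 's');
```

Cleaning up the per-job folder at the end of each batch script keeps matlabtemp from filling with one subfolder per submitted job.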

Categories: MATLAB