In a previous section, we explained that Blue Crab is a shared resource accessible only via login nodes. Users connect to the login nodes, prepare their calculations, and submit batch jobs to a scheduler. That scheduler is called SLURM, the Simple Linux Utility for Resource Management. SLURM controls the compute nodes and intelligently schedules a vast array of highly heterogeneous calculations in order to maximize utilization (how much of the machine is active at one time), minimize wait times, and allocate the correct resources to each user. SLURM also accounts for the SUs consumed by each research group.
In this section, we will prepare and submit a SLURM job and learn how to choose the correct resources for future jobs.
A simple job
Recall the simple “Hello, World!” bash script from the previous section. In this section we will submit that job to the compute cluster. Create a new script using vi script.sh and add the following text. We include the .sh suffix to indicate that this is a shell script, however this is not strictly necessary.
#!/bin/bash
#SBATCH -p express
#SBATCH -c 1
#SBATCH -t 0:5:0
set -e
echo "Hello, World!"
sleep 30
echo "Goodbye!"
date
Save and close the file. Each line has a special meaning.
- The first line is the hashbang we described earlier. Most SLURM jobs are bash scripts, however you can use other formats.
- The next three lines are SBATCH directives. The preceding hash character (#) marks them as comments, which means that bash will ignore them. SLURM, however, will use them to allocate resources. Syntax is very important: you cannot have any spaces until the end of the directive.
- We have included the set -e command, which causes the script to exit on error. This is helpful later: if our script crashes, it will raise an error code in SLURM. Without set -e, bash will simply move to the next line and the job will appear to complete successfully.
- The remainder of our script calls several programs, namely the echo, sleep, and date commands, which print to the terminal, wait for a while, and report the date, respectively.
SLURM flags
Each of the SLURM flags above has a very specific meaning. You can use the short or long version of each flag, for example -p express is equivalent to --partition=express.
The partition
The -p or --partition flag specifies which “queue” or “partition” services the job. With only minor exceptions, these partitions represent distinct physical components on the machine. We have selected the express partition, which is also the default if you omit this flag. This is a high-availability partition with low resource limits to prevent a single user from dominating the hardware. It should have very little wait time, and it helps users to keep their minor calculations and post-processing work off of the shared login nodes.
The number of cores
We have selected a single core (-c 1) for this job. Many codes can use multiple cores or nodes. In the text below we will explain the best way to select the number of cores. For now we only need one.
The time limit
We do not expect that this script will take very long, hence we have selected 0 hours, 5 minutes, and 0 seconds, i.e. 0:5:0. We always request slightly more time than necessary, otherwise SLURM will prematurely terminate our job.
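SLURM accepts the time limit in several formats, including plain minutes, hours:minutes:seconds, and days-hours:minutes:seconds, so the following directives all request the same five-minute limit.
#SBATCH -t 5
#SBATCH -t 0:5:0
#SBATCH -t 0-0:5:0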
Submitting the job
To execute this script, use the following command.
$ sbatch script.sh
Check the status of the job with squeue -u $USER or, more simply, sqme. First the job will report a state (ST) that is pending (PD) and after a moment it will be running (R). About 30 seconds later it will be absent from the list. If you ls this directory, you will see an output file.
If you cat the log file, you can see the results of our calculation:
$ cat slurm-35765896.out
Hello, World!
Goodbye!
Sat Aug 10 20:32:58 EDT 2019
Note that the long number in the name of the log file, and in the squeue output above, is the SLURM job ID number, which uniquely identifies your job.
Blue Crab runs tens of thousands of jobs per day. Some are small, running on a single core, others may use up to 64 nodes at once.
In addition to submitting jobs with sbatch, you may also use scancel N to cancel job number N, sinfo to see the current status of all partitions, squeue -p <partition> to list all jobs for a particular partition, and sacct -u $USER to show completed jobs. Note that the variable $USER will be replaced with your username.
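For example, the following commands use the job ID from the log file above and the shared partition purely as illustrations.
$ scancel 35765896
$ sinfo
$ squeue -p shared
$ sacct -u $USER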
SLURM has many more features which are discussed at length on our website and in the documentation.
You may also add the following extra flags to your #SBATCH directives.
- #SBATCH --mail-type=all and #SBATCH --mail-user=userid@jhu.edu send an email at the beginning and end of each job (or specify either begin or end instead of all)
- #SBATCH --job-name=My-Job_Name sets a job name
- #SBATCH --account=[account] sets the account if you have multiple research groups associated with your username
- #SBATCH --nodes=1 sets the number of nodes for a multi-node job, typically on the parallel partition
- #SBATCH --ntasks-per-node=24 sets the number of MPI tasks for each node, only for use with MPI-enabled codes
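As a sketch, a script combining several of these optional flags with our earlier example might look like the following; the job name, email address, and account are placeholders that you should replace with your own.
#!/bin/bash
#SBATCH -p shared
#SBATCH -c 1
#SBATCH -t 1:0:0
#SBATCH --job-name=hello-world
#SBATCH --mail-type=all
#SBATCH --mail-user=userid@jhu.edu
#SBATCH --account=my-group
set -e
echo "Hello, World!"
date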
Choosing resources
The example above was easy. Choosing the right resource for your calculations may be significantly more difficult. To select the right resource for your calculation, you should answer the following questions.
- Is my calculation MPI-enabled?
- Is my calculation threaded or OpenMP-enabled? Can I make use of multiple cores at once?
- Does my calculation have any memory requirements?
- How long will my calculation take to complete?
- How well does my MPI or OpenMP code scale across resources?
- How many SUs do I have?
The answers to these questions will help determine which partition you should use. The partitions are summarized in the following table.
| Partition | Default/Max Time (hours) | Default/Max Cores per Node | Default/Max Memory per Node | Serial/Parallel | Backfilling | Limits |
| --- | --- | --- | --- | --- | --- | --- |
| express (default) | 1 hr / 12 hrs | 1 / 6 | 3583 MB / ~86 GB | Serial (multi-thread) | Shared | 1 node per job |
| shared | 1 hr / 72 hrs | 1 / 24 | 4916 MB / ~117 GB | Serial (multi-thread) | Shared | 1 node per job, 300 cores total |
| parallel | 1 hr / 72 hrs | 24 / 24 | 4196 MB / ~118 GB | Parallel | Exclusive | No limits |
| gpuk80 | 1 hr / 18 hrs | 1 / 24 | 4916 MB / ~118 GB CPU, 20 GB GPU | Serial, Parallel | Shared | 4 GPUs / node |
| gpup100 | 1 hr / 12 hrs | 1 / 24 | 4916 MB / ~118 GB CPU, 24 GB GPU | Serial | Shared | 2 GPUs (1 node) per user |
| lrgmem | 1 hr / 72 hrs | 1 / 48 | 120 GB / 1014 GB | Serial, Parallel | Shared | 5 nodes per job |
| scavenger | 6 hours | 1 / 24-28 | 4916 MB / ~118 GB | Serial, Parallel | Shared | 5 nodes per job |
| skylake | 1 hr / 72 hrs | 1 / 24 | 3583 MB / ~86 GB | Serial (multi-thread) | Shared | 1 node per job |
| unlimited | 1 hr / unlimited | 1 / 24 | 4916 MB / ~118 GB | Serial, Parallel | Shared | Not guaranteed to complete |
Compiling code
Do not use express or skylake to compile code. The newer architecture will add special features to your code which are not supported on the vast majority of the machine. Similarly, since our lrgmem partition has a slightly older architecture (Ivy Bridge), all code that must run on lrgmem should also be compiled there.
How to choose a partition
Here is a short guide to choosing a partition. Note that the nomenclature is somewhat overloaded: you can run a parallel, or more accurately multi-threaded, calculation on the shared partition. The shared partition can host exclusive jobs in which one user uses an entire node, but users may also request a portion of a node, thereby sharing it with other users.
The name of the parallel partition refers to the fact that most jobs use multiple nodes in parallel and therefore require MPI-enabled code and the high-speed InfiniBand network. Note that a single-node, exclusive job is one of the few cases that can run on either shared or parallel, and this type of calculation is extremely common.
- Most jobs can run on the shared, express, and skylake partitions.
- Do not request more than one core if your code does not include multi-threading or parallelism via OpenMP, MPI, or a manually-compiled parallel toolkit. If you use MPI, you should use mpiexec, and if you use OpenMP (i.e. threads), then you should set the bash variable OMP_NUM_THREADS to the right number of cores (see the sketch after this list).
- An exclusive, one-node, 24-core job can run on both shared and parallel.
- If your job requires more than one node, and does not require excessive memory or a GPU, then you should use the parallel partition.
- If your job requires more than one node, it must be an MPI-capable code. If you’re not sure whether your code is MPI-capable, then it’s probably not. If you have an MPI-capable code, use --ntasks-per-node instead of --cpus-per-task to request more cores.
- If you require more than 118 GB of memory, you must use the large-memory lrgmem partition. Otherwise, if your job requires up to 117992 MB, then you should request a single node on shared or parallel. We will discuss memory usage in more detail below.
- If your job requires only one GPU, then you should not request multiple GPUs. All requests for the GPU partitions should include the gres flag (see below). Since the V100 and P100 GPUs are scarce, almost all GPU requests should be sent to gpuk80.
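As a minimal sketch of the threaded case, assuming a multi-threaded program called my_threaded_app (a placeholder name), the following script requests six cores on shared and sets OMP_NUM_THREADS to match the allocation via the SLURM_CPUS_PER_TASK variable.
#!/bin/bash
#SBATCH -p shared
#SBATCH -c 6
#SBATCH -t 2:0:0
set -e
# match the thread count to the number of cores SLURM allocated
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_threaded_app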
MPI jobs
The name for an individual process in a Message Passing Interface (MPI) code is a “task”. MPI is required for all communication between nodes on our cluster.
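As a sketch, assuming an MPI-enabled executable called my_mpi_app (a placeholder name), a two-node job on the parallel partition might look like the following; load whichever MPI module your code was built with before the mpiexec line.
#!/bin/bash
#SBATCH -p parallel
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH -t 12:0:0
set -e
# load your MPI module here, then launch one task per requested core
mpiexec ./my_mpi_app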
Requesting GPUs
The gpuk80, gpup100, and gpuv100 partitions require that you request a certain number of GPUs for each node using the following directive. In the following example we will request 4 GPUs.
#SBATCH --gres=gpu:4
Do not request multiple GPUs unless you are absolutely certain that your calculation can use more than one GPU! Many GPU codes also use MPI, in which case you should almost always set the number of tasks per node equal to the number of GPUs per node that you are using, and only use threading to consume the extra cores via --cpus-per-task. Some codes alternately use the maximum number of MPI tasks (i.e. one for each processor on the node), set --cpus-per-task=1, and rely on the GPU code to correctly interact with the GPUs. This may be specific to each application.
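As a sketch of the first pattern, assuming a GPU- and MPI-enabled executable called my_gpu_app (a placeholder name), a single-node job with two GPUs on gpuk80 might pair one MPI task with each GPU and use threads for the remaining cores.
#!/bin/bash
#SBATCH -p gpuk80
#SBATCH --gres=gpu:2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=12
#SBATCH -t 6:0:0
set -e
# one MPI task per GPU; threads consume the remaining cores on the node
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpiexec ./my_gpu_app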
Memory
You may have noticed that the memory limits above are awkward. This is by design. Our compute nodes are disk-less, which means that the operating system resides in memory. We allocate 10GB for the operating system.
More importantly, users should almost never need to make a special memory request with SLURM’s --mem flag. We ensure that we always allocate the maximum available memory for each core that you request. SLURM will charge your account for the number of SUs equivalent to your memory request, even if you use only a single core. If you request --mem=8GB and a single core, you will be charged instead for 2 cores, because each core typically has ~5GB of memory associated with it.
If you have a single-threaded calculation with a large memory requirement, then you should request memory with e.g. #SBATCH --mem=50GB. This will reserve about half of one node.
If you have a multi-threaded application, it is best to request cores to match your memory requirement so that you can also make use of the extra cores. Here is an example. Imagine that you have a multi-threaded code that can use many cores. You know that it requires between 20-25 GB and you wish to optimize the number of SUs you spend. In that case, you should request 5 cores on shared because 5 x 4916 MB = 24580 MB, which is just under 25 GB. Some simple arithmetic can help you optimize your core count and SU consumption.
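To check how much memory a completed job actually used, and thereby tune your core count, you can ask sacct for the MaxRSS field; the job ID below is the one from the example log above.
$ sacct -j 35765896 --format=JobID,AllocCPUS,MaxRSS,Elapsed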
Overriding a SLURM script
You can pass SBATCH directives as flags directly on the sbatch command line to override the directives in your script. In this example, we are overriding the time limit.
$ sbatch --time=1:0:0 script.sh
You can also omit #SBATCH directives entirely and submit standard bash scripts to SLURM as long as you include all of the required flags in your sbatch call.
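For example, assuming a plain script named plain.sh with no directives (a placeholder name), you could supply every flag on the command line.
$ sbatch -p shared -c 4 -t 2:0:0 --job-name=no-directives plain.sh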
Benchmarking
Since MARCC offers its users only a finite amount of hours (SUs), and many codes are not guaranteed to perform twice as fast with twice as many nodes or cores, we strongly recommend that you benchmark your calculations to find the most efficient set of resources. The more efficient your calculation, the more calculations you can do with the same allocation.
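As a sketch of a simple scaling test, assuming a script called bench.sh (a placeholder name) that runs the same threaded calculation regardless of the core count, you could submit it several times with different core counts and later compare the wall-clock times with sacct.
# submit the same job with 1, 2, 4, 8, and 16 cores
for n in 1 2 4 8 16; do
    sbatch -p shared -c $n --job-name=bench-$n bench.sh
done
# after the jobs finish, compare elapsed times against core counts
sacct -u $USER --format=JobID,JobName,AllocCPUS,Elapsed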