Slurm job scheduler

At Abacus we use Slurm for scheduling jobs.

In general, Abacus is intended to be used for batch jobs, i.e., jobs which are run without user intervention. It is also possible to run interactive jobs as described later.

A typical usage scenario is the following:

  1. The user logs in to fe.deic.sdu.dk.
  2. A previous job script is edited with new parameters.
  3. The job script is submitted to the job queue.
  4. The user logs out.
  5. Later, after the job has completed, the user logs in again to retrieve the result.

Job scripts, as mentioned in Steps 2 and 3, contain both details of which computer resources are needed (number and types of nodes, etc.) and details of which application should be run and how (name and version of the application, input and output, etc.).

General commands

You can use man to get further documentation on the commands mentioned later:

testuser@fe1:~$ man COMMAND

Try the following commands:

testuser@fe1:~$ man sbatch
testuser@fe1:~$ man squeue
testuser@fe1:~$ man scancel

Accounts

To see which accounts are available to you, including how many node hours are available, use the command abc-quota:

testuser@fe1:~$ abc-quota

Available node hours per account/user
=====================================

Account/user |   Quota   Avail | UsedPeriod   % of Qt | UsedMonth
------------ + ------- ------- + ---------- --------- + ---------

test00_gpu   |   2,000   1,220 |        780    39.4 % |       650
 otheruser   |                 |         80     4.4 % |        50
 testuser *  |                 |        700    35.0 % |       600

...

In this case, testuser can use the account test00_gpu. Within this accounting period, the user testuser has used 700 node hours, and the test00_gpu account has used 780 node hours in total. 1,220 node hours are still available. As shown in the column UsedMonth, most of these node hours have been used during this month.
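As a quick check of how the columns relate, the figures in the table can be reproduced with a little shell arithmetic. This is only an illustrative sketch: the variable names are made up here, and the numbers are copied by hand from the example output rather than read from abc-quota.

```shell
#!/bin/bash
# Figures copied by hand from the abc-quota example output.
quota=2000         # node hours granted to the account (Quota)
used_account=780   # node hours used by all users of the account (UsedPeriod)
used_user=700      # node hours used by testuser (UsedPeriod)

# Node hours still available to the account (Avail).
avail=$(( quota - used_account ))

# A user's share of the quota as a percentage with one decimal (% of Qt).
user_pct=$(awk -v u="$used_user" -v q="$quota" 'BEGIN { printf "%.1f", 100 * u / q }')

echo "Available: $avail node hours"
echo "testuser share of quota: $user_pct %"
```

With the numbers above, this gives 1220 available node hours and a share of 35.0 % for testuser, matching the table.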

Submitting jobs

The following is a minimal job script - it generates a lot of random numbers and then sorts them. See later for a more realistic job script. For any job script, you should specify the account to use, the number of nodes you want (default 1), and the maximum wall time (at most 24 hours):

#! /bin/bash
#
#SBATCH --account test00_gpu      # account
#SBATCH --nodes 1                 # number of nodes
#SBATCH --time 2:00:00            # max time (HH:MM:SS)

for i in {1..10000}; do
  echo $RANDOM >> random.txt
done

sort random.txt

Note that Slurm parameters must be specified at the top of the file, before any real commands. Further, each parameter line must begin with #SBATCH at the very start of the line, written exactly like that.

To submit the job, write the above contents to a file, e.g., myscript.sh, and run the command:

testuser@fe1:~$ sbatch myscript.sh

You can also add extra options for sbatch overriding the values in the script itself, e.g.,

testuser@fe1:~$ sbatch --time 4:00:00 myscript.sh
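Since the maximum wall time is 24 hours, it can be handy to check a value before submitting. The following is a minimal sketch, not part of Slurm or of the Abacus tools: the function names walltime_to_seconds and walltime_ok are hypothetical, and only the HH:MM:SS form used in this guide is handled.

```shell
#!/bin/bash
# Convert a wall time given as HH:MM:SS to seconds.
# Assumption: exactly three colon-separated numeric fields.
walltime_to_seconds() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  # 10# forces base 10 so that fields such as "08" are not read as octal.
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Succeed if the requested wall time does not exceed the 24-hour maximum.
walltime_ok() {
  [ "$(walltime_to_seconds "$1")" -le $(( 24 * 3600 )) ]
}
```

It could be used, for example, as walltime_ok 4:00:00 && sbatch --time 4:00:00 myscript.sh.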

Information on jobs

List all current jobs, only the running jobs, or only the pending jobs for the user testuser:

testuser@fe1:~$ squeue -u testuser
testuser@fe1:~$ squeue -u testuser -t RUNNING
testuser@fe1:~$ squeue -u testuser -t PENDING

List detailed information for a job (sometimes useful for troubleshooting):

testuser@fe1:~$ scontrol show jobid -dd <jobid>

To cancel a single job, all jobs or all pending jobs for the user testuser:

testuser@fe1:~$ scancel <jobid>
testuser@fe1:~$ scancel -u testuser
testuser@fe1:~$ scancel -u testuser -t PENDING
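When many jobs are queued, a per-state summary can be easier to read than the full list. The sketch below assumes the squeue options -h (no header) and -o %T (print only the job state); the helper name count_job_states is hypothetical. It reads states from standard input, so it can also be tried without access to Slurm.

```shell
#!/bin/bash
# Read one job state per line on stdin and print "STATE COUNT"
# for each distinct state, sorted by state name.
count_job_states() {
  sort | uniq -c | awk '{ print $2, $1 }'
}

# On the frontend node, feed it from squeue:
#   squeue -u testuser -h -o %T | count_job_states
```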

Interactive jobs

To run an interactive job, you can use one of two variants. The first uses the srun command with a few options. This preserves your current environment, including, e.g., loaded modules, but does not offer X11 forwarding:

testuser@fe1:~$ srun -A test00_gpu --time 1:00:00 --pty bash -i
testuser@s32p19:~$

The second variant, sinteractive, does offer X11 forwarding, but clears the LD_LIBRARY_PATH variable, so to make everything work you have to reload software modules, etc.:

testuser@fe1:~$ sinteractive -A test00_gpu --time 1:00:00
Waiting for JOBID 8476 to start
testuser@s32p19:~$ #

In both cases you get 1 node by default. You can add extra options if you need multiple nodes, etc.:

testuser@fe1:~$ sinteractive -A test00_gpu --time 1:00:00 --nodes 32
Waiting for JOBID 8476 to start
testuser@s32p19:~$ #

Jobscript tips

  • Walltime, --time: Setting the maximum wall time as low as possible enables Slurm to pack your job onto idle nodes that are currently waiting for a large job to start.
  • Nodes, --nodes: If your job can be flexible, use a range for the number of nodes needed to run the job, e.g., --nodes=4-6. In this case your job starts running when at least 4 nodes are available. If at that time 5 or 6 nodes are available, your job gets all of them.
  • Tasks per node, --ntasks-per-node: Use this to select how many MPI ranks you want per node, e.g., 24 if you want one rank per CPU core or 2 if you want one MPI rank per GPU card.

Note that you do not need to specify the following in your job scripts:

  • Partition: The partition is automatically derived from the account you use, e.g., test00_gpu implies the partition gpu.
  • Memory use, e.g., --mem or --mem-per-cpu: By default you get all the RAM on the nodes you are running on.
  • GPU cards, i.e., --gres=gpu:2: If you are running on a GPU node, you automatically get access to both GPU cards.

MPI jobs

For MPI jobs, you should use a combination of --nodes and --ntasks-per-node to get the number of nodes and MPI ranks per node you want. Both have a default value of 1.

For all MPI implementations available as a module at Abacus, the recommended way to start MPI applications is using srun, i.e., not mpirun or similar.

#! /bin/bash
#
#SBATCH --account test00_gpu      # account
#SBATCH --nodes 4                 # number of nodes
#SBATCH --ntasks-per-node 24      # number of MPI tasks per node
#SBATCH --time 2:00:00            # max time (HH:MM:SS)

echo Running on "$(hostname)"
echo Available nodes: "$SLURM_NODELIST"
echo Slurm_submit_dir: "$SLURM_SUBMIT_DIR"
echo Start time: "$(date)"

# Load the modules previously used when compiling the application
module purge
module add gcc/4.8-c7 openmpi/1.8.4

# Start in total 4*24 MPI ranks on all available CPU cores
srun my-mpi-application -i input.txt -o output.txt

echo Done.
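The total number of MPI ranks is the product of the two options: 4 nodes with 24 tasks per node gives 96 ranks. A job script can log this number as a sanity check. The sketch below is an assumption-laden illustration, not part of the script above: the function name total_mpi_ranks is hypothetical, and it relies on the Slurm environment variables SLURM_JOB_NUM_NODES and SLURM_NTASKS_PER_NODE (the latter is only set when --ntasks-per-node is given), falling back to the default of 1 for both.

```shell
#!/bin/bash
# Print the number of MPI ranks srun is expected to start:
# allocated nodes times requested tasks per node.
total_mpi_ranks() {
  local nodes=${SLURM_JOB_NUM_NODES:-1}            # nodes in the allocation
  local tasks_per_node=${SLURM_NTASKS_PER_NODE:-1} # set only with --ntasks-per-node
  echo $(( nodes * tasks_per_node ))
}

echo "Expected MPI ranks: $(total_mpi_ranks)"
```

Outside a Slurm job neither variable is set, so the function prints 1.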

Further jobscript examples

Purely sequential job

#!/bin/bash

#SBATCH --account test00_gpu      # account
#SBATCH --nodes 1                 # number of nodes
#SBATCH --time 2:00:00            # max time (HH:MM:SS)

./serial.exe

Amber, Gaussian, Gromacs, Namd, etc

You can find sample sbatch job scripts in the folder /opt/sys/documentation/sbatch-scripts/ on the Abacus frontend nodes. For the software packages installed on Abacus, you can also look at our software page for further information.

Using as few switches as possible

The InfiniBand switches in Abacus are connected using a 3D torus. By default, Slurm always starts your job as soon as possible. When enough nodes are available for the job, i.e., the job is ready to start, Slurm packs the job onto the available nodes as well as possible.

If you have a very network-intensive job, you may want to ensure that your job is packed as tightly as possible, even at the cost of the job maybe starting later than would otherwise be possible.

For all the possible sbatch --switches options below, there is a time limit of one hour, i.e., after one hour, the --switches option is ignored.

sbatch --switches 1

Run everything using nodes from one switch (at most 16 slim/fat nodes or 18 gpu nodes).

sbatch --switches 2

Run everything using nodes from at most two neighbour switches (at most 32 slim/fat nodes or 34 gpu nodes).

sbatch --switches 3

Run everything using nodes from at most 2x2 neighbour switches (at most 64/72 nodes). For both fat and gpu nodes, there is no need to specify this, as there are only 64 and 72 nodes available, respectively.
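The node limits listed above can be turned into a small lookup that suggests the smallest --switches value for a given job size. This is only a sketch: the function name pick_switches is hypothetical, and the limits are hard-coded from the list above, reading "64/72" as 64 slim/fat nodes and 72 gpu nodes.

```shell
#!/bin/bash
# Print the smallest --switches value whose node limit covers the request.
# Usage: pick_switches <slim|fat|gpu> <number-of-nodes>
pick_switches() {
  local type=$1 nodes=$2 limits
  case "$type" in
    slim|fat) limits="16 32 64" ;;   # node limits for --switches 1, 2, 3
    gpu)      limits="18 34 72" ;;
    *) echo "unknown node type: $type" >&2; return 1 ;;
  esac
  local value=1 limit
  for limit in $limits; do
    if [ "$nodes" -le "$limit" ]; then
      echo "$value"
      return 0
    fi
    value=$(( value + 1 ))
  done
  echo "no --switches value covers $nodes $type nodes" >&2
  return 1
}
```

For example, sbatch --switches "$(pick_switches slim 24)" --nodes 24 myscript.sh would request at most two neighbour switches for a 24-node job on slim nodes.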