Tuesday, August 12, 2014

SLURM

Scheduling jobs with SLURM

by Jesse Stroik — last modified Aug 07, 2014 10:29 AM
SLURM usage guide for scientists on systems at SSEC.
Contents
  1. SLURM Description and Design
  2. Parallel Job Submission
    1. Setting Job Flags in a Script
    2. Example MPI Script 
    3. Job Limits and Resources Requested
  3. Submitting Hybrid jobs
  4. Serial Job Submission
  5. Overview of SLURM commands and arguments
    1. Primary user-facing commands
    2. Nomenclature
    3. Common Arguments for sbatch and srun
  6. Job Dependencies
    1. Example job dependency script
  7. Additional Resources
    1. System and Scheduler Related Assistance
    2. Scientific Software Related Assistance

SLURM Description and Design

SLURM schedules jobs and manages nodes in much the same way as PBS, LSF and Gridengine. Where those schedulers submit jobs with qsub or bsub, SLURM uses sbatch.
SLURM is designed to launch parallel (MPI / Hybrid) jobs from control jobs. A control job is an ordinary sbatch submission whose script contains an srun line, analogous to the mpiexec or mpirun commands used in scripts for other schedulers.

PBS, LSF, Gridengine : qsub/bsub,  mpiexec/mpirun
SLURM                : sbatch,     srun
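As a rough illustration (the script names here are placeholders), the same workflow under the two families of schedulers looks like:
# PBS / LSF / Gridengine: the batch script calls mpiexec or mpirun
qsub myjob.pbs
# SLURM: the batch script calls srun
sbatch myjob.sh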

Parallel Job Submission

Parallel job examples include WRF variants and various stages of the GFS, which use MPI and/or OpenMP. We start by explaining the flags that the scheduler pre-processes from your job script.

Setting Job Flags in a Script

You can specify flags in your script with #SBATCH. For example, if you wanted to submit a job to the serial partition with the name YOUR_JOB_NAME and output in your home directory, you'd add this:
#!/bin/bash
#
#SBATCH --job-name=YOUR_JOB_NAME
#SBATCH --partition=serial
#SBATCH --output=${HOME}/output/YOUR_JOB_NAME-control.%j
#SBATCH --ntasks=1
Lines in your script that begin with:
#SBATCH
are pre-processed by the scheduler. They do not affect other parts of your script, but the scheduler reads them as modifiers to your job submission.
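Once the flags are in place, submit the script itself with sbatch and check on it with squeue. The script name below is a hypothetical placeholder:
sbatch my_control_script.sh
# prints something like: Submitted batch job 123456
squeue -u $USER      # confirm the job is pending (PD) or running (R)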

Example MPI Script 

This is an sbatch script that you'd use to submit to the scheduler. It contains an srun command, which runs an MPI parallel job:
#!/bin/bash
#SBATCH --job-name=YOUR_JOB_NAME
#SBATCH --partition=s4
#SBATCH --export=NONE
#SBATCH --ntasks=180
#SBATCH --mem-per-cpu=6000
#SBATCH --time=02:00:00
#SBATCH --output=/scratch/${USER}/output/YOUR_JOB_NAME-control.%j
source /etc/bashrc
module purge
module load licence_intel intel/14.0-2
module load impi
module load hdf hdf5
module load netcdf4/4.1.3
# Way 1: within the script
# Here you could call a script that creates your srun jobs and manages them,
# or you could simply run srun like this:
srun --cpu_bind=core --distribution=block:block $HOME/path/to/mpi-executable
For those who have previously used mpiexec or mpirun, you will notice that srun is used much like those commands.

# Way 2: from the command line, giving the job step its own resources and output file
srun --output=/scratch/${USER}/output/YOUR_JOB_NAME.%j --cpu_bind=core --distribution=block:block \
 --mem-per-cpu=6000 --time=2:00:00 --ntasks=120 $HOME/path/to/mpi-executable
'srun' launches a job step within the SLURM job. The overall job must have enough resources allocated to cover the srun -- that is, if you want to run an MPI job with 200 CPU cores, you must request that in your initial sbatch.
In the Way 2 srun example above, 120 MPI tasks are issued; these will automatically be spread across multiple nodes. If you are submitting a job that you anticipate will use fewer than 10 CPU cores, please submit it with --share.
--cpu_bind=core --distribution=block:block

This setting is ideal for MPI jobs and especially for Hybrid jobs.
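If you want to sanity-check how tasks are placed with these bindings, one quick option (a sketch, not site policy) is to run a trivial program such as hostname as the job step inside your allocation and count tasks per node:
srun --cpu_bind=core --distribution=block:block hostname | sort | uniq -c
Each output line shows how many tasks landed on a given node.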


Job Limits and Resources Requested

It is important to understand how much memory your job will consume and how long it will take your job to run. Submitting jobs with accurate estimates for both will improve job scheduling for you.
Accurately specifying --time benefits you because jobs with a lower --time are likely to start sooner. The system will be configured with a low default.

--time: the lower your estimate, the sooner your job is likely to be scheduled. Jobs must complete within the time allotted.
--mem-per-cpu: this is important to ensure jobs do not use more memory than is available on a node. SLURM will schedule accordingly if you need larger amounts of memory.
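As an illustration only (the numbers are hypothetical, not recommendations), a job measured at roughly 35 minutes and 2.5 GB per task could be requested with a modest margin, and past usage can be checked with sacct where job accounting is available:
#SBATCH --time=00:45:00        # measured runtime ~35 minutes, plus a margin
#SBATCH --mem-per-cpu=3000     # measured peak ~2.5 GB per task
sacct -j 123456 --format=JobID,Elapsed,MaxRSS   # 123456 is a hypothetical job ID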


Module commands are best begun fresh with a 'module purge' at the start of the job to ensure a consistent state; then load the modules your job needs. This way, regardless of what happens with your environment, your job always gets the right modules. If you want very specific versions, be specific -- the default version of a module can change! In the example above, all module versions are specified explicitly.
If you do not yet understand module, please review our documentation on module before submitting jobs.
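As a quick interactive sketch (module names follow the examples in this guide; run module avail to see what is actually installed on the system):
module purge                  # start from a clean environment
module avail                  # list the installed modules and versions
module load intel/14.0-2      # pin an explicit version rather than the default
module list                   # confirm what is loaded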


Submitting Hybrid jobs

S4-Cardinal supports Hybrid jobs, in which MPI tasks run OpenMP threads. OpenMP (Open Multi-Processing) parallelizes loops with threads within a task, while MPI (Message Passing Interface) is a standard API for message-passing communication, process lookup, grouping and the creation of new message data types. The scheduler needs to know how many threads you will run per MPI task, and OpenMP needs the same information via the environment variable OMP_NUM_THREADS.
Hybrid jobs can drastically increase the amount of memory available to your tasks, but the code must be written with OpenMP loops.
The number of CPUs you need is ntasks * threads. So if you have an MPI job that would normally run on 320 tasks and you wanted to try 5 OpenMP threads per MPI task, you would tell SLURM --ntasks=64 (the number of MPI tasks) and export OMP_NUM_THREADS=5 (threads per MPI task). The following table illustrates the relationship between tasks, threads and nodes.
OMP_NUM_THREADS    ntasks    Nodes consumed
1 (pure MPI)       200       10 nodes
2                  100       10 nodes
4                  50        10 nodes
5                  40        10 nodes
10                 20        10 nodes
If you wished to run 5 OpenMP threads per MPI task, you would set the environment variable and issue the following arguments to srun within your sbatch script:
export OMP_NUM_THREADS=5
srun --cpus-per-task=5 --ntasks=40 --distribution=block:block --cpu_bind=core --time=02:00:00 \
--mem-per-cpu=5500 <script>
The sockets in S4-Cardinal have 10 cores, so we recommend issuing jobs that use 2, 5 or 10 threads. In our experience, 5 threads gives the fastest GFS fcst, but it may vary for you.
IMPORTANT: --mem-per-cpu has a multiplicative effect on the memory available when running OpenMP. If you run 5 threads, for example, each MPI task has approximately 5x the memory requested per core at its disposal.
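A minimal sketch of a complete hybrid submission, assuming the 40-task / 5-thread layout above and the module set from the earlier MPI example (paths and the executable name are placeholders):
#!/bin/bash
#SBATCH --job-name=YOUR_HYBRID_JOB
#SBATCH --partition=s4
#SBATCH --ntasks=40
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=5500
#SBATCH --time=02:00:00
#SBATCH --output=/scratch/${USER}/output/YOUR_HYBRID_JOB-control.%j
source /etc/bashrc
module purge
module load licence_intel intel/14.0-2
module load impi
export OMP_NUM_THREADS=5
srun --cpus-per-task=5 --ntasks=40 --cpu_bind=core --distribution=block:block \
 $HOME/path/to/hybrid-executable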


Serial Job Submission


An example serial job might be a job that just manipulates I/O, such as combining files, or a job-step that is not written to take advantage of MPI or OpenMP.

#!/bin/bash
#SBATCH --job-name=my-job
#SBATCH --partition=serial
#SBATCH --share
#SBATCH --time=1:20:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4500
#SBATCH --output=$HOME/job-output/my-serial-output.txt
module purge
module load license_intel
module load impi/4.1.3.049
module load intel/14.0-2
module load hdf/4.2.9
module load hdf5/1.8.12
module load netcdf4/4.1.3
export I_MPI_JOB_STARTUP_TIMEOUT=10000
$HOME/myscript.scr  # job runs here
Note the use of --share. This ensures that you do not hold a node with 20 CPU cores exclusively while using only one of them, which would waste resources. If you have a multithreaded program that requires multiple CPUs, you can use this same script but change --cpus-per-task=X, where X is the number of processors your program can use (see the sketch below).
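A minimal sketch of that change, assuming a hypothetical program that reads OMP_NUM_THREADS and can use 4 threads (SLURM sets SLURM_CPUS_PER_TASK to the value you request):
#SBATCH --cpus-per-task=4
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # thread count follows the request
$HOME/my-threaded-script.scr   # hypothetical multithreaded job script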


Overview of SLURM commands and arguments


Primary user-facing commands

Job submission is done via two commands: sbatch and srun.

sbatch: Handles serial jobs. Also used as a first step for MPI jobs.
srun: MPI-capable job-step launcher, usually invoked from within a script submitted via sbatch.

Additionally, for examining the queue you can run sinfo and squeue

sinfo: Displays nodes and partitions.
squeue: Shows the list of jobs running and yet to run. It also shows each job's state, including R (running) and PD (pending).

To manage your jobs, you will need scancel and sstat:
scancel: delete a job
sstat: get the status of a job
scontrol: manipulate or view details on a job. For example 'scontrol -dd show job <jobid>'
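A few typical invocations (the job ID 123456 and the partition name are placeholders):
squeue -u $USER                   # your jobs and their state (R, PD, ...)
sinfo -p serial                   # node availability in one partition
scontrol -dd show job 123456      # full detail for a single job
scancel 123456                    # remove that job from the queue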

Nomenclature

SLURM uses its own nomenclature to make some important distinctions. Taking time to understand the nomenclature is fundamental to SLURM fluency.

Queue: the list of SLURM jobs being executed and yet to be executed

Task: a job or a job-step (sub-job). This distinction is helpful because 'job' in SLURM now means an overall job, whereas a task can be a job or a job-step.

Node: a computer, typically with multiple CPUs. Many nodes make up the cluster, and each has a limited number of CPU cores and a limited amount of memory.

JobID: the numeric identifier that can be used to specify a job (as shown in squeue)

Partition: a set of hardware with similar rules and priorities -- referred to as a queue by other schedulers

NOTE: CPUs/Cores are sometimes used interchangeably. If the system means an actual CPU, it will usually be denoted as socket.

Common Arguments for sbatch and srun


--ntasks=<NUMBER> the number of tasks your job needs. Often, you will run 1 CPU core per task
--mem=<NUMBER> the amount of memory your job will use per node. Usually, you are better off requesting --mem-per-cpu
--mem-per-cpu=<NUMBER> the memory (in MB) your job needs per CPU core. This defaults to 6 GB
--cpus-per-task=<NUMBER> the number of CPUs each of your processes might use, for example with MPI + OpenMP hybrid models
--job-name=<STRING> please describe your job with appropriate acronyms, such as FCST, GSI, WRF, etc.
--share if you plan to issue jobs that will use fewer than 10 CPU cores, we recommend adding --share
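As a brief sketch combining several of these arguments for a small job that should share a node (the job name, partition and numbers are placeholders; the right partition depends on the system):
#!/bin/bash
#SBATCH --job-name=GSI
#SBATCH --partition=serial
#SBATCH --share
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=3000
#SBATCH --time=00:30:00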

Job Dependencies

For complex multi-step sbatch jobs, job dependencies are very useful. The basic syntax is simply
sbatch --dependency=afterok:<JOBID> myjob
This says: run 'myjob' after the job with ID <JOBID> completes successfully. If you submit multiple jobs by hand, you can simply look up the job ID and enter it manually. For multi-step jobs, however, it is usually preferable to submit them automatically.
One technique for doing this is to submit the dependent job *inside* the first job. This works well because $SLURM_JOB_ID is available there.
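If you would rather chain the jobs from the command line, another option is to capture the job ID when the first job is submitted. A minimal sketch (script names are placeholders; recent sbatch versions may also offer --parsable to print only the job ID):
# submit the first job and pull the ID out of "Submitted batch job NNNNN"
jobid=$(sbatch first_job.sh | awk '{print $4}')
# submit the copy job to run only after the first job succeeds
sbatch --dependency=afterok:${jobid} file_copy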

Example job dependency script

In this example we run the MPI job from earlier, then add a "file_copy" job that copies data after the main job has completed.
#!/bin/bash
#SBATCH --job-name=YOUR_JOB_NAME
#SBATCH --partition=s4
#SBATCH --export=NONE
#SBATCH --ntasks=180
#SBATCH --mem-per-cpu=6000
#SBATCH --time=02:00:00
#SBATCH --output=/scratch/${USER}/output/YOUR_JOB_NAME-control.%j
source /etc/bashrc
module purge
module load licence_intel intel/14.0-2
module load impi
module load hdf hdf5
module load netcdf4/4.1.3

#Submit the "file_copy" dependent job
#  Note you could also pass $SLURM_JOB_ID as a parameter to file_copy

sbatch --dependency=afterok:$SLURM_JOB_ID file_copy

# here you run your job as before, this is srun, but it could be anything.
srun --cpu_bind=core --distribution=block:block $HOME/path/to/mpi-executable
file_copy would be its own sbatch script. If desired, that script could issue another job dependency, and so on.

Additional Resources

The SLURM site at schedmd.com has user documentation, as well as a translation article they refer to as the rosetta stone of schedulers, which can sometimes help users coming from other schedulers.

System and Scheduler Related Assistance

Please contact the S4 support team at SSEC using the following email address:
s4.admin@ssec.wisc.edu
