Scheduling jobs with SLURM
by Jesse Stroik, last modified Aug 07, 2014 10:29 AM
SLURM usage guide for scientists on systems at SSEC.
SLURM Description and Design
SLURM is designed to submit parallel (MPI / Hybrid) jobs from control jobs. A control job is just like a regular sbatch submission, but within the script it contains an srun line, similar to mpiexec or mpirun commands used in scripts for other schedulers.
PBS, LSF, Gridengine : qsub/bsub, mpiexec/mpirun
SLURM : sbatch, srun
Parallel Job Submission
Parallel job examples include WRF variants and various stages of the GFS which use MPI and/or OpenMP. We start by explaining the scheduler pre-processed flags.
Setting Job Flags in a Script
You can specify flags in your script with #SBATCH. For example, if you wanted to submit a job to the serial partition with the name YOUR_JOB_NAME and output in your home directory, you'd add this:
#!/bin/bash
#
#SBATCH --job-name=YOUR_JOB_NAME
#SBATCH --partition=serial
#SBATCH --output=${HOME}/output/YOUR_JOB_NAME-control.%j
#SBATCH --ntasks=1
Lines in your script that begin with #SBATCH are pre-processed by the scheduler. They do not affect other parts of your script, but the scheduler takes them as modifiers to your job submission.
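To see which lines of a script would be treated as job flags, a small helper can print them. This is only a sketch: list_sbatch_directives is a hypothetical name, not a SLURM command, and it simply matches lines starting with #SBATCH rather than reproducing the scheduler's own parsing.

```shell
#!/bin/bash
# list_sbatch_directives is a hypothetical helper, not part of SLURM.
# It prints the flags from every line that starts with "#SBATCH".
list_sbatch_directives() {
    local script="$1"
    # Keep only the directive lines, then strip the "#SBATCH" prefix.
    grep '^#SBATCH' "$script" | sed 's/^#SBATCH[[:space:]]*//'
}

# Example (the script name is a placeholder):
#   list_sbatch_directives myjob.sh
```

Running it against a job script before submitting is a quick way to confirm the flags the scheduler will be given.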
Example MPI Script
This is an sbatch script that you'd submit to the scheduler. It contains an srun command which runs an MPI parallel job:
#!/bin/bash
#SBATCH --job-name=YOUR_JOB_NAME
#SBATCH --partition=s4
#SBATCH --export=NONE
#SBATCH --ntasks=180
#SBATCH --mem-per-cpu=6000
#SBATCH --time=02:00:00
#SBATCH --output=/scratch/${USER}/output/YOUR_JOB_NAME-control.%j
source /etc/bashrc
module purge
module load licence_intel intel/14.0-2
module load impi
module load hdf hdf5
module load netcdf4/4.1.3
# Way1: within a script
# here you could call a script that creates your srun jobs and manages them
# or you could just run srun like this
srun --cpu_bind=core --distribution=block:block $HOME/path/to/mpi-executable
For those who have previously used mpiexec or mpirun, you will notice that srun is used similarly to those commands.
# Way2: from command line
srun --output=/scratch/${USER}/output/YOUR_JOB_NAME.%j --cpu_bind=core --distribution=block:block \
 --mem-per-cpu=6000 --time=2:00:00 --ntasks=120 $HOME/path/to/mpi-executable
'srun' is considered a job step within the SLURM job. The overall job needs to have sufficient resources allocated to it to execute the srun; that is, if you want to run an MPI job with 200 CPU cores, you must request them in your initial sbatch.
In this srun example, you see that it launches 120 MPI tasks. These are spread across multiple nodes automatically.
If you are submitting a job for which you anticipate using fewer than 10 CPU cores, please submit with --share.
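The rule that a job step must fit inside the job's allocation can be checked in the script before calling srun. The sketch below assumes the job was submitted with --ntasks, so that SLURM_NTASKS is set in the job environment; step_fits is a hypothetical helper name.

```shell
#!/bin/bash
# step_fits is a hypothetical helper: it succeeds only if the requested
# number of tasks fits inside the allocation described by SLURM_NTASKS.
step_fits() {
    local wanted="$1"
    # SLURM_NTASKS is set inside a job when --ntasks was given to sbatch.
    # Treating an unset value as 1 is an assumption of this sketch.
    local allocated="${SLURM_NTASKS:-1}"
    [ "$wanted" -le "$allocated" ]
}

# Example use inside an sbatch script (the executable path is a placeholder):
#   if step_fits 120; then
#       srun --ntasks=120 "$HOME/path/to/mpi-executable"
#   else
#       echo "allocation too small for 120 tasks" >&2
#   fi
```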
The combination --cpu_bind=core --distribution=block:block is ideal for MPI jobs and especially for Hybrid jobs.
Job Limits and Resources Requested
Accurately specifying --time benefits you because jobs with a lower --time are likely to start sooner. The system will be configured with a low default.
--mem-per-cpu: This is important to ensure jobs do not use more memory than is available on a node. SLURM will schedule accordingly if you need higher amounts of memory.
module commands are best begun fresh with a 'module purge' at the beginning of the job to ensure a consistent state. Then load the modules your job needs. This way, regardless of what happens with your environment, your job always gets the right modules. If you want very specific versions, be specific, because the defaults for modules can change! In the example above, intel and netcdf4 are pinned to explicit versions.
If you do not yet understand module, please read our documentation on module before submitting jobs.
Submitting Hybrid jobs (dxu: hybrid is about OMP threads)
S4-Cardinal supports Hybrid jobs with MPI tasks running OpenMP threads. The scheduler needs to know how many threads you will issue per MPI task. Your program's OpenMP runtime needs the same information through the environment variable OMP_NUM_THREADS.
Hybrid jobs can drastically increase the amount of memory available to your tasks, but must be written with OpenMP loops.
The number of CPUs you need is ntasks * threads. So if you have an MPI job that would normally run on 320 tasks, and you wanted to try 5 OpenMP threads (5 threads per MPI task), you'd tell SLURM --ntasks=64 (the number of MPI tasks), because 64 * 5 = 320. The following table illustrates the relationship between tasks, threads and nodes.
dxu:
OpenMP: Open Multi-Processing (OMP)
MPI: Message Passing Interface (a standard API for message-passing communication, process information lookup, registration, grouping, and creation of new message data types)
--ntasks=64: number of MPI tasks
OMP_NUM_THREADS=5: 5 threads per MPI task
| OMP_NUM_THREADS | ntasks | Nodes consumed | 
| 1 (pure MPI) | 200 | 10 nodes | 
| 2 | 100 | 10 nodes | 
| 4 | 50 | 10 nodes | 
| 5 | 40 | 10 nodes | 
| 10 | 20 | 10 nodes | 
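The rows of the table follow from simple integer arithmetic, sketched below. The helper names are hypothetical, and the default of 20 CPU cores per node is an assumption taken from the node size mentioned in the serial job section of this guide.

```shell
#!/bin/bash
# hybrid_ntasks and nodes_needed are hypothetical helpers for this sketch.

# Number of MPI tasks when a fixed core count is split into OpenMP threads.
hybrid_ntasks() {
    local total_cores="$1" threads="$2"
    echo $(( total_cores / threads ))
}

# Nodes consumed, rounding up. cores_per_node defaults to 20 (an assumption).
nodes_needed() {
    local total_cores="$1" cores_per_node="${2:-20}"
    echo $(( (total_cores + cores_per_node - 1) / cores_per_node ))
}

# Reproduce the table for a 200-core job.
for threads in 1 2 4 5 10; do
    echo "OMP_NUM_THREADS=$threads ntasks=$(hybrid_ntasks 200 "$threads") nodes=$(nodes_needed 200)"
done
```

The node count stays the same in every row because the total number of cores does not change; only the split between tasks and threads does.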
If you wished to run 5 OpenMP threads per MPI task, you would issue the following arguments to srun within your sbatch script:
export OMP_NUM_THREADS=5
srun --cpus-per-task=5 --ntasks=40 --distribution=block:block --cpu_bind=core --time=02:00:00 \
  --mem-per-cpu=5500 <script>
IMPORTANT: --mem-per-cpu has a multiplicative effect on the memory available when running OpenMP. If you run 5 threads, for example, each MPI task has approximately 5x the memory requested per core at its disposal.
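As a rough illustration of that multiplicative effect, the memory available to one MPI task is the per-core request times the number of threads per task. mem_per_task_mb is a hypothetical helper name, and the figures are the ones used in the srun line above.

```shell
#!/bin/bash
# mem_per_task_mb is a hypothetical helper: approximate memory (in MB)
# available to one MPI task that owns several cores.
mem_per_task_mb() {
    local mem_per_core_mb="$1" cpus_per_task="$2"
    echo $(( mem_per_core_mb * cpus_per_task ))
}

# 5500 MB per core with 5 threads per task.
mem_per_task_mb 5500 5   # prints 27500
```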
Serial Job Submission
An example serial job might be a job that just manipulates I/O, such as combining files, or a job-step that is not written to take advantage of MPI or OpenMP.
#!/bin/bash
#SBATCH --job-name=my-job
#SBATCH --partition=serial
#SBATCH --share
#SBATCH --time=1:20:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4500
#SBATCH --output=$HOME/job-output/my-serial-output.txt
module purge
module load license_intel
module load impi/4.1.3.049
module load intel/14.0-2
module load hdf/4.2.9
module load hdf5/1.8.12
module load netcdf4/4.1.3
export I_MPI_JOB_STARTUP_TIMEOUT=10000
$HOME/myscript.scr # job runs here
Note the use of --share. This ensures that you do not occupy a node with 20 CPU cores exclusively while using only one of them. (dxu: this avoids wasting resources.) If you have a multithreaded program that requires multiple CPUs, you could use this same script but change --cpus-per-task=X, where X is the number of processors you can use.
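For a multithreaded program, the thread count inside the script should match the CPUs requested per task. The sketch below assumes an OpenMP program and relies on SLURM_CPUS_PER_TASK, which is set in the job environment when --cpus-per-task is given; threads_for_task is a hypothetical helper name.

```shell
#!/bin/bash
# threads_for_task is a hypothetical helper: pick a thread count that
# matches the CPUs granted to each task.
threads_for_task() {
    # SLURM_CPUS_PER_TASK is set when --cpus-per-task was given.
    # Falling back to 1 is an assumption of this sketch.
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

export OMP_NUM_THREADS="$(threads_for_task)"
echo "running with $OMP_NUM_THREADS thread(s)"
```

Deriving the value this way means changing X in --cpus-per-task is the only edit needed when you resize the job.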
Overview of SLURM commands and arguments
Primary user-facing commands
sbatch: Handles serial jobs. Also used as a first step for MPI jobs.
srun: MPI-capable job submission, usually run from within a script run via sbatch.
Additionally, for examining the queue you can run sinfo and squeue.
sinfo: Displays nodes and partitions.
squeue: Shows the list of jobs running and yet to be run, along with each job's state, including R (running) and PD (pending).
To manage your jobs, you will need scancel, sstat and scontrol:
scancel: cancel a pending or running job. For example 'scancel <jobid>'
sstat: get the status of a job
scontrol: manipulate or view details on a job. For example 'scontrol -dd show job <jobid>'
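When many jobs are queued, it can help to summarize squeue output by state. The following is a minimal sketch: count_states is a hypothetical helper that reads one state per line, and the intended input is squeue's compact state column (-h suppresses the header, -o "%t" prints the compact state).

```shell
#!/bin/bash
# count_states is a hypothetical helper: it reads one job state per line on
# standard input and prints "STATE=COUNT" for each distinct state.
count_states() {
    LC_ALL=C sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# Intended use on a system with SLURM installed:
#   squeue -h -u "$USER" -o "%t" | count_states
```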
Nomenclature
SLURM uses new nomenclature to help make important distinctions. 
Taking time to understand the nomenclature is fundamental to SLURM 
fluency.
Queue: the list of slurm jobs being executed and yet to be executed
Task: a job or a job-step (sub-job). This distinction is helpful because 'job' in SLURM now means an overall job, whereas a task can be a job or a job-step.
Node: a computer, typically with multiple CPUs. Many computers make up the cluster, and each has a limited number of CPU cores and a limited amount of memory.
JobID: the numeric identifier that can be used to specify a job (as shown in squeue)
Partition: a set of hardware with similar rules and priorities -- referred to as a queue by other schedulers
NOTE: CPUs/Cores are sometimes used interchangeably. If the system means an actual CPU, it will usually be denoted as socket.
Common Arguments for sbatch and srun
--ntasks=<NUMBER> The number of tasks your job needs. Often, you will run 1 CPU core per task.
--mem=<NUMBER> The amount of memory your job will use per node. Usually, you are better off requesting --mem-per-cpu.
--mem-per-cpu=<NUMBER> The memory (in MB) your job needs per CPU core. This defaults to 6GB.
--cpus-per-task=<NUMBER> The number of CPUs each of your processes might use. An example is an MPI + OpenMP hybrid model.
--job-name=<STRING> Please describe your job with appropriate acronyms, such as FSCT, GSI, WRF, etc.
--share If you plan to issue jobs that will use fewer than 10 CPU cores, we recommend adding --share.
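To show how these arguments fit together, here is a sketch of a helper that prints the top of a job script. sbatch_header is a hypothetical name, and all values passed to it are placeholders rather than recommendations.

```shell
#!/bin/bash
# sbatch_header is a hypothetical helper that prints #SBATCH lines for the
# common arguments. Every value passed in below is a placeholder.
sbatch_header() {
    local name="$1" ntasks="$2" cpus_per_task="$3" mem_per_cpu_mb="$4"
    printf '#!/bin/bash\n'
    printf '#SBATCH --job-name=%s\n' "$name"
    printf '#SBATCH --ntasks=%s\n' "$ntasks"
    printf '#SBATCH --cpus-per-task=%s\n' "$cpus_per_task"
    printf '#SBATCH --mem-per-cpu=%s\n' "$mem_per_cpu_mb"
}

# Print a header for a 40-task job with 5 CPUs per task and 5500 MB per CPU.
sbatch_header WRF 40 5 5500
```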
Job Dependencies
For complex multi-step sbatch jobs, job dependencies are very useful. The basic syntax is simply
sbatch --dependency=afterok:<JOBID> myjob
This says run 'myjob' after job with id <JOBID> completes successfully. If submitting multiple jobs by hand you simply look at the job id and enter it manually. However, for multi-step jobs it is likely preferable to submit them automatically.
To do this, one technique is to submit your dependent job *inside* the first job. This works well because $SLURM_JOB_ID is available.
Example job dependency script
In this example we run the MPI job from the earlier example, and also submit a "file copy" job that copies data after the MPI job has completed successfully.
#!/bin/bash
#SBATCH --job-name=YOUR_JOB_NAME
#SBATCH --partition=s4
#SBATCH --export=NONE
#SBATCH --ntasks=180
#SBATCH --mem-per-cpu=6000
#SBATCH --time=02:00:00
#SBATCH --output=/scratch/${USER}/output/YOUR_JOB_NAME-control.%j
source /etc/bashrc
module purge
module load licence_intel intel/14.0-2
module load impi
module load hdf hdf5
module load netcdf4/4.1.3
#Submit the "file_copy" dependent job
# Note you could also pass $SLURM_JOB_ID as a parameter to file_copy
sbatch --dependency=afterok:$SLURM_JOB_ID file_copy
# here you run your job as before, this is srun, but it could be anything.
srun --cpu_bind=core --distribution=block:block $HOME/path/to/mpi-executable
file_copy would be its own sbatch script. If desired, that script could issue another job dependency, and so on.
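An alternative to submitting the dependent job from inside the first one is to chain the submissions from a small driver script. The sketch below assumes sbatch reports a submission as "Submitted batch job <id>"; both helper names are hypothetical, and myjob and file_copy stand for your own scripts.

```shell
#!/bin/bash
# Both helpers are hypothetical names used only in this sketch.

# Extract the numeric id from sbatch's "Submitted batch job <id>" message.
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Build the dependency argument for a follow-on job.
afterok_flag() {
    printf '%s\n' "--dependency=afterok:$1"
}

# Intended use on a system with SLURM (script names are placeholders):
#   first=$(sbatch myjob | job_id_from_sbatch)
#   sbatch "$(afterok_flag "$first")" file_copy
```

Either approach gives the same ordering; the in-job technique keeps everything in one script, while a driver script makes the chain visible in one place.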
Additional Resources
The SLURM site at schedmd.com has user documentation, as well as a translation article they refer to as the Rosetta Stone of schedulers, which can help users who come from different schedulers.
System and Scheduler Related Assistance
Please contact the S4 support team at SSEC using the following email address:
s4.admin@ssec.wisc.edu
 