The job scheduler

Last updated on 2025-12-15

Overview

Questions

  • What is a job scheduler?
  • How do I run a program on HPC?
  • How do I check on a submitted job or cancel it?

Objectives

  • Be able to navigate the filesystem
  • Submit a serial job to the scheduler
  • Monitor the execution status of jobs (running/waiting/finished/failed)
  • Inspect the output and error files from a job
  • Find out how long your job took to run and what resources it used

An HPC system might have thousands of nodes and thousands of users. How do we decide who gets what and when? How do we ensure that a task is run with the resources it needs? This job is handled by a special piece of software called the scheduler. On an HPC system, the scheduler manages which jobs run where and when.

We will be using the SLURM scheduler, which is probably the most widely used scheduler for HPC, but many of the concepts we’ll encounter are transferable to other scheduler software you might encounter when using HPC, such as SGE.

Users interact with the scheduler using a jobscript, which is a plain text file containing the commands to be run, along with requests for computing resources, e.g. how much time, how much memory, and how many CPUs the job can use.

Jobscripts are usually submitted from the scratch filesystem, because it has more space and reads/writes files more quickly. Our training cluster doesn’t have a scratch filesystem, so for this course we’ll work in our home directories instead.
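If you are ever unsure which directory you are working in, pwd prints the current directory and cd on its own returns you to your home directory; the path shown below is illustrative and will depend on your username:

BASH

yourUsername@login:~$ cd
yourUsername@login:~$ pwd

OUTPUT

/home/yourUsername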

Submitting a jobscript


The first step we’re going to take towards writing a jobscript is to write a small shell script – essentially a text file containing a list of UNIX commands to be executed in a sequential manner.

We’ll use a command-line text editor called nano to create this file. This is a fairly intuitive, lightweight text editor that is widely available.

BASH

yourUsername@login:~$ nano hostname.sh

This starts a CLI text editor that looks like this:

Screenshot of the nano text editor

Our shell script will have three parts:

  1. On the very first line, add #!/bin/bash. The #! (pronounced “hash-bang” or “shebang”) tells the computer what program is meant to process the contents of this file. In this case, we are telling it that the commands that follow are written for the Bash shell. All characters after a # symbol in a shell script are treated as comments (i.e. for information only) and are not executed as commands. Most commonly, comments are written for the benefit of someone reading the script, but the shebang can be thought of as a special type of comment for the computer to read.
  2. Anywhere below the first line, we’ll add an echo command with a friendly greeting. When run, the shell script will print whatever comes after echo in the terminal. echo -n will print everything that follows, without ending the line by printing the new-line character.
  3. On the last line, we’ll invoke the hostname command, which will print the name of the machine the script is run on.

When finished the file should look like this:

#!/bin/bash

echo -n "This script is running on "
hostname

To close the editor when finished, we use the keyboard shortcuts shown at the bottom of the screen: Ctrl+O to save (Write Out), followed by Return to confirm the file name, and then Ctrl+X to exit.

Callout

Mac users also need to use Ctrl rather than CMD.

OK, so we’ve written a shell script, but how do we run it? You might be familiar with other scripting languages such as Python, where you run a script using something like:

BASH

$ python my_script.py

We can do something similar with our shell script, running it using:

BASH

yourUsername@login:~$ bash hostname.sh

Challenge

If you haven’t already, run the shell script hostname.sh. Does it execute on the cluster’s compute nodes or just on our login node?

BASH

yourUsername@login:~$ bash hostname.sh

OUTPUT

This script is running on login1

The script is running on the login node. We’ll find out next how to submit a job to the scheduler in order to run on a compute node.

This script ran on the login node, but we want to take advantage of the compute nodes: we need the scheduler to queue up hostname.sh to run on a compute node.

To submit a jobscript to the scheduler, we use the SLURM sbatch command. This creates a job that will run the script when it is dispatched to a compute node that the queuing system has identified as available to perform the work.

A jobscript should normally contain some SLURM directives to tell the scheduler something about the resources we want for the job, e.g. how many cores to allocate, a time limit, and what type of job this is (e.g. serial, parallel, high memory, etc.).

Our shell script doesn’t yet contain any of this information for the job scheduler, so we wouldn’t expect it to work.

We can try submitting it anyway—let’s see what happens.

The command we’re going to need is sbatch hostname.sh, which we could just type out, but there is a shortcut. We’ll start by typing sbatch h, then press the Tab key. This will show us a list of possible matches starting with the letter h, or complete the file name for us if there is only one match.

This is called tab-completion, and can be used to complete the names of files and directories, in addition to commands. Not only does this make your typing quicker, it also reduces typos because it can only complete file and command names that exist.

BASH

yourUsername@login:~$ sbatch hostname.sh

OUTPUT

Submitted batch job 340

The training cluster we are using has some default settings for SLURM, so our job submitted without errors, but in general, for your job to run correctly, you will need to tell the scheduler which partition to use, how many cores you need, and what time limit to set for your job. Failing to provide enough information results in an error like the one below:

OUTPUT

sbatch: error: Batch job submission failed: Invalid partition name specified

The SLURM partitions are equivalent to the job queues for different types of jobs (e.g. serial, parallel, GPU, high memory) that we encountered in the introduction section.

We’ll supply the missing information by editing the jobscript. We could use tab completion again to speed up typing nano hostname.sh (type nano h then press Tab), or use the up arrow key to cycle through previous commands until we find the one we want.

BASH

#!/bin/bash
#SBATCH -p compute     # The name of the available partitions varies between clusters
#SBATCH -t 2           # Set a time limit of 2 minutes


echo -n "This script is running on "
hostname

The shebang line must be the first line of the jobscript, so we add our SLURM directives underneath the initial shebang line, but above the commands we want to run as our job. The directives start with #SBATCH and are essentially a special type of comment which is interpreted by the SLURM software. The first directive is #SBATCH -p to specify which partition we want to use. Partitions can be listed using the SLURM command sinfo, but the cluster documentation will normally explain when to use each partition. The second directive #SBATCH -t indicates a time limit for our job.
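For example, running sinfo lists the partitions you can submit to; the partition name, time limit, and node names below are illustrative and will differ on your cluster:

BASH

yourUsername@login:~$ sinfo

OUTPUT

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 2-00:00:00      4   idle compute[01-04]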

Now that we have a jobscript with SLURM directives, let’s check that we’ve not made any errors by submitting it to the scheduler again:

BASH

yourUsername@login:~$ sbatch hostname.sh

We should see something like this that indicates the job has been submitted to the scheduler.

OUTPUT

Submitted batch job 2461232

We can check on the status of a running job using the squeue command.

BASH

yourUsername@login:~$ squeue

OUTPUT

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

There are no jobs listed (only the column headings), so we can logically assume that our jobs have either completed successfully or failed.

Challenge

Where’s the output from the test job?

When we ran our shell script on the login node, we saw output in the terminal telling us which node the job was running on.

You’ll notice that you don’t see any output printed to the terminal as your jobscript runs. Where does the output go?

Cluster job output is typically redirected to a file in the directory you submitted it from. Use ls to find and nano to view the file.

Your output should be in a file called slurm-[JOB_ID].out. One way to view the contents is with nano slurm-[JOB_ID].out, or, to print the contents in your terminal, use cat slurm-[JOB_ID].out.
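For example (the job ID and node name below are illustrative; yours will differ):

BASH

yourUsername@login:~$ ls slurm-*.out
slurm-340.out
yourUsername@login:~$ cat slurm-340.out
This script is running on compute01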

Checking on a job


Challenge

Check on the status of a submitted job

Edit the hostname.sh script to add a sleep time of 1 minute. This will give you enough time to check on the job before it finishes running. Submit the jobscript again, then run squeue to view the status of the submitted job.

Your test jobscript should look like this:

BASH

#!/bin/bash
#SBATCH -p compute
#SBATCH -t 2

sleep 60
echo -n "This script is running on "
hostname

The output from squeue will look a bit like this:

OUTPUT

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  173   compute hostname username  R       0:02      1 compute01

We’ve already encountered some of the column headings, such as the job ID and SLURM partition, and we can reasonably expect NAME to refer to the name of the job (it defaults to the jobscript file name if you’ve not explicitly given a name).

ST means status and displays the job state code. Common codes include R for running, CA for cancelled, CD for completed, PD for pending (waiting for resources), and F for failed.

TIME refers to how long the job has been running for, NODES shows how many nodes the job is running on, and NODELIST lists the nodes the job is running on. If it hasn’t started yet, a reason is given in brackets such as (Priority) meaning that other users have higher priority, or (Resources) which means that resources are not currently available for your job.
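On a busy cluster, squeue lists every user’s jobs, which can make yours hard to spot. To show only your own jobs, pass your username to the -u option:

BASH

yourUsername@login:~$ squeue -u yourUsername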

Challenge

Cancel a running job

Sometimes you’ll realise there is a problem with your job soon after submitting it.

In this scenario it is usually preferable to cancel the job rather than let it complete with errors, or produce output that is incorrect. It is also less wasteful of resources.

Resubmit your hostname.sh jobscript. View the status (and job ids) of your running jobs using squeue (JOBID is the first column).

Make a note of the job id of your hostname.sh then cancel the job using

BASH

scancel [job-id]

before it finishes.

Verify that the job was cancelled by looking at the job’s log file, slurm-[job-id].out.

Your output should look similar to this:

BASH

yourUsername@login:~$ sbatch hostname.sh
Submitted batch job 332
yourUsername@login:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               332   compute hostname username  R       0:02      1 compute01
yourUsername@login:~$ scancel 332
yourUsername@login:~$ cat slurm-332.out
This script is running on compute01
slurmstepd-compute01: error: *** JOB 332 ON compute01 CANCELLED AT 2025-10-13T12:53:19 ***

Challenge

Check the status of a completed job

squeue shows jobs that are running or waiting to run, but to view all jobs including those that have finished or failed, we need to use sacct.

Run sacct now to view all the jobs you have submitted so far.

Running sacct shows the job ID and some statistics about your jobs.

For now we are most interested in the State column—the examples so far in the course should result in at least one job with each of the states “COMPLETED”, “CANCELLED”, and “TIMEOUT”.

Other statuses you might see include “RUNNING”, “PENDING”, or “OUT_OF_MEMORY”.

BASH

yourUsername@login:~$ sacct

OUTPUT

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
330          hostname.+    compute     (null)          0  COMPLETED      0:0
331          hostname.+    compute     (null)          0    TIMEOUT      0:0
332          hostname.+    compute     (null)          0  CANCELLED      0:0

Resource requests


Resource requests (e.g. time limit, number of CPUs, memory) are typically binding. If you exceed them, your job will be killed, i.e. automatically cancelled by the scheduler. Let’s use wall time as an example: we will request 1 minute of wall time and attempt to run a job for 2 minutes.

Let’s edit the third and fifth lines of the jobscript to make these changes:

BASH

#!/bin/bash
#SBATCH -p compute     # The name of the available partitions varies between clusters
#SBATCH -t 1           # Set a time limit of 1 minute

sleep 120              # time in seconds
echo -n "This script is running on "
hostname

and resubmit the jobscript and wait for it to finish:

BASH

yourUsername@login:~$ sbatch hostname.sh

We can then view the status of the job using sacct — the most recent job should have the status of “TIMEOUT”. Something like this:

BASH

yourUsername@login:~$ sacct

OUTPUT

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
102          hostname.+    compute     (null)          0    TIMEOUT      0:0

We can also check the log file by printing it using cat:

BASH

yourUsername@login:~$ cat slurm-[JOB_ID].out
This script is running on compute01
slurmstepd-compute01: error: *** JOB 102 ON compute01 CANCELLED AT 2025-10-09T10:14:17 DUE TO TIME LIMIT ***

Our job was killed for exceeding the amount of resources it requested. Although this appears harsh, this is actually a good feature. Strict adherence to resource requests allows the scheduler to find the best possible place for your jobs. Even more importantly, it ensures that another user cannot use more resources than they’ve been given. If another user messes up and accidentally attempts to use all of the cores or memory on a node, Slurm will either restrain their job to the requested resources or kill the job outright. Other jobs on the node will be unaffected. This means that one user cannot mess up the experience of others – the only jobs affected by a mistake in scheduling will be their own.

It is a good idea to request slightly more resources than you expect to need. If you request a lot more (e.g. time) than your job needs, it will likely sit in the scheduler queue for longer. You can get some idea of the likely resources required from previous, similar jobs you have submitted; sacct usually shows more detail than it does on our training cluster.
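For example, on clusters where job accounting records them, you can ask sacct for fields such as the elapsed run time and peak memory use of previous jobs; whether these columns are populated depends on how accounting is configured, so some may be empty on our training cluster:

BASH

yourUsername@login:~$ sacct --format=JobID,JobName,Partition,Elapsed,MaxRSS,State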

Callout

Time formats

We have previously used the -t 2 option to request 2 minutes. There are multiple acceptable formats for specifying the time limit: “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds”.
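For example, here are time limit directives written in a few of these formats (a jobscript should contain only one time limit):

BASH

#SBATCH -t 30            # minutes: 30 minutes
#SBATCH -t 1:30:00       # hours:minutes:seconds: 1 hour 30 minutes
#SBATCH -t 1-12          # days-hours: 1 day and 12 hours
#SBATCH -t 2-00:00:00    # days-hours:minutes:seconds: 2 days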

Interactive jobs


Most of the work done on HPC tends to be batch jobs, where you submit your jobscript to the scheduler and you have no further interaction with it while it runs.

Sometimes you need to run a task interactively, e.g. using a GUI, debugging a failed job, or testing a small example before submitting a batch job.

These are typically not suitable tasks to run on the login node as they’re too resource-intensive, so usually there is an interactive partition which will enable you to use a compute node directly for a limited time period.

The details vary between HPC clusters, but will usually involve some variation on the srun command. Check the documentation for your HPC cluster to find out how to do this.
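As a rough sketch, on many SLURM clusters an interactive session is started with something like the command below; the partition name interactive and the 30-minute limit are illustrative assumptions and will differ between sites:

BASH

yourUsername@login:~$ srun -p interactive -t 30 --pty bash

When the session starts, your prompt typically changes to show the name of the compute node you have been allocated, and typing exit ends the session and returns you to the login node.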

Key Points
  • A jobscript is a shell script containing SLURM directives in addition to the commands you want to run
  • Submit your jobscript using sbatch
  • Run your jobs from the “scratch” filesystem
  • Request slightly more resources than you will need