Parallel jobs
Last updated on 2025-10-20
Overview
Questions
- How do I use multiple cores for my HPC job?
- How much faster is it to use more cores?
Objectives
- Be able to run a parallel job using SLURM
- Appreciate that parallel speed up is rarely linear
- Understand that there are diminishing returns when running a job using multiple cores
- Understand these are due to the serial/parallel split, communication between cores/nodes, frequency of data writing
We are going to run a parallel job using the Amdahl software we previously copied onto the cluster. It is a “fake work” program that simulates a task with both serial and parallel components.
We will use it to investigate Amdahl’s Law which predicts the theoretical speedup of a task when multiple processors are used for parallel processing. It highlights that the speedup is limited by the portion of the task that cannot be parallelized.
Install the Amdahl Program
With the Amdahl source code on the cluster, we can install it, which
will provide access to the amdahl executable.
First we need to extract the tar archive we copied onto the cluster.
Make sure you're in your home directory (`cd`), then extract the tarball:
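The exact filename depends on what you copied over; assuming the archive is called `amdahl.tar.gz`, the command is `tar -xvzf amdahl.tar.gz`. The sketch below builds a small stand-in archive so the flags can be tried anywhere:

```shell
# Stand-in demo of the extraction step: build a tiny amdahl.tar.gz, then
# extract it exactly as you would on the cluster (the filename is an assumption).
cd "$(mktemp -d)"                        # scratch directory for the demo
mkdir -p amdahl && echo "placeholder" > amdahl/README.md
tar -czf amdahl.tar.gz amdahl && rm -r amdahl
tar -xvzf amdahl.tar.gz                  # -x extract, -v list files, -z gunzip, -f from this archive
ls amdahl                                # the extracted directory
```

The `-v` flag is what produces the file listing shown below.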
OUTPUT
amdahl/
amdahl/.github/
amdahl/.github/workflows/
amdahl/.github/workflows/python-publish.yml
amdahl/requirements.txt
amdahl/README.md
amdahl/amdahl/
amdahl/amdahl/amdahl.py
amdahl/amdahl/__init__.py
amdahl/amdahl/__main__.py
amdahl/.gitignore
amdahl/LICENSE
amdahl/setup.py
The code is Python, so first of all we will create a new virtual environment to install our Python packages into. An old version of Python is available without loading any modulefiles, but we'll be using a more recent version so that all our code runs correctly.
Starting in your home directory (cd), we run
and then activate it using
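A minimal sketch of those two commands, run from your home directory (the name `.venv` matches the prompt shown below):

```shell
# Create a virtual environment in a directory called .venv, then activate it.
python3 -m venv .venv
source .venv/bin/activate      # after this, the prompt gains a (.venv) prefix
```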
OUTPUT
(.venv) yourUsername@login:~$
Notice how the prompt has changed to include the name of the virtual environment (.venv).
Next, move into the extracted directory, then use pip to
install it into your virtual environment with `pip install .`.
Note that the dot at the end (`.`) refers to the current
directory, and is needed for the command to work.
A test for success is asking the shell where the executable now lives, with `which amdahl`;
the output looks a bit like this:
OUTPUT
/some/path/yourUsername/.venv/bin/amdahl
Running a serial job
Having installed the amdahl program, we’ll return to our
home directory
Let’s create a serial jobscript for the amdahl program – we’ll call
it amdahl-serial.sh. We create the file with `nano amdahl-serial.sh`
and write the jobscript:
BASH
#!/bin/bash
#SBATCH -J serial-job
#SBATCH -p compute
#SBATCH -n 1
# Load the MPI module
module load openmpi/5.0.8
# Load the python virtual environment we need
source .venv/bin/activate
# Execute the task
amdahl
Submit the serial job
- Submit the amdahl serial job to the scheduler.
- Use SLURM commands to check that your job is running, and when it ends.
- Use `ls -t` to locate the output file. The `-t` option sorts by time, showing the newest files first. What was the output?

BASH
sbatch amdahl-serial.sh
squeue
The cluster output should be written to a file in the folder you launched the job from.
OUTPUT
slurm-[newestjobID].out  serial.sh  slurm-[olderjobID].out  jobscript.sh  slurm-[otherolderjobID].out

OUTPUT
Doing 30.000000 seconds of 'work' on 1 processor,
which should take 30.000000 seconds with 0.800000 parallel proportion of the workload.
Hello, World! I am process 0 of 1 on compute01. I will do all the serial 'work' for 6.321945 seconds.
Hello, World! I am process 0 of 1 on compute01. I will do parallel 'work' for 25.066232 seconds.
Total execution time (according to rank 0): 32.038031 seconds
We see that the default settings in the amdahl program run 30 seconds of work that is 80% parallel.
Since we only gave the job one CPU this job wasn’t really parallel: the same processor performed the ‘serial’ work for 6.3 seconds, then the ‘parallel’ part for 25 seconds, and no time was saved. The cluster can do better, if we ask.
Running a parallel job
The amdahl program uses the Message Passing Interface (MPI) for parallelism – this is a common tool on HPC systems.
What is MPI?
The Message Passing Interface is a set of tools which allow multiple tasks running simultaneously to communicate with each other. Typically a single executable is run multiple times, possibly on different machines, and the MPI tools are used to inform each instance of the executable about its sibling processes, and which instance it is. MPI also provides tools to allow communication between instances to coordinate work, exchange information about elements of the task, or to transfer data. An MPI instance typically has its own copy of all the local variables.
While MPI-aware executables can generally be run as stand-alone
programs, in order for them to run in parallel they must use an MPI
run-time environment, which is a specific implementation of the
MPI standard. To activate the MPI environment, the program
should be started via a command such as mpiexec (or
mpirun, or srun, etc. depending on the MPI
run-time you need to use), which will ensure that the appropriate
run-time support for parallelism is included.
MPI Runtime Arguments
On their own, commands such as mpiexec can take many
arguments specifying how many machines will participate in the
execution, and you might need these if you would like to run an MPI
program on your own (for example, on your laptop).
However, in the context of a queuing system it is frequently the case that the MPI run-time will obtain the necessary parameters from the queuing system, by examining the environment variables set when the job is launched.
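For example, inside a running job Slurm exports variables such as SLURM_NTASKS and SLURM_JOB_NODELIST, which the MPI run-time can read; outside a job they are simply unset. A quick way to see this:

```shell
# Print a couple of the environment variables Slurm sets for each job.
# Outside a Slurm job these are unset, so a fallback value is shown instead.
echo "Tasks:    ${SLURM_NTASKS:-unset (not inside a Slurm job)}"
echo "Nodelist: ${SLURM_JOB_NODELIST:-unset (not inside a Slurm job)}"
```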
Let’s modify the job script to request more cores and use the MPI run-time.
BASH
(.venv) yourUsername@login:~$ cp amdahl-serial.sh amdahl-parallel.sh
(.venv) yourUsername@login:~$ nano amdahl-parallel.sh
(.venv) yourUsername@login:~$ cat amdahl-parallel.sh
OUTPUT
#!/bin/bash
#SBATCH -J parallel-job
#SBATCH -p compute # Our cluster only has one partition, so no changes needed.
#SBATCH -n 2
#SBATCH -t 2:00
# Activate the python venv
source .venv/bin/activate
# Load the MPI module
module load openmpi/5.0.8
# Execute the task
mpirun amdahl
Then submit your job. Note that the submission command has not changed from how we submitted the serial job: all the parallel settings are in the batch file rather than the command line.
As before, use the status commands to check when your job runs
(squeue), and then inspect the job output file.
OUTPUT
Doing 30.000000 seconds of 'work' on 2 processors,
which should take 18.000000 seconds with 0.800000 parallel proportion of the workload.
Hello, World! I am process 0 of 2 on compute01. I will do all the serial 'work' for 6.917420 seconds.
Hello, World! I am process 0 of 2 on compute01. I will do parallel 'work' for 14.117882 seconds.
Hello, World! I am process 1 of 2 on compute01. I will do parallel 'work' for 12.386813 seconds.
Total execution time (according to rank 0): 21.066037 seconds
More cores means faster?
Splitting parallel work over more cores can be a good way to reduce the overall job execution time.
Most real-world problems have a serial element to the code, so the actual speed-up is less than the ideal you might get from a purely parallel task. By ideal speed-up, we mean that doubling the number of cores would halve the execution time.
In addition to the “fixed cost” of the serial work that can’t be shared across multiple cores, the speedup factor is influenced by:
- CPU design
- communication network between compute nodes
- MPI library implementations
- details of the MPI program itself
Submit a parallel job using more cores
The parallel job received twice as many processors as the serial job: does that mean it finished in half the time?
Resubmit the amdahl job requesting 4, then 8 cores.
How long does it take to execute compared with the serial version?
The parallel jobs took less time, but doubling the number of cores never halved the runtime: each doubling saved less time than the one before.
| Number of cores | Time (s) | Speed up | Ideal speed up |
|---|---|---|---|
| 1 | 32 | 1 | 1 |
| 2 | 21 | 1.5 | 2 |
| 4 | 12 | 2.7 | 4 |
| 8 | 9.3 | 3.4 | 8 |
The serial part of the work cannot be split so this is a “fixed cost” which limits the speed up. Rank 0 has to finish the serial work before distributing the parallel work across the ranks.
Using Amdahl’s Law, you can show that with this program it is impossible to reach a 5× speedup, no matter how many processors you have on hand: with an 80% parallel workload, the limit is 1/(1 − 0.8) = 5. In practice, the serial work takes ~6 s and can’t be split, so it acts as a floor on the runtime (32 s / 6 s ≈ 5.3).
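We can check the table against Amdahl’s Law directly: the predicted speedup on n cores is S(n) = 1 / ((1 − p) + p/n), where p is the parallel fraction (0.8 here). A small helper (the function name is just for illustration, not part of the amdahl package) computes the predictions:

```shell
# Amdahl's Law: S(n) = 1 / ((1-p) + p/n).  The serial fraction (1-p)
# never shrinks, so S(n) can never exceed 1/(1-p) = 5 when p = 0.8.
amdahl_speedup() {    # illustrative helper
  awk -v n="$1" -v p=0.8 'BEGIN { printf "%.2f", 1 / ((1 - p) + p / n) }'
}
amdahl_speedup 2; echo        # 1.67
amdahl_speedup 8; echo        # 3.33
amdahl_speedup 1000000; echo  # 5.00 -- the ceiling, however many cores
```

The measured speedups (1.5×, 2.7×, 3.4×) fall below even these predictions because of the communication and startup overheads listed above.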
Improve the time estimate
Consider the last job you ran. Use sacct to view how
long the job took.
Modify your jobscript to request a shorter time limit accordingly.
How close can you get?
Modifying your jobscript like this will set the time limit to 50 seconds:

BASH
#SBATCH -t 00:00:50
Key Points
- Use `#SBATCH -n X` to request X cores
- Parallel speed-up isn't usually linear