Parallel jobs

Last updated on 2025-10-20

Estimated time: 45 minutes

Overview

Questions

  • How do I use multiple cores for my HPC job?
  • How much faster is it to use more cores?

Objectives

  • Be able to run a parallel job using SLURM
  • Appreciate that parallel speed up is rarely linear
  • Understand that there are diminishing returns when running a job using multiple cores
  • Understand these are due to the serial/parallel split, communication between cores/nodes, frequency of data writing

We are going to run a parallel job using the Amdahl software we previously copied onto the cluster. It is a “fake work” program that simulates a task with both serial and parallel components.

We will use it to investigate Amdahl’s Law which predicts the theoretical speedup of a task when multiple processors are used for parallel processing. It highlights that the speedup is limited by the portion of the task that cannot be parallelized.
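Amdahl's Law can be written as a one-line function. Here is a short sketch (the function name is our own) that computes the predicted speedup S(n) = 1 / ((1 − p) + p/n) for a workload with parallel fraction p running on n processors:

```python
# Amdahl's Law: predicted speedup on n processors when a fraction p
# of the work can be parallelised (a sketch; the names are our own).
def amdahl_speedup(p, n):
    """S(n) = 1 / ((1 - p) + p / n), relative to a single processor."""
    return 1.0 / ((1.0 - p) + p / n)

# With 80% parallel work (the amdahl program's default), doubling the
# number of processors never doubles the speed:
for n in (1, 2, 4, 8):
    print(f"{n} processors: predicted speedup {amdahl_speedup(0.8, n):.2f}x")
```

Notice that the serial fraction (1 − p) appears in the denominator undivided by n: that term is the part no amount of extra hardware can shrink.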

Install the Amdahl Program


With the Amdahl source code on the cluster, we can install it, which will provide access to the amdahl executable.

First we need to extract the tar archive we copied onto the cluster.

Make sure you’re in your home directory:

BASH

yourUsername@login:~$ cd

then extract the tarball:

BASH

yourUsername@login:~$ tar -xzvf compressed_code.tar.gz

OUTPUT

amdahl/
amdahl/.github/
amdahl/.github/workflows/
amdahl/.github/workflows/python-publish.yml
amdahl/requirements.txt
amdahl/README.md
amdahl/amdahl/
amdahl/amdahl/amdahl.py
amdahl/amdahl/__init__.py
amdahl/amdahl/__main__.py
amdahl/.gitignore
amdahl/LICENSE
amdahl/setup.py

The code is written in Python, so first of all we will create a new virtual environment to install our Python packages into. An old version of Python is available without loading any modulefiles, but we’ll be using a more recent version so that all our code runs correctly.

Starting in your home directory (cd), we run

BASH

yourUsername@login:~$ module load python/3.13.5
yourUsername@login:~$ python3 -m venv .venv

and then activate it using

BASH

yourUsername@login:~$ source .venv/bin/activate

OUTPUT

(.venv) yourUsername@login:~$

Notice how the prompt has changed to include the name of the virtual environment (.venv).

Next, move into the extracted directory, then use pip to install it in your virtual environment:

BASH

(.venv) yourUsername@login:~$ cd amdahl
(.venv) yourUsername@login:~/amdahl$ pip install .

Note that the dot (.) at the end refers to the current directory, and is required for the command to work.

A test for success looks like this:

BASH

(.venv) yourUsername@login:~/amdahl$ which amdahl

and the output looks a bit like this

OUTPUT

/some/path/yourUsername/.venv/bin/amdahl

Running a serial job


Having installed the amdahl program, we’ll return to our home directory

BASH

(.venv) yourUsername@login:~/amdahl$ cd

Let’s create a serial jobscript for the amdahl program – we’ll call it amdahl-serial.sh. We create a new file with

BASH

(.venv) yourUsername@login:~$ nano amdahl-serial.sh

and write the jobscript

BASH

#!/bin/bash
#SBATCH -J serial-job
#SBATCH -p compute
#SBATCH -n 1

# Load the MPI module
module load openmpi/5.0.8

# Load the python virtual environment we need
source .venv/bin/activate

# Execute the task
amdahl

Challenge

Submit the serial job

  1. Submit the amdahl serial job to the scheduler.
  2. Use SLURM commands to check that your job is running, and when it ends.
  3. Use ls -t to locate the output file. The -t option sorts by time, showing the newest files first. What was the output?
  1. sbatch amdahl-serial.sh

  2. squeue

  3. The cluster output should be written to a file in the folder you launched the job from.

    BASH

    (.venv) yourUsername@login:~$ ls -t

    OUTPUT

    slurm-[newestjobID].out  amdahl-serial.sh   slurm-[olderjobID].out
    jobscript.sh             slurm-[otherolderjobID].out

    BASH

    (.venv) yourUsername@login:~$ cat slurm-[newestjobID].out

    OUTPUT

    Doing 30.000000 seconds of 'work' on 1 processor,
     which should take 30.000000 seconds with 0.800000 parallel proportion of the workload.
    
      Hello, World! I am process 0 of 1 on compute01. I will do all the serial 'work' for 6.321945 seconds.
      Hello, World! I am process 0 of 1 on compute01. I will do parallel 'work' for 25.066232 seconds.
    
    Total execution time (according to rank 0): 32.038031 seconds

We see that the default settings in the amdahl program run 30 seconds of work that is 80% parallel.

Since we only gave the job one CPU this job wasn’t really parallel: the same processor performed the ‘serial’ work for 6.3 seconds, then the ‘parallel’ part for 25 seconds, and no time was saved. The cluster can do better, if we ask.

Running a parallel job


The amdahl program uses the Message Passing Interface (MPI) for parallelism – this is a common tool on HPC systems.

Callout

What is MPI?

The Message Passing Interface is a set of tools which allow multiple tasks running simultaneously to communicate with each other. Typically a single executable is run multiple times, possibly on different machines, and the MPI tools are used to inform each instance of the executable about its sibling processes, and which instance it is. MPI also provides tools to allow communication between instances to coordinate work, exchange information about elements of the task, or to transfer data. An MPI instance typically has its own copy of all the local variables.

While MPI-aware executables can generally be run as stand-alone programs, in order for them to run in parallel they must use an MPI run-time environment, which is a specific implementation of the MPI standard. To activate the MPI environment, the program should be started via a command such as mpiexec (or mpirun, or srun, etc. depending on the MPI run-time you need to use), which will ensure that the appropriate run-time support for parallelism is included.

Callout

MPI Runtime Arguments

On their own, commands such as mpiexec can take many arguments specifying how many machines will participate in the execution, and you might need these if you would like to run an MPI program on your own (for example, on your laptop).

However, in the context of a queuing system it is frequently the case that the MPI run-time will obtain the necessary parameters from the queuing system, by examining the environment variables set when the job is launched.
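Your own programs can read the same environment variables the MPI run-time uses. As a sketch, a Python script could inspect its Slurm allocation like this (SLURM_NTASKS and SLURM_CPUS_PER_TASK are real variables Slurm sets inside a job; outside a job they are unset, so we fall back to 1):

```python
import os

# Slurm exports details of the allocation as environment variables, so a
# program (or an MPI run-time) can discover them without any command-line
# arguments. Outside a running job these variables are unset.
ntasks = int(os.environ.get("SLURM_NTASKS", "1"))
cpus_per_task = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
print(f"Running with {ntasks} task(s), {cpus_per_task} CPU(s) per task")
```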

Let’s modify the job script to request more cores and use the MPI run-time.

BASH

(.venv) yourUsername@login:~$ cp amdahl-serial.sh amdahl-parallel.sh
(.venv) yourUsername@login:~$ nano amdahl-parallel.sh
(.venv) yourUsername@login:~$ cat amdahl-parallel.sh

OUTPUT

#!/bin/bash
#SBATCH -J parallel-job
#SBATCH -p compute  # Our cluster only has one partition, so no changes needed.
#SBATCH -n 2
#SBATCH -t 2:00

# Activate the python venv
source .venv/bin/activate

# Load the MPI module
module load openmpi/5.0.8

# Execute the task
mpirun amdahl

Then submit your job. Note that the submission command has not changed from how we submitted the serial job: all the parallel settings are in the batch file rather than the command line.

BASH

yourUsername@login:~$ sbatch amdahl-parallel.sh

As before, use the status commands to check when your job runs (squeue), and then inspect the job output file.

BASH

yourUsername@login:~$ cat slurm-347178.out

OUTPUT

Doing 30.000000 seconds of 'work' on 2 processors,
 which should take 18.000000 seconds with 0.800000 parallel proportion of the workload.

  Hello, World! I am process 0 of 2 on compute01. I will do all the serial 'work' for 6.917420 seconds.
  Hello, World! I am process 0 of 2 on compute01. I will do parallel 'work' for 14.117882 seconds.

  Hello, World! I am process 1 of 2 on compute01. I will do parallel 'work' for 12.386813 seconds.

Total execution time (according to rank 0): 21.066037 seconds
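The 18-second prediction in that output comes straight from Amdahl's Law: the 20% serial part of the 30 seconds of work cannot be shared, while the 80% parallel part is split between the 2 ranks. A quick check (this arithmetic is ours, not something the amdahl program prints):

```python
# Predicted runtime for 30 s of work, 80% parallel, on 2 ranks.
total, p, n = 30.0, 0.8, 2
serial_time = (1 - p) * total    # 6 s that only rank 0 can do
parallel_time = p * total / n    # 24 s of work shared by 2 ranks
predicted = serial_time + parallel_time
print(predicted)                 # 18.0
```

The measured 21 seconds is longer than the prediction because of start-up and communication overheads, which the simple model ignores.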

More cores means faster?


Splitting parallel work over more cores can be a good way to reduce the overall job execution time.

Most real-world problems have a serial element to the code, so the actual speed-up is less than the ideal you might get from a purely parallel task. By ideal speed-up, we mean that doubling the number of cores would halve the execution time.

In addition to the “fixed cost” of the serial work that can’t be shared across multiple cores, the speedup factor is influenced by:

  • CPU design
  • communication network between compute nodes
  • MPI library implementations
  • details of the MPI program itself
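One way to picture why these overheads matter is to bolt a per-core communication cost onto Amdahl's Law. This is a toy model of our own, not something the amdahl program computes, but it shows how even a small overhead eventually makes adding cores counterproductive:

```python
def speedup_with_overhead(p, n, c=0.01):
    """Toy model: Amdahl's Law plus a communication cost that grows with n."""
    return 1.0 / ((1.0 - p) + p / n + c * n)

# With a 1% per-core overhead the speedup peaks and then *falls*:
for n in (1, 2, 4, 8, 16, 32):
    print(f"{n:2d} cores: {speedup_with_overhead(0.8, n):.2f}x")
```

In the pure Amdahl model, speedup only ever increases towards its limit; with the overhead term, there is a sweet spot beyond which more cores make the job slower.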
Challenge

Submit a parallel job using more cores

The parallel job received twice as many processors as the serial job: does that mean it finished in half the time?

Resubmit the amdahl job requesting 4, then 8 cores.

How long does it take to execute compared with the serial version?

The parallel jobs took less time, but doubling the number of cores doesn’t halve the runtime: each job took longer than half the time of the one before.

Number of cores   Time (s)   Speed up   Ideal speed up
1                 32         1          1
2                 21         1.5        2
4                 12         2.7        4
8                 9.3        3.4        8

The serial part of the work cannot be split so this is a “fixed cost” which limits the speed up. Rank 0 has to finish the serial work before distributing the parallel work across the ranks.

Using Amdahl’s Law, you can show that with this program it is impossible to reach a 5× speedup, no matter how many processors you have on hand. The serial work takes ~6 s and can’t be split, so it acts as a lower limit on the runtime: the speedup can approach, but never reach, 32 s / 6 s ≈ 5.3.
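You can check this limit numerically. Even with an absurd number of cores, the predicted speedup (using the Amdahl's Law formula, with our own function name) never reaches 1 / (1 − 0.8) = 5:

```python
def amdahl_speedup(p, n):
    """Amdahl's Law: speedup on n processors with parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)

# Even a million cores barely pushes the speedup past 4.99 --
# the limit as n grows is exactly 1 / (1 - p) = 5.
print(amdahl_speedup(0.8, 1_000_000))
```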

Challenge

Improve the time estimate

Consider the last job you ran. Use sacct to view how long the job took.

Modify your jobscript to request a shorter time limit accordingly.

How close can you get?

Modifying your jobscript like this will set the time limit to 50 seconds:

#SBATCH -t 00:00:50
Key Points
  • Use #SBATCH -n X to request X cores
  • Parallel speed up isn’t usually linear