Transferring files
Last updated on 2025-10-15
Estimated time: 35 minutes
Overview
Questions
- How do I transfer files to and from an HPC cluster?
Objectives
- Copy files from your laptop onto the cluster (and vice-versa)
- Download files directly onto the cluster
Performing work on a remote computer is not very useful if we cannot get files to or from the cluster. There are several options for transferring data between computing resources using CLI and GUI utilities, a few of which we will cover.
Download Lesson Files From the Internet Onto Your Laptop
One of the most straightforward ways to download files is to use either curl or wget. At least one of these is usually available in Linux shells, in the macOS Terminal, and in Git Bash. Any file that can be downloaded in your web browser through a direct link can be downloaded using curl or wget. This is a quick way to download datasets or source code. The syntax for these commands is
BASH
wget [-O new_name] https://some/link/to/a/file
curl [-o new_name] -L https://some/link/to/a/file
By default, curl and wget download files to
the same name as the file on the remote file server. The -O
option to wget and the -o option to
curl allow you to specify a new name for the file that you
download.
Note that curl needs the -L option to follow links that redirect, whereas wget follows redirects by default.
Download the “tarball”
Try it out by downloading some material we’ll use later on, from a terminal on your local machine, using the URL of the current codebase:
https://github.com/hpc-carpentry/amdahl/tarball/main
The word “tarball” in the above URL refers to a compressed archive
format commonly used on Linux, which is the operating system the
majority of HPC cluster machines run. A tarball is a lot like a
.zip file. The actual file extension is
.tar.gz, which reflects the two-stage process used to
create the file: the files or folders are merged into a single file
using tar, which is then compressed using
gzip, so the file extension is “tar-dot-g-z.” That’s a
mouthful, so people often say “the xyz tarball” instead.
You may also see the extension .tgz, which is just an
abbreviation of .tar.gz.
Use one of the above commands to save the tarball as amdahl.tar.gz, rather than using the same name as on the server.
[you@laptop:~]$ wget -O amdahl.tar.gz https://github.com/hpc-carpentry/amdahl/tarball/main
# or
[you@laptop:~]$ curl -o amdahl.tar.gz -L https://github.com/hpc-carpentry/amdahl/tarball/main
After downloading the file, use ls to see it in your working directory:
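BASH
[you@laptop:~]$ ls
The listing will include amdahl.tar.gz, alongside anything else already in your working directory.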
Archiving Files
One of the biggest challenges we often face when transferring data between remote HPC systems is the sheer number of files. There is an overhead to transferring each individual file, and when we transfer many files these overheads add up and slow the transfer considerably.
The solution to this problem is to archive multiple files into
smaller numbers of larger files before we transfer the data to improve
our transfer efficiency. Sometimes we will combine archiving with
compression to reduce the amount of data we have to transfer and so
speed up the transfer. The most common archiving command you will use on
a (Linux) HPC cluster is tar.
tar can be used to combine files and folders into a
single archive file and optionally compress the result. Let’s look at
the file we downloaded from the lesson site,
amdahl.tar.gz.
The .gz part stands for gzip, which is a
compression library. It’s common (but not necessary!) that this kind of
file can be interpreted by reading its name: it appears somebody took
files and folders relating to something called “amdahl,” wrapped them
all up into a single file with tar, then compressed that
archive with gzip to save space.
Let’s see if that is the case, without unpacking the file.
tar prints the “table of contents” with the -t
flag, for the file specified with the -f flag followed by
the filename. Note that you can concatenate the two flags: writing
-t -f is interchangeable with writing -tf
together. However, the argument following -f must be a
filename, so writing -ft will not work.
BASH
[you@laptop:~]$ tar -tf amdahl.tar.gz
hpc-carpentry-amdahl-46c9b4b/
hpc-carpentry-amdahl-46c9b4b/.github/
hpc-carpentry-amdahl-46c9b4b/.github/workflows/
hpc-carpentry-amdahl-46c9b4b/.github/workflows/python-publish.yml
hpc-carpentry-amdahl-46c9b4b/.gitignore
hpc-carpentry-amdahl-46c9b4b/LICENSE
hpc-carpentry-amdahl-46c9b4b/README.md
hpc-carpentry-amdahl-46c9b4b/amdahl/
hpc-carpentry-amdahl-46c9b4b/amdahl/__init__.py
hpc-carpentry-amdahl-46c9b4b/amdahl/__main__.py
hpc-carpentry-amdahl-46c9b4b/amdahl/amdahl.py
hpc-carpentry-amdahl-46c9b4b/requirements.txt
hpc-carpentry-amdahl-46c9b4b/setup.py
This example output shows a folder which contains a few files, where 46c9b4b is an abbreviated (7-character) git commit hash that will change when the source material is updated.
Now let’s unpack the archive. We’ll run tar with a few common flags:
- -x to extract the archive
- -v for verbose output
- -z for gzip compression
- -f TARBALL for the file to be unpacked
Challenge
Using the flags above, unpack the source code tarball using
tar.
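One way to do this, using the filename we chose when downloading:
BASH
[you@laptop:~]$ tar -xvzf amdahl.tar.gz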
OUTPUT
hpc-carpentry-amdahl-46c9b4b/
hpc-carpentry-amdahl-46c9b4b/.github/
hpc-carpentry-amdahl-46c9b4b/.github/workflows/
hpc-carpentry-amdahl-46c9b4b/.github/workflows/python-publish.yml
hpc-carpentry-amdahl-46c9b4b/.gitignore
hpc-carpentry-amdahl-46c9b4b/LICENSE
hpc-carpentry-amdahl-46c9b4b/README.md
hpc-carpentry-amdahl-46c9b4b/amdahl/
hpc-carpentry-amdahl-46c9b4b/amdahl/__init__.py
hpc-carpentry-amdahl-46c9b4b/amdahl/__main__.py
hpc-carpentry-amdahl-46c9b4b/amdahl/amdahl.py
hpc-carpentry-amdahl-46c9b4b/requirements.txt
hpc-carpentry-amdahl-46c9b4b/setup.py
Note that we did not need to type out -x -v -z -f,
thanks to flag concatenation, though the command works identically
either way – so long as the concatenated list ends with f,
because the next string must specify the name of the file to
extract.
The folder has an unwieldy name, so let's change that to something more convenient. For this task we'll use the mv command, which can be used to move files and directories or rename them in place, and has this syntax:
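[you@laptop:~]$ mv OLD_NAME NEW_NAME
Here OLD_NAME and NEW_NAME are placeholders for the current and desired names.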
We rename the directory using:
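BASH
[you@laptop:~]$ mv hpc-carpentry-amdahl-46c9b4b amdahl
If the commit hash in your extracted folder name differs, substitute the name you saw in the tar output.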
Now let’s check the size of the extracted directory and compare to
the compressed file size, using the du command for “disk
usage”. We’ll use the -s option to
summarise the total space used, and -h to
present it in a human-readable format.
BASH
[you@laptop:~]$ du -sh amdahl.tar.gz
8.0K amdahl.tar.gz
[you@laptop:~]$ du -sh amdahl
48K amdahl
Text files (including Python source code) compress nicely: the “tarball” is one-sixth the total size of the raw data!
If you want to reverse the process, compressing raw data instead of extracting it, use the -c ("create") flag instead of -x, give the new archive a filename, and provide the directory to compress:
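For example, to archive the renamed amdahl folder as compressed_code.tar.gz (the name we will use when uploading it later):
BASH
[you@laptop:~]$ tar -cvzf compressed_code.tar.gz amdahl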
OUTPUT
amdahl/
amdahl/.github/
amdahl/.github/workflows/
amdahl/.github/workflows/python-publish.yml
amdahl/.gitignore
amdahl/LICENSE
amdahl/README.md
amdahl/amdahl/
amdahl/amdahl/__init__.py
amdahl/amdahl/__main__.py
amdahl/amdahl/amdahl.py
amdahl/requirements.txt
amdahl/setup.py
If you give amdahl.tar.gz as the filename in the above command, tar will overwrite the existing tarball: the archive you downloaded earlier would be replaced by the newly created one.
So don't do that!
Transferring Single Files and Folders With scp
To copy a single file to or from the cluster, we can use
scp (“secure copy”). The syntax can be a little complex for
new users, but we’ll break it down. The scp command is a
relative of the ssh command we used to access the system,
and can use the same authentication mechanism as we used before.
To upload to another computer, the template command is
[you@laptop:~]$ scp local_file yourUsername@cluster_url:remote_destination
in which @ and : are field separators and
remote_destination is a path relative to your remote home directory, or
a new filename if you wish to change it, or both a relative path and a
new filename. If you don’t have a specific folder in mind you can omit
the remote_destination and the file will be copied to your
home directory on the remote computer (with its original name). If you
do include a remote_destination, note that scp
interprets this as follows: if it exists and is a folder, the file is
copied inside the folder; if it exists and is a file, the file is
overwritten with the contents of local_file; if it does not
exist, it is assumed to be a destination filename for
local_file.
Upload the lesson material to your remote home directory like so:
BASH
[you@laptop:~]$ scp compressed_code.tar.gz yourUsername@hpc-training.digitalmaterials-cdt.ac.uk:
Why Not Download the Files Directly to HPC?
Most computer clusters are protected from the open internet by a firewall. For enhanced security, some are configured to allow traffic inbound, but not outbound. This means that an authenticated user can send a file to a cluster machine, but a cluster machine cannot retrieve files from a user’s machine or the open Internet.
Try downloading the file directly! This can be done on the login node - no need for a batch job submission.
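For example, after logging into the cluster with ssh, you could run the same wget command there. (We assume a generic login-node prompt here; yours will look different.)
BASH
[yourUsername@login1 ~]$ wget -O amdahl.tar.gz https://github.com/hpc-carpentry/amdahl/tarball/main
If the download stalls or fails, your cluster's firewall is likely blocking outbound traffic, and you will need to transfer the file from your laptop instead.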
Copy a file from the HPC cluster to your laptop
So far we’ve copied files from a laptop to HPC, but how about the other way round?
The syntax is very similar, and the transfer is once again initiated from your laptop rather than from the HPC prompt. The command we will use swaps the order of the arguments such that it looks like this:
BASH
[you@laptop:~]$ scp yourUsername@hpc-training.digitalmaterials-cdt.ac.uk:remote_directory/remote_file local_destination
Using this as a guide, copy your hostname.sh script from
the HPC cluster to your laptop.
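One possible solution, assuming hostname.sh sits in your home directory on the cluster and you want it in your current working directory on the laptop:
BASH
[you@laptop:~]$ scp yourUsername@hpc-training.digitalmaterials-cdt.ac.uk:hostname.sh .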
Key Points

- Use wget to download files onto HPC
- Use scp for copying files between laptop and HPC
- Use tar for creating and unpacking archives