Getting Started#
Welcome to the "Getting Started" guide. This chapter will lead you through the initial steps of logging into the HPC-UGent infrastructure and submitting your very first job. We'll also walk you through the process step by step using a practical example.
In addition to this chapter, you might find the recording of the Introduction to HPC-UGent training session to be a useful resource.
Before proceeding, read the introduction to HPC to gain an understanding of the HPC-UGent infrastructure and related terminology.
Getting Access#
To get access to the HPC-UGent infrastructure, visit Getting an HPC Account.
If you have not used Linux before, now would be a good time to follow our Linux Tutorial.
A typical workflow looks like this:#
- Connect to the login nodes
- Transfer your files to the HPC-UGent infrastructure
- Optional: compile your code and test it
- Create a job script and submit your job
- Wait for job to be executed
- Study the results generated by your jobs, either on the cluster or after downloading them locally.
We will walk through an illustrative workload to get you started. In this example, our objective is to train a deep learning model for recognizing hand-written digits (MNIST dataset) using TensorFlow; see the example scripts.
Getting Connected#
There are two options to connect:
- Using a terminal to connect via SSH (for power users) (see First Time connection to the HPC-UGent infrastructure)
- Using the web portal
Since your operating system is Linux, it is recommended to use the ssh command in a terminal to get the most flexibility.
Assuming you have already generated SSH keys in the previous step (Getting Access), and that they are in a default location, you should now be able to login by running the following command:
ssh vsc40000@login.hpc.ugent.be
Use your own VSC account id
Replace vsc40000 with your VSC account id (see https://account.vscentrum.be)
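If your SSH key is not stored in the default location, you can point ssh at it explicitly with the -i option. The key path below is only an illustration; use the path of your own private key:
# illustrative key path; adjust to where your private key actually lives
ssh -i ~/.ssh/id_rsa_vsc vsc40000@login.hpc.ugent.be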
Tip
You can also still use the web portal (see shell access on web portal)
Info
If you experience problems, see the connection issues section on the troubleshooting page.
Transfer your files#
Now that you can login, it is time to transfer files from your local computer to your home directory on the HPC-UGent infrastructure.
Download the tensorflow_mnist.py and run.sh example scripts to your computer (from here).
On your local machine you can run:
curl -OL https://raw.githubusercontent.com/hpcugent/vsc_user_docs/main/mkdocs/docs/HPC/examples/Getting_Started/tensorflow_mnist/tensorflow_mnist.py
curl -OL https://raw.githubusercontent.com/hpcugent/vsc_user_docs/main/mkdocs/docs/HPC/examples/Getting_Started/tensorflow_mnist/run.sh
Using the scp command, the files can be copied from your local host to your home directory (~) on the remote host (HPC).
scp tensorflow_mnist.py run.sh vsc40000@login.hpc.ugent.be:~
ssh vsc40000@login.hpc.ugent.be
Use your own VSC account id
Replace vsc40000 with your VSC account id (see https://account.vscentrum.be)
Info
For more information about transferring files or scp, see transfer files from/to HPC.
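If you have a whole directory of files to transfer, scp can also copy directories recursively with the -r flag. The directory name below is just an example:
# my_project is an example directory name on your local machine
scp -r my_project vsc40000@login.hpc.ugent.be:~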
When running ls in your session on the HPC-UGent infrastructure, you should see the two files listed in your home directory (~):
$ ls ~
run.sh tensorflow_mnist.py
If you do not see these files, make sure you uploaded them to your home directory.
Submitting a job#
Jobs are submitted and executed using job scripts. In our case run.sh can be used as a (very minimal) job script.
A job script is a shell script: a text file that specifies the resources, the software that is used (via module load statements), and the steps that should be executed to run the calculation.
Our job script looks like this:
#!/bin/bash
module load TensorFlow/2.15.1-foss-2023a
python tensorflow_mnist.py
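The script above relies on the default resource settings. As a sketch, resource requests can be added to a job script with #PBS directives; the walltime and core count below are illustrative values, not recommendations for this example:
#!/bin/bash
#PBS -l walltime=1:00:00    # illustrative: request at most 1 hour of walltime
#PBS -l nodes=1:ppn=4       # illustrative: request 1 node with 4 cores
module load TensorFlow/2.15.1-foss-2023a
python tensorflow_mnist.py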
The jobs you submit are by default executed on cluster/doduo; you can swap to another cluster by issuing the following command:
module swap cluster/donphan
Tip
When submitting jobs that require only a limited amount of resources, it is recommended to use the debug/interactive cluster: donphan.
To get a list of all clusters and their hardware, see https://www.ugent.be/hpc/en/infrastructure.
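To see which cluster modules are available from your current session, you can for example run:
module avail cluster/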
This job script can now be submitted to the cluster's job system for execution, using the qsub (queue submit) command:
$ qsub run.sh
123456
This command returns a job identifier (123456) on the HPC cluster. This is a unique identifier for the job which can be used to monitor and manage your job.
Make sure you understand what the module command does
Note that the module commands only modify environment variables. For instance, running module swap cluster/donphan will update your shell environment so that qsub submits a job to the donphan cluster, but your active shell session is still running on the login node.
It is important to understand that while module commands affect your session environment, they do not change where the commands you are running are executed: they will still run on the login node you are on. When you submit a job script, however, the commands in the job script will be run on a worker node of the cluster the job was submitted to (like donphan).
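A quick way to verify which cluster module is currently active in your session is to list your loaded modules; the exact output format may differ:
module list                 # the loaded cluster/... module shows which cluster qsub will target
module swap cluster/doduo   # swap back to the default cluster if needed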
For detailed information about module commands, read the running batch jobs chapter.
Wait for job to be executed#
Your job is put into a queue before being executed, so it may take a while before it actually starts (see when will my job start? for the scheduling policy).
You can get an overview of the active jobs using the qstat command:
$ qstat
Job ID Name User Time Use S Queue
---------- ---------------- --------------- -------- - -------
123456 run.sh vsc40000 0:00:00 Q donphan
Eventually, after entering qstat again, you should see that your job has started running:
$ qstat
Job ID Name User Time Use S Queue
---------- ---------------- --------------- -------- - -------
123456 run.sh vsc40000 0:00:01 R donphan
If you don't see your job in the output of the qstat command anymore, your job has likely completed. Read this section on how to interpret the output.
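If you spot a mistake after submitting (for example, a typo in the job script), you can remove a queued or running job with the qdel command; replace the job ID with your own:
qdel 123456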
Inspect your results#
When your job finishes, it generates two output files:
- One for normal output messages (stdout output channel).
- One for warning and error messages (stderr output channel).
By default, these files are located in the directory where you issued qsub.
Info
For more information about the stdout and stderr output channels, see this section.
In our example, running ls in the current directory should show two new files:
- run.sh.o123456, containing normal output messages produced by job 123456;
- run.sh.e123456, containing errors and warnings produced by job 123456.
Info
run.sh.e123456 should be empty (no errors or warnings).
Use your own job ID
Replace 123456 with the job ID you got from the qstat command (see above), or simply look for newly added files in your current directory by running ls.
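To view the contents of these files directly in your terminal session, you can for instance use cat (or a pager such as less):
cat run.sh.o123456
cat run.sh.e123456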
When examining the contents of run.sh.o123456, you will see something like this:
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 1s 0us/step
Epoch 1/5
1875/1875 [==============================] - 2s 823us/step - loss: 0.2960 - accuracy: 0.9133
Epoch 2/5
1875/1875 [==============================] - 1s 771us/step - loss: 0.1427 - accuracy: 0.9571
Epoch 3/5
1875/1875 [==============================] - 1s 767us/step - loss: 0.1070 - accuracy: 0.9675
Epoch 4/5
1875/1875 [==============================] - 1s 764us/step - loss: 0.0881 - accuracy: 0.9727
Epoch 5/5
1875/1875 [==============================] - 1s 764us/step - loss: 0.0741 - accuracy: 0.9768
313/313 - 0s - loss: 0.0782 - accuracy: 0.9764
Hurray 🎉, we trained a deep learning model and achieved 97.64 percent accuracy.
Warning
When using TensorFlow specifically, you should actually submit jobs to a GPU cluster for better performance; see GPU clusters.
For the purpose of this example, we are running a very small TensorFlow workload on a CPU-only cluster.
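If you would rather study the results on your own computer, as mentioned in the workflow at the start of this chapter, you can copy the output file back with scp. Run this from your local machine (not on the login node), and use your own job ID:
# replace 123456 with the job ID returned by qsub
scp vsc40000@login.hpc.ugent.be:~/run.sh.o123456 .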
Next steps#
- Running interactive jobs
- Running jobs with input/output data
- Multi core jobs/Parallel Computing
- Interactive and debug cluster
For more examples, see Program examples and Job script examples.