Running Jobs

To run a command as a job you can either use srun to quickly run it interactively or sbatch to spool it for asynchronous execution.

In either case you supply the following two parameters for the job, of which only the first is mandatory:

  1. The course for which you run the job. You can see the course tags that you can use when you log in via ssh.
  2. The maximum runtime you expect the command to need. If you omit it, a default of 60 minutes is used.

A GPU is automatically added to the job for courses that use one GPU. You only need to add --gpus for courses that allow more than one GPU per job.

Using srun

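For a simple, short-running command you can use srun with the -A (course tag) and -t (maximum runtime) switches. For instance, to query the properties of the GPU at your disposal with a time limit of ten seconds, requesting one GPU with -G and writing the output to nvidia-smi.out with -o, run: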

srun -A {course tag} -G 1 -t 00:10 -o nvidia-smi.out nvidia-smi
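If you run several short commands, the two parameters described above can be assembled by a small shell helper. The following is a minimal sketch, not part of Slurm: the function name build_srun_opts is hypothetical, and the fallback value of 60 only mirrors the documented default runtime.

```shell
#!/bin/bash
# Sketch: assemble the srun options for a course tag and an optional time limit.
# build_srun_opts is a hypothetical helper, not a Slurm command.
build_srun_opts() {
    local course="$1"        # course tag, passed to -A
    local limit="${2:-60}"   # time limit for -t; 60 minutes if not given
    if [ -z "$course" ]; then
        echo "usage: build_srun_opts COURSE_TAG [TIME_LIMIT]" >&2
        return 1
    fi
    printf -- '-A %s -t %s\n' "$course" "$limit"
}
```

With this helper, the command above could be written as:

srun $(build_srun_opts {course tag} 00:10) -G 1 -o nvidia-smi.out nvidia-smi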

Running Interactive Jobs

If you need to interact with a program via a terminal, you can also use srun with the --pty argument. To get an interactive bash shell on a node for at most 60 minutes, run:

srun --pty -A {course tag} -t 60 bash

Using sbatch

For sbatch, put your commands in a script and add comment lines starting with #SBATCH at the beginning; these carry the parameters for sbatch. The example above, rewritten for sbatch, requires a script containing:

#!/bin/bash

#SBATCH --time=00:10
#SBATCH --account={course tag}
#SBATCH --output=nvidia-smi.out

nvidia-smi

Assuming the script is saved as batch.sh, send it to the cluster for execution with:

sbatch batch.sh
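If you want to keep the job ID for later commands, sbatch can be asked to print it in a machine-readable form with --parsable, which outputs the job ID, optionally followed by a semicolon and the cluster name. A minimal sketch, assuming the script is called batch.sh; the helper name job_id_from_parsable is hypothetical:

```shell
#!/bin/bash
# Sketch: strip the optional ";cluster" suffix from `sbatch --parsable` output.
# job_id_from_parsable is a hypothetical helper, not a Slurm command.
job_id_from_parsable() {
    local out="$1"
    printf '%s\n' "${out%%;*}"   # keep everything before the first semicolon
}

# Submit only if sbatch is available and batch.sh exists in the current directory.
if command -v sbatch >/dev/null 2>&1 && [ -f batch.sh ]; then
    job_id=$(job_id_from_parsable "$(sbatch --parsable batch.sh)")
    echo "Submitted job $job_id"
fi
```

The stored job ID can then be passed to other Slurm commands that accept one.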

Using Modules

To use the module command in a batch script, the first command in the script must be the following (note the leading dot):

. /etc/profile.d/modules.sh

For example:

#!/bin/bash

#SBATCH --time=00:10
#SBATCH --account={course tag}
#SBATCH --output=nvcc.out

. /etc/profile.d/modules.sh
module add cuda/12.1
nvcc --version

Temporary Space and Network Scratch

Each job receives a dedicated temporary directory under /tmp ($TMPDIR). This directory is located on the local SATA SSD. Data placed there is deleted when the job ends. There are also limitations on how much space you can use.

You also have a personal network scratch directory under /work/scratch that is accessible from all nodes, including the login nodes. Old data there is deleted automatically; how soon depends on how much data you store.
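One common way to combine the two locations is to copy input data to the job-local directory, work on the local copy, and copy the results back before the job ends. The following is a minimal sketch of that pattern, not a prescribed workflow: the function name run_in_tmp is hypothetical, and counting lines with wc stands in for the real work.

```shell
#!/bin/bash
# Sketch: run_in_tmp INPUT OUTDIR
# Copies INPUT to the job-local temporary directory, processes the local copy,
# and copies the result to OUTDIR before the temporary directory is deleted.
# run_in_tmp is a hypothetical helper; the wc call is a placeholder for real work.
run_in_tmp() {
    local input="$1" outdir="$2"
    local work="${TMPDIR:-/tmp}"   # falls back to /tmp when run outside a job
    local name
    name=$(basename "$input")
    cp "$input" "$work/$name" || return 1
    # Placeholder for the real work: count the lines of the local copy.
    wc -l < "$work/$name" > "$work/$name.count" || return 1
    mkdir -p "$outdir" && cp "$work/$name.count" "$outdir/"
}
```

Inside a batch script this could be called as run_in_tmp "$SCRATCH/input.txt" "$SCRATCH/results", where SCRATCH is assumed to be a variable you set to the path of your personal directory under /work/scratch.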

Checking the Job Queue

To see if your job is still waiting in the queue, run:

squeue

If you see jobs listed, they are either executing or waiting. For waiting jobs, a reason is given why the job cannot start yet.

To maximize energy efficiency, this cluster powers down idle nodes. If all running nodes are busy, a node may first have to be started to run your job, which can delay the start by about five minutes.
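To list only your own waiting jobs together with the reason they have not started, the squeue output can be filtered. A minimal sketch: it uses the squeue options -h (no header), -u (user) and -o (output format, with %i for the job ID, %T for the state and %r for the reason); the helper name pending_reasons is hypothetical.

```shell
#!/bin/bash
# Sketch: read lines of the form "JOBID STATE REASON" on stdin and print
# "JOBID: REASON" for every job that is still pending.
# pending_reasons is a hypothetical helper, not a Slurm command.
pending_reasons() {
    awk '$2 == "PENDING" { print $1 ": " $3 }'
}

# Query the queue only if squeue is available on this machine.
if command -v squeue >/dev/null 2>&1; then
    squeue -h -u "$(id -un)" -o "%i %T %r" | pending_reasons
fi
```

If nothing is printed, none of your jobs is waiting.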

Output

By default, the terminal output of the commands in a job goes to a file slurm-{job ID}.out in your home directory. You can set a different file name with the -o (--output) option. In the nvidia-smi examples above, the terminal output of the job ends up in the file nvidia-smi.out.

Aborting Jobs

If you already know that a running or spooled job is not going to do the right thing, cancel it with:

scancel {job ID}

This will save resources and energy.

Page URL: https://isg.inf.ethz.ch/bin/view/Main/HelpClusterComputingStudentClusterRunningJobs
2024-12-21
© 2024 Eidgenössische Technische Hochschule Zürich