Troubleshooting
This topic lists common problems users run into and how to fix them.
I cannot log in
Access to the cluster is enabled only for students enrolled in a course that uses the cluster and for the TAs of such courses. Access is revoked on the last Monday of the semester holidays after a course.
Access to student-jupyter.inf.ethz.ch is further restricted to students enrolled in a course that runs Jupyter notebooks.
My job is not starting
Run squeue to check the queue. If the right-most column lists a node name (studgpu-node??), you will need to wait up to five minutes until that node is powered up.
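To list only your own jobs, you can filter squeue by user name (a standard Slurm option):
squeue -u $USER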
Instead of a node name you may also see a status code:
- (QOSMaxGRESPerUser): You have requested too much RAM, too many GPUs, or too many CPU cores.
- (Resources) or (QOSGrpGRESMinutes): You have not set the course name and runtime properly. See here for how to start a job properly, and try the examples there; they should always work. A minimal submission sketch follows below.
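For illustration, a submission that sets the course account and runtime explicitly might look like the following; --account, --time, and --gpus are standard Slurm options, but the account name and script are placeholders, and your course may use a different mechanism:
sbatch --account=my-course --time=01:00:00 --gpus=1 job.sh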
My job ends too soon
Some courses have a default job runtime of 1h but allow longer jobs. You can choose a runtime up to the maximum allowed per course as described here.
My job got canceled
This only affects courses that have long-running jobs. These jobs get canceled when the cluster is full and users of other courses start more short jobs or Jupyter notebooks. Interrupted jobs are automatically restarted once the cluster load drops.
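To check whether one of your jobs was canceled and requeued in this way, Slurm's accounting tool can show its state history (sacct and these format fields are standard Slurm; the exact states shown depend on the cluster configuration):
sacct -u $USER --format=JobID,JobName,State,Elapsed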
I am out of GPU time for a course
Each user gets only a fixed amount of GPU time to finish a course. If that is not enough, please contact a TA to find a solution.
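There may be a course-specific way to check your remaining budget; as a generic sketch, Slurm's accounting database can at least show how much GPU time your past jobs consumed (standard sacct options; the start date is a placeholder):
sacct -u $USER -S 2025-01-01 -X --format=JobID,Elapsed,AllocTRES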
My home directory is full
You have 20GB of space and will need to get by with this for all courses.
One place where space is usually wasted is the pip cache. Run these two commands to get its size and purge the cache:
python3 -m pip cache info
python3 -m pip cache purge
To avoid filling the cache, always run pip with the --no-cache-dir option.
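For example (the package name is a placeholder):
python3 -m pip install --no-cache-dir some-package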
If you are working with conda then there is no equivalent option; you can only clean the cache periodically:
conda clean --all
To see where your space is used, run this command:
du -sh ~/* ~/.local ~/.cache
If you have data sets or models that were downloaded via git, for instance from Hugging Face, run
lfs-hardlink path_to_checkout
In a git checkout every file exists twice, once in the working tree and once in git's internal storage; this command reduces each file to one copy that is hard-linked from two or more locations.
Ultimately, you can also copy data that you do not actively need right now to your personal device.
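For example, run rsync on your personal device to pull the data down (the host name and paths are placeholders; use the login node you normally connect to):
rsync -av username@login-node:old-experiments/ ./old-experiments/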
Project work storage or scratch is full
If you get out-of-space errors for your personal scratch space under /work/scratch or for one of the work directories under /work/courses or /work/users, then you either have too much data there or too many files, and you will need to delete something.
The limits are as follows:
Directory         | Space                | Files
/work/courses/*/* | (depends on course)  | (10000 per 1GB storage)
/work/users/*     | (depends on project) | (10000 per 1GB storage)
/work/scratch/*   | 100GB                | 100000
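To check how much space and how many files a directory uses, du and find are standard tools (the path below assumes your scratch directory is named after your user name):
du -sh /work/scratch/$USER
find /work/scratch/$USER -type f | wc -l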
My JupyterHub server is not starting
First make sure that:
- There are only green check marks for the course resources on the Start Jupyter Server page.
- You picked a suitable environment.
- If you provide your own environment, specify the absolute path to the bin directory of your environment.
- Your home directory is not completely full.
- You have waited five minutes after starting. The cluster may need to power on a node to run your job.
To check whether the environment is the problem, try whether the Minimal Jupyter environment with PyTorch and CUDA x.xx starts.
Next, tick the check mark for "write a slurm logfile" on the Start Jupyter Server page and try again. If the server does not start, log in to a login node and look at the last logfile that was created in your home directory. This should show you what went wrong.
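One way to find the newest logfile, assuming the logfiles are written to your home directory with a .log suffix (the exact naming depends on the setup):
ls -t ~/*.log | head -n 1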
You may see something like this:
slurmstepd: error: execve(): /cluster/courses/.../jupyter/bin/batchspawner-singleuser: Permission denied
This error usually means that whoever prepared the environment (a TA) did not set the access permissions for the environment correctly, or used a python3 binary from a local installation in a home directory, most likely a miniconda installation. In this case contact a TA; only the person who set up the environment can fix this.
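For the TA who owns the environment, a minimal sketch of the usual fix is to make the environment readable and traversable for everyone (the path is a placeholder for the actual environment location):
chmod -R a+rX /cluster/courses/your-course/jupyter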
VSCode cannot monitor all files for changes
Because of performance problems, the number of files that can be watched by a user is currently limited to 16384 but may be set to a lower limit in the future. Please edit the file .vscode-server/data/Machine/settings.json and exclude all files in a project that do not need monitoring. Common known directories that do not need watching are already excluded.
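A minimal sketch of such an exclusion via VS Code's files.watcherExclude setting (the directory names are placeholders for your project's large data directories):
{
  "files.watcherExclude": {
    "**/datasets/**": true,
    "**/checkpoints/**": true
  }
}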
Alternatively move everything that is not code outside of the project directory.
The number of active inotify watches can be displayed with the command /cluster/admin/tools/inotify-info.
I get errors about an incompatible CUDA version when I install Python modules
Read the error message, as it usually tells you which version of CUDA you need. Read on here for how to activate a particular version of CUDA.
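For example, PyTorch publishes wheels built against specific CUDA versions on its own package index (cu121 below is just an example; pick the index matching your activated CUDA version):
python3 -m pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu121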
Some software that I need is not installed
Write to support@inf.ethz.ch and let us know what you are missing and why you need it. We'll have a look at your request and install the necessary packages on all login nodes and GPU nodes if it is not too complicated.