Using conda env for PyTorch with CUDA support
Introduction
You can't simply ssh into a GPU node and install the dependencies there, because the computation nodes are not allowed to reach the internet. You must ssh into the login node, build the environment there (e.g. via conda), and then run your script inside that environment through the Slurm workload manager. So, first of all, log into the login node:
ssh frontend1.hpc.sissa.it
Note: files should be placed in your home or in your scratch folder. In the following commands we use "$USER" to automatically expand to your login username; if you prefer, you can simply replace each occurrence with your actual username.
Installing miniconda
If you don't already have it, install miniconda in your home:
cd ~
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
mkdir -p ~/miniconda3
bash ~/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda.sh
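As an optional sanity check, you can ask the freshly installed conda for its version (the path below is the default install location used above):
~/miniconda3/bin/conda --version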
Building conda env
source ~/miniconda3/bin/activate
conda create -n cudatest
conda activate cudatest
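If your code needs a specific Python version, you can pin it when creating the environment; the version number below is only an example:
conda create -n cudatest python=3.11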
Install dependencies
conda install conda-forge::pytorch-gpu
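With the environment still active, you can already verify on the login node that the package you just installed is a CUDA build. The login node has no GPU, so torch.cuda.is_available() would return False there; torch.version.cuda instead reports which CUDA version the package was compiled against (it prints None for a CPU-only build):
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"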
Exit conda env
conda deactivate
conda deactivate
You need to run the command twice: the first time you leave your new conda env, the second time you leave the conda base environment.
Test script
Place your script in your scratch folder, for example:
/scratch/$USER/torchcuda.py
Write this code inside the script:
#!/usr/bin/env python3
import os.path
import sys

import torch

# Folder where this script is located
appdir = os.path.abspath(os.path.dirname(sys.argv[0]))

# True if PyTorch can see a CUDA-capable GPU
cuda_available = torch.cuda.is_available()

# Write the result next to the script
with open(appdir + "/output.txt", "w", encoding="utf-8") as file:
    file.write(str(cuda_available) + "\n")
Run the script in a GPU queue
The sbatch command puts your job into a queue, in this example the gpu2 queue (Nvidia Tesla P100) with 4 GB of RAM and 8 CPUs. Since we are submitting a plain command line instead of a batch script, it has to be passed through --wrap (we use the full path to conda because the environment was deactivated above):
sbatch -p gpu2 --mem=4000 --cpus-per-task=8 --wrap="~/miniconda3/bin/conda run -n cudatest python3 /scratch/$USER/torchcuda.py"
The command should return your JOBID:
Submitted batch job 14545847
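Equivalently, you can submit a small batch script instead of a single command line; the following is just a minimal sketch, and the file name torchcuda.sbatch is arbitrary:
#!/bin/bash
#SBATCH -p gpu2
#SBATCH --mem=4000
#SBATCH --cpus-per-task=8

# Run the test script inside the conda environment built on the login node
~/miniconda3/bin/conda run -n cudatest python3 /scratch/$USER/torchcuda.py
Submit it with:
sbatch torchcuda.sbatch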
You can also check that the task is running in the correct queue:
squeue -u $USER
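If you submitted a job by mistake, you can remove it from the queue using its JOBID, for example:
scancel 14545847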
As soon as the script completes, Slurm should write the standard output into a file in your home:
cat slurm-JOBID.out
For example:
cat slurm-14545847.out
Our sample code writes a file called output.txt in the same folder where the script is located, so you can read its content like this:
cat /scratch/$USER/output.txt
The file should just contain this line:
True
This simply confirms that PyTorch works and is able to access CUDA.
Note: a GPU queue can be used for at most 12 hours: this means that your scripts must be written so that they can be stopped and resumed without data loss.
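If you already know your job will finish well within the 12-hour cap, you can declare an explicit walltime at submission time; an accurate estimate may also help the scheduler start your job earlier. The two hours below are just an example value:
sbatch -p gpu2 --time=02:00:00 --mem=4000 --cpus-per-task=8 --wrap="~/miniconda3/bin/conda run -n cudatest python3 /scratch/$USER/torchcuda.py"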
Quick live test
Run an interactive task with a bash shell on a node with a Tesla P100 GPU, 4 GB of RAM, and 8 CPUs:
srun -p gpu2 --mem=4000 --cpus-per-task=8 --pty bash -i
nvidia-smi # check that an Nvidia GPU is actually available
# Activate the conda environment built on the login node
source ~/miniconda3/bin/activate
conda activate cudatest
# Load PyTorch in an interactive Python session
python3
import torch
torch.cuda.is_available()
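When you are done, leave the Python interpreter and then the interactive shell, so that the GPU node is released for other users:
exit()  # leave the Python interpreter (or press Ctrl-D)
exit    # leave the interactive shell and end the Slurm allocation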