Using conda env for PyTorch with CUDA support
Introduction
You can't simply ssh into a GPU node and install the dependencies there, because the computation nodes are not allowed to reach the internet. You must ssh into the login node, build the environment there (e.g. via conda), and then run your script inside that environment through the Slurm workload manager. So, first of all, log into the login node:
ssh frontend1.hpc.sissa.it
Note: files should be placed in your home or in your scratch folder. In the following commands we use "$USER" to automatically expand to your login username; if you prefer, you can simply replace each occurrence with your actual username.
Installing miniconda
If you don't already have it, install miniconda in your home:
cd ~
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
mkdir -p ~/miniconda3
bash ~/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda.sh
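As an optional sanity check, you can ask the freshly installed conda for its version (the path below is the default install location used above):
~/miniconda3/bin/conda --version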
Building conda env
source ~/miniconda3/bin/activate
conda create -n cudatest
conda activate cudatest
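If your code needs a specific Python version, you can pin it when creating the environment; the version number below is only an example:
conda create -n cudatest python=3.11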
Install dependencies
conda install conda-forge::pytorch-gpu
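With the environment still active, you can already verify on the login node that the package you just installed is a CUDA build. The login node has no GPU, so torch.cuda.is_available() would return False there; torch.version.cuda instead reports which CUDA version the package was compiled against (it prints None for a CPU-only build):
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"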
Exit conda env
conda deactivate
conda deactivate
You need to run the command twice: the first time you leave your new conda env, the second time you leave the conda base environment.
Test script
Place your script in your scratch folder, for example:
/scratch/$USER/torchcuda.py
Write this code inside the script:
#!/usr/bin/env python3
import os.path
import sys

import torch

# Folder where this script is located
appdir = os.path.abspath(os.path.dirname(sys.argv[0]))

# True if PyTorch can see a CUDA-capable GPU
cuda_available = torch.cuda.is_available()

# Write the result next to the script
with open(appdir + "/output.txt", "w", encoding="utf-8") as file:
    file.write(str(cuda_available) + "\n")
Run the script in a GPU queue
The sbatch command puts your job into a queue, in this example the gpu2 queue (Nvidia Tesla P100) with 4 GB of RAM and 8 CPUs. Since we are submitting a plain command line instead of a batch script, it has to be passed through --wrap (we use the full path to conda because the environment was deactivated above):
sbatch -p gpu2 --mem=4000 --cpus-per-task=8 --wrap="~/miniconda3/bin/conda run -n cudatest python3 /scratch/$USER/torchcuda.py"
The command should return your JOBID:
Submitted batch job 14545847
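Equivalently, you can submit a small batch script instead of a single command line; the following is just a minimal sketch, and the file name torchcuda.sbatch is arbitrary:
#!/bin/bash
#SBATCH -p gpu2
#SBATCH --mem=4000
#SBATCH --cpus-per-task=8

# Run the test script inside the conda environment built on the login node
~/miniconda3/bin/conda run -n cudatest python3 /scratch/$USER/torchcuda.py
Submit it with:
sbatch torchcuda.sbatch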
You can also check that the task is running in the correct queue:
squeue -u $USER
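If you submitted a job by mistake, you can remove it from the queue using its JOBID, for example:
scancel 14545847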
As soon as the script completes, Slurm should write the standard output into a file in your home:
cat slurm-JOBID.out
For example:
cat slurm-14545847.out
Our sample code writes a file called output.txt in the same folder where the script is located, so you can read its content like this:
cat /scratch/$USER/output.txt
The file should just contain this line:
True
This simply confirms that PyTorch works and is able to access CUDA.
Note: a GPU queue can be used for at most 12 hours: this means that your scripts must be written so that they can be stopped and resumed without data loss.
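If you already know your job will finish well within the 12-hour cap, you can declare an explicit walltime at submission time; an accurate estimate may also help the scheduler start your job earlier. The two hours below are just an example value:
sbatch -p gpu2 --time=02:00:00 --mem=4000 --cpus-per-task=8 --wrap="~/miniconda3/bin/conda run -n cudatest python3 /scratch/$USER/torchcuda.py"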
Quick live test
Run an interactive task with a bash shell on a node with a Tesla P100 GPU, 4 GB of RAM, and 8 CPUs:
srun -p gpu2 --mem=4000 --cpus-per-task=8 --pty bash -i
nvidia-smi # check that an Nvidia GPU is actually available
# Activate the conda environment built on the login node
source ~/miniconda3/bin/activate
conda activate cudatest
# Load PyTorch in an interactive Python session
python3
import torch
torch.cuda.is_available()
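When you are done, leave the Python interpreter and then the interactive shell, so that the GPU node is released for other users:
exit()  # leave the Python interpreter (or press Ctrl-D)
exit    # leave the interactive shell and end the Slurm allocation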