
Math Slurm Docs

Requirements

  • Confirm with your PI that you have access to Slurm and that you will be added to a PI-YEAR account
  • Confirm that you can ssh to stat.math.mcgill.ca or jump.math.mcgill.ca with your McGill account
  • Confirm that you have added an SSH key to your account on the systems
  • Confirm that you have been added to the Slack channel by sending a request to Prof. Elliot Paquette

Resources

Host           GPU Specs                         Notes
gpu-1          2 x Tesla P100-PCIE-16GB          Old - Compatible with miniconda-winter2025
aogpu2         4 x GeForce GTX 1080 Ti - 11GB    Old - Compatible with miniconda-winter2025
aogpu3         4 x GeForce RTX 2080 Ti - 11GB    Compatible with current miniconda
math-h100-r01  3 x H100 NVL - 96GB               Compatible with current miniconda

Limits and Partitions

Please use the QOS that matches the partition you submit to. Each QOS has the following limits:

  • gpu_h100_pro: 1 GPU, 24 CPUs, 256 GB of RAM and 24 hour jobs
  • gpu_debug_nvidia: 1 GPU, 2 CPUs, 16 GB of RAM and 1 week jobs
  • gpu_debug_tesla: 1 GPU, 2 CPUs, 32 GB of RAM and 48 hour jobs
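
In the examples in this document the QOS name is the same as the partition name, so both flags can be built from one value. The following is only a sketch under that assumption; slurm_flags is a hypothetical helper name, not part of Slurm or of this cluster's tooling.

```shell
#!/bin/sh
# Hypothetical helper: print matching partition, QOS and account flags.
# Assumes the QOS is named exactly like the partition, as in the examples here.
# $1 = partition/QOS name, $2 = account (PI-YEAR)
slurm_flags() {
    printf '%s %s %s %s %s %s\n' -p "$1" -q "$1" -A "$2"
}

# Example call (prints the flags, does not submit anything):
#   slurm_flags gpu_debug_tesla PI-YEAR
```

The printed flags can be pasted into an srun or sbatch command line, which avoids submitting with a QOS that does not match the partition.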

Usage

  • Use the submit nodes (jump and stat) to submit and cancel your jobs
  • The modules available on the submit nodes differ from those on the GPU nodes
  • Use the GPU nodes to build your Python modules or load the existing miniconda modules

Easy HowTo

Starting Jobs

The following shows how to connect to stat.math.mcgill.ca, load Slurm, and run a quick batch file on the Tesla GPUs:

  • ssh mcgill-username@stat.math.mcgill.ca - Connect to stat.math.mcgill.ca
  • module load slurm - Load Slurm module

Confirm you know your account: PI-YEAR

Run your code/commands

  • srun -p gpu_debug_tesla -q gpu_debug_tesla -A PI-YEAR --mem=1GB -t 1:00:00 --ntasks=1 --gpus=1 batch.sh
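
The contents of batch.sh are not given here. As a minimal sketch, assuming you only want to confirm which node the job landed on and which GPU it sees, a hypothetical batch.sh could look like this. It assumes nvidia-smi may or may not be on the node's PATH, so it checks first.

```shell
#!/bin/sh
# batch.sh (hypothetical contents): report where the job ran and what GPU it sees.
report_job() {
    echo "Running on host: $(uname -n)"
    # SLURM_JOB_ID is set by Slurm inside a job; fall back when run by hand.
    echo "Slurm job id: ${SLURM_JOB_ID:-not inside a Slurm job}"
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name,memory.total --format=csv \
            || echo "nvidia-smi failed to query the GPU"
    else
        echo "nvidia-smi not found on this node"
    fi
}

report_job
```

Since srun runs batch.sh directly, the file will likely need to be executable (chmod +x batch.sh) and may need to be given as ./batch.sh.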

Pausing or Suspending a Job

  • squeue --user=YourShortUsername - This will show your Job ID
  • scontrol suspend job_id

Resuming a Job

  • scontrol resume job_id

Stopping or Cancelling a Job

If you are done early (especially with interactive jobs), free up the resources by cancelling your job.

  • scancel job_id
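
For jobs submitted with sbatch, the job ID can also be captured at submission time instead of being looked up with squeue. sbatch reports "Submitted batch job <id>" on success; the sketch below extracts the ID from that message. job_id_from_sbatch is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read sbatch's "Submitted batch job <id>" message
# on standard input and print only the numeric job ID.
job_id_from_sbatch() {
    awk '/^Submitted batch job/ {print $4}'
}

# Usage on a submit node (shown as comments, nothing is submitted here):
#   job_id=$(sbatch batch.sh | job_id_from_sbatch)
#   scontrol suspend "$job_id"   # pause the job
#   scontrol resume "$job_id"    # let it continue
#   scancel "$job_id"            # stop it and free the resources
```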

Using VScode Remotely with Slurm

  • Confirm you have met the above requirements

Setup your Math account

  • ssh to stat.math.mcgill.ca or jump.math.mcgill.ca depending on whether you are on campus/VPN or not.
  • Run the command: vscode-remote-setup

Sample output

  • Create your own vscode file, e.g. myvscode.sh (you can copy the contents below and customize them)
#!/bin/bash
#
#SBATCH -p gpu_debug_nvidia # partition use only the debug queues for VScode
#SBATCH -c 2 # number of cores
#SBATCH --mem=4G
#SBATCH --gpus=1 # Make sure that number is within what is allowed by the QOS
#SBATCH --propagate=NONE # IMPORTANT for long jobs
#SBATCH -t 0-10:00 # time (D-HH:MM)
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -e slurm.%N.%j.err # STDERR
#SBATCH --qos=gpu_debug_nvidia # this should match the partition above with -p
#SBATCH --account=PI-YEAR # Ask your TA/PI for which account to use
#SBATCH --signal=B:TERM@60
### store relevant environment variables to a file in the home folder
env | awk -F= '$1~/^(SLURM|CUDA|NVIDIA_)/{print "export "$0}' > ~/.slurm-envvar.bash

module load dropbear # Necessary module to access the Slurm node

cleanup() {
    echo "Caught signal - removing SLURM env file"
    rm -f ~/.slurm-envvar.bash
}
trap 'cleanup' SIGTERM

### start the dropbear SSH server
# Make sure you change PORT_CHANGE_ME to the port given when you ran vscode-remote-setup
dropbear \
    -r ~/.dropbear/server-key -F -E -w -s -p PORT_CHANGE_ME \
    -P ~/.dropbear/var/run/dropbear.pid

Make sure you have updated the following in the above file:

  • PI-YEAR
  • PORT_CHANGE_ME (use value from vscode-remote-setup)
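
The env | awk line in the script writes every variable whose name starts with SLURM, CUDA or NVIDIA_ to ~/.slurm-envvar.bash as an export statement, presumably so that a later SSH session on the node can pick them up. The sketch below applies the same filter to made-up sample input; the variable values are illustrative only.

```shell
#!/bin/sh
# Same awk filter as in the script above, wrapped in a function for illustration.
slurm_env_filter() {
    awk -F= '$1~/^(SLURM|CUDA|NVIDIA_)/{print "export "$0}'
}

# Made-up sample input; inside a real job the input comes from `env`.
printf '%s\n' 'SLURM_JOB_ID=12345' 'HOME=/home/user' 'CUDA_VISIBLE_DEVICES=0' \
    | slurm_env_filter
# prints:
#   export SLURM_JOB_ID=12345
#   export CUDA_VISIBLE_DEVICES=0
```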

Now you are ready to submit your vscode job to enable remote access.

  • module load slurm - Load Slurm module
  • Submit your job:
    sbatch myvscode.sh
  • Take note of the job ID printed by sbatch and run the following to get the NODE_NAME to access with VScode:
    squeue --job Your_Job_ID
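
If you only want the node name without the rest of the squeue table, the output can be narrowed with squeue's --noheader and --format options (%N is the node list). This is a sketch; node_for_job is a hypothetical helper name, and the output will be empty while the job is still pending.

```shell
#!/bin/sh
# Hypothetical helper: print only the node name(s) allocated to a job.
# $1 = job ID reported by sbatch
node_for_job() {
    squeue --jobs="$1" --noheader --format='%N'
}

# Usage on a submit node, after `module load slurm`:
#   node_for_job Your_Job_ID
```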

Setup your VScode machine

From the previous step you will need to know:

  • McGill_USERNAME
  • PORT_CHANGE_ME
  • NODE_NAME

Add the following to your ~/.ssh/config on your client/home machine, replacing the placeholders with the values above:

Host jump
    User McGill_USERNAME
    Hostname jump.math.mcgill.ca

Host math-slurm
    HostName NODE_NAME
    ProxyJump jump
    User McGill_USERNAME
    Port PORT_CHANGE_ME
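
Because NODE_NAME can change from one job to the next, it may be convenient to generate the two entries rather than edit them by hand. The following is a sketch only; print_math_slurm_config is a hypothetical helper name and the argument values in the usage line are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print the two Host entries shown above.
# $1 = McGill username, $2 = node name from squeue, $3 = port from vscode-remote-setup
print_math_slurm_config() {
    cat <<EOF
Host jump
    User $1
    Hostname jump.math.mcgill.ca

Host math-slurm
    HostName $2
    ProxyJump jump
    User $1
    Port $3
EOF
}

# Usage on your client machine (review the output before adding it to ~/.ssh/config):
#   print_math_slurm_config McGill_USERNAME NODE_NAME PORT_CHANGE_ME
```

With a reasonably recent OpenSSH client, running ssh -G math-slurm should print the settings ssh will actually use for the host, which is a quick way to check the hostname, port and proxyjump values before opening VScode.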

You can now use VScode with the Remote-SSH Extension and select math-slurm as the remote host.

  • Note you might need to enter your SSH passphrase twice if you have not added your SSH key to your agent
  • Note you will have to accept the SSH host key the first time you connect after running vscode-remote-setup

If you run into problems, please send an email to science.it@mcgill.ca with the subject [SLURM].