Math Slurm Docs
Requirements
- Confirm with your PI that you have access to Slurm; you will then be added to a PI-YEAR account
- Confirm that you can ssh to stat.math.mcgill.ca or jump.math.mcgill.ca with your McGill account
- Confirm that you have added an SSH key to your account on the systems
- Confirm that you have been added to the Slack channel by sending a request to Prof. Elliot Paquette
Resources
Host | GPU Specs | Notes |
---|---|---|
gpu-1 | 2 x Tesla P100-PCIE-16GB | Old - Compatible with miniconda-winter2025 |
aogpu2 | 4 x GeForce GTX 1080 Ti - 11GB | Old - Compatible with miniconda-winter2025 |
aogpu3 | 4 x GeForce RTX 2080 Ti - 11GB | Compatible with current miniconda |
math-h100-r01 | 3 x H100 NVL - 96GB | Compatible with current miniconda |
Limits and Partitions
Use the QOS that matches the partition you need. Each QOS allows at most:
- gpu_h100_pro : 1 GPU, 24 CPUs, 256GB of RAM and 24 hour jobs
- gpu_debug_nvidia : 1 GPU, 2 CPUs, 16GB of RAM and 1 week jobs
- gpu_debug_tesla : 1 GPU, 2 CPUs, 32GB of RAM and 48 hour jobs
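Every submission needs a partition, a QOS and an account, so a small shell helper can keep the three consistent. The sketch below assumes, as the examples in this document do, that the QOS has the same name as its partition. The function name slurm_flags is hypothetical.

```shell
# Hypothetical helper: print the partition, QOS and account flags for one job.
# Assumes the QOS carries the same name as the partition.
slurm_flags() {
    local partition=$1 account=$2
    case "$partition" in
        gpu_h100_pro|gpu_debug_nvidia|gpu_debug_tesla) ;;
        *) echo "unknown partition: $partition" >&2; return 1 ;;
    esac
    printf -- '-p %s -q %s -A %s\n' "$partition" "$partition" "$account"
}

slurm_flags gpu_debug_tesla PI-YEAR
```

The output can be spliced into a command line, for example srun $(slurm_flags gpu_debug_tesla PI-YEAR) --gpus=1 batch.sh, where the unquoted substitution is intentional so the flags split into separate words.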
Usage
- Use the submit nodes (jump and stat) to submit and cancel your jobs
- The modules available on the submit nodes differ from those on the GPU nodes
- Use the GPU nodes to build your Python modules or load the existing miniconda modules
Easy HowTo
Starting Jobs
The following shows how to connect to stat.math.mcgill.ca, load slurm and run a quick batch file on the Tesla GPUs
ssh mcgill-username@stat.math.mcgill.ca
- Connect to stat.math.mcgill.ca
module load slurm
- Load Slurm module
Confirm you know your account: PI-YEAR
Run your code/commands
srun -p gpu_debug_tesla -q gpu_debug_tesla -A PI-YEAR --mem=1GB -t 1:00:00 --ntasks=1 --gpus=1 batch.sh
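The contents of batch.sh are not shown here. A minimal sketch, assuming you only want to confirm that the job landed on a GPU node, could look like the following. The function name report_job is hypothetical; SLURM_JOB_ID is set by Slurm inside a job, CUDA_VISIBLE_DEVICES is usually set when GPUs are allocated, and nvidia-smi is assumed to be installed on the GPU nodes.

```shell
#!/bin/bash
# batch.sh: minimal job script sketch (the contents are an assumption).

report_job() {
    echo "host: $(uname -n)"
    echo "job id: ${SLURM_JOB_ID:-not set}"
    echo "visible GPUs: ${CUDA_VISIBLE_DEVICES:-not set}"
    # List the GPUs if the NVIDIA tools are available on this node.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name,memory.total --format=csv,noheader || true
    fi
}

report_job
```

Because srun executes the file directly, the script most likely needs to be executable first (chmod +x batch.sh).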
Pausing or Suspending a Job
squeue --user=YourShortUsername
- This will show your Job ID
scontrol suspend job_id
Resuming a Job
scontrol resume job_id
Stopping or Cancelling a Job
If you are done early (especially with interactive jobs), free up the resources by cancelling your job.
scancel job_id
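If you have several jobs, the suspend, resume and cancel commands can be wrapped in a loop. This is a sketch only: the function name each_job is hypothetical, it assumes your Slurm username is the same as $USER on the submit node, and it relies on the squeue options -h (no header) and -o %i (print only the job ID).

```shell
# Hypothetical helper: apply suspend, resume or cancel to all of your jobs.
each_job() {
    local action=$1 job_id
    case "$action" in
        suspend|resume|cancel) ;;
        *) echo "usage: each_job suspend|resume|cancel" >&2; return 1 ;;
    esac
    # One job ID per line, without the header row.
    for job_id in $(squeue --user="${USER:-$(id -un)}" -h -o %i); do
        if [ "$action" = cancel ]; then
            scancel "$job_id"
        else
            scontrol "$action" "$job_id"
        fi
    done
}
```

For example, each_job suspend pauses everything you have queued or running, and each_job resume brings the suspended jobs back.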
Using VScode Remotely with Slurm
- Confirm you have met the above requirements
Setup your Math account
- ssh to stat.math.mcgill.ca or jump.math.mcgill.ca depending on whether you are on campus/VPN or not.
- run the command: vscode-remote-setup
- Create your own vscode file, e.g. myvscode.sh (you can copy the contents below and customize them)
#!/bin/bash
#
#SBATCH -p gpu_debug_nvidia # partition: use only the debug queues for VScode
#SBATCH -c 2 # number of cores
#SBATCH --mem=4G
#SBATCH --gpus=1 # Make sure that number is within what is allowed by the QOS
#SBATCH --propagate=NONE # IMPORTANT for long jobs
#SBATCH -t 0-10:00 # time (D-HH:MM)
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -e slurm.%N.%j.err # STDERR
#SBATCH --qos=gpu_debug_nvidia # this should match the partition above with -p
#SBATCH --account=PI-YEAR # Ask your TA/PI for which account to use
#SBATCH --signal=B:TERM@60
### store relevant environment variables to a file in the home folder
env | awk -F= '$1~/^(SLURM|CUDA|NVIDIA_)/{print "export "$0}' > ~/.slurm-envvar.bash
module load dropbear # Necessary module to access the Slurm node
cleanup() {
    echo "Caught signal - removing SLURM env file"
    rm -f ~/.slurm-envvar.bash
}
trap 'cleanup' SIGTERM
### start the dropbear SSH server
# Make sure you change PORT_CHANGE_ME to the port given when you ran vscode-remote-setup
dropbear \
-r ~/.dropbear/server-key -F -E -w -s -p PORT_CHANGE_ME \
-P ~/.dropbear/var/run/dropbear.pid
Make sure you have updated the following in the above file:
- PI-YEAR
- PORT_CHANGE_ME (use value from vscode-remote-setup)
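The two substitutions can also be done with sed rather than by hand. The sketch below is one way to do it, assuming the template still contains the literal strings PI-YEAR and PORT_CHANGE_ME; the function name fill_placeholders is hypothetical.

```shell
# Hypothetical helper: print the job script with both placeholders filled in.
fill_placeholders() {
    local template=$1 account=$2 port=$3
    sed -e "s/PI-YEAR/$account/g" -e "s/PORT_CHANGE_ME/$port/g" "$template"
}
```

A possible use is fill_placeholders myvscode.sh YOUR_ACCOUNT YOUR_PORT > myvscode.filled.sh, followed by grep -n -e PI-YEAR -e PORT_CHANGE_ME myvscode.filled.sh, which should print nothing if both values were replaced.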
Now you are ready to submit your vscode job to enable remote access.
module load slurm
- Load Slurm module
- Submit your job
- Take note of the above job id and run the following to get the NODE_NAME to access with VScode
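The exact submit and lookup commands are not reproduced here. A sketch that should work on a standard Slurm installation, assuming the job script is saved as myvscode.sh, is shown below: sbatch --parsable prints just the job ID, and squeue with -h -o '%T %N' prints the job's state and node. Both function names are hypothetical.

```shell
# Submit the job script and print only the job ID.
submit_vscode_job() {
    local out
    out=$(sbatch --parsable "${1:-myvscode.sh}") || return 1
    echo "${out%%;*}"   # --parsable may append ";cluster" after the ID
}

# Print the state and node name of a job, e.g. "RUNNING aogpu3".
job_node() {
    squeue -j "$1" -h -o '%T %N'
}
```

Typical use would be job_id=$(submit_vscode_job myvscode.sh) followed by job_node "$job_id". The node column stays empty while the job is still PENDING, so repeat the lookup until the state is RUNNING.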
Setup your VScode machine
From the previous step you will need to know:
- McGill_USERNAME
- PORT_CHANGE_ME
- NODE_NAME
Add the following to your ~/.ssh/config on your client/home machine, modifying the above values in the config
Host jump
    User McGill_USERNAME
    Hostname jump.math.mcgill.ca
Host math-slurm
    HostName NODE_NAME
    ProxyJump jump
    User McGill_USERNAME
    Port PORT_CHANGE_ME
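Since NODE_NAME can differ from one job to the next, it may be convenient to generate the two entries rather than edit them by hand. The following is a sketch under that assumption; the function name print_ssh_config is hypothetical.

```shell
# Hypothetical helper: print the two Host entries with your values filled in.
print_ssh_config() {
    local user=$1 node=$2 port=$3
    cat <<EOF
Host jump
    User $user
    Hostname jump.math.mcgill.ca
Host math-slurm
    HostName $node
    ProxyJump jump
    User $user
    Port $port
EOF
}
```

Review the output of print_ssh_config McGill_USERNAME NODE_NAME PORT_CHANGE_ME before adding it to ~/.ssh/config. Afterwards, ssh -G math-slurm prints the settings that ssh resolves for that host, which is a quick way to check the hostname and port.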
You can now use VScode with the Remote-SSH Extension and select math-slurm as the remote host
- Note you might need to enter your SSH passphrase twice if you have not added your ssh key to your agent
- Note you will have to accept the ssh key the first time after setting up vscode-remote-setup
If you run into problems please send an email to science.it@mcgill.ca with the subject [SLURM]