Math Slurm Docs (in development)
Requirements
- Confirm with your PI that you have access to Slurm and that you have been added to a PI-YEAR account
- Confirm that you can ssh to stat.math.mcgill.ca or jump.math.mcgill.ca with your McGill account
- Confirm that you have added an SSH key to your account on the systems
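If you still need to add a key, a minimal sketch (assuming the servers accept ssh-copy-id; use your own key type and username):

# generate a key pair on your client machine (skip if you already have one)
ssh-keygen -t ed25519
# copy the public key to the submit node
ssh-copy-id mcgill-username@stat.math.mcgill.ca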
Partitions
- gpu_h100_pro : 2 x 96GB GPUs (Nvidia H100)
- gpu_debug_nvidia : 8 x 11GB GPUs (4 x NVIDIA GeForce GTX 1080 Ti, 4 x NVIDIA GeForce RTX 2080 Ti)
- gpu_debug_tesla : 2 x 16GB GPUs (Tesla P100-PCIE-16GB)
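You can confirm the partitions and their GPUs yourself from a submit node (a sketch; run after loading the Slurm module):

module load slurm
# list each partition with its generic resources (GPUs), node count, and node names
sinfo -o "%P %G %D %N"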
Limits
Please use the QOS whose name matches the partition you are submitting to:
- gpu_h100_pro : 1 GPU, 24 CPUs, 128GB of RAM, 24-hour jobs
- gpu_debug_nvidia : 1 GPU, 2 CPUs, 16GB of RAM, 1-week jobs
- gpu_debug_tesla : 1 GPU, 2 CPUs, 32GB of RAM, 48-hour jobs
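To see the limits currently enforced, you can query the QOS definitions (a sketch; the format fields shown are one option and may differ with the Slurm version):

module load slurm
# show per-QOS wall-time and per-user resource caps
sacctmgr show qos format=Name,MaxWall,MaxTRESPU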
Usage
- Use the submit nodes (jump and stat) to submit and cancel your jobs
- The modules available differ between the submit nodes and the GPU nodes
- Use the GPU nodes to build your Python modules or to load the existing miniconda modules (see the sketch below)
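For example, to get an interactive shell on a GPU node and see which modules exist there (a sketch; replace PI-YEAR with your account):

module load slurm
# request an interactive shell on a debug GPU node
srun -p gpu_debug_tesla -q gpu_debug_tesla -A PI-YEAR --mem=4GB -t 1:00:00 --gpus=1 --pty bash
# once on the GPU node, list the modules available there
module avail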
Easy HowTo
The following shows how to connect to stat.math.mcgill.ca, load Slurm, and run a quick batch file on the Tesla GPUs:
ssh mcgill-username@stat.math.mcgill.ca
- Connect to stat.math.mcgill.ca
module load slurm
- Load the Slurm module
- Confirm you know your account: PI-YEAR
srun -p gpu_debug_tesla -q gpu_debug_tesla -A PI-YEAR --mem=1GB -t 1:00:00 --ntasks=1 --gpus=1 batch.sh
- Run your code/commands
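The srun command above expects a batch.sh in your current directory; its contents are not shown here, so this is only a minimal sketch:

#!/bin/bash
# report which node the job landed on and which GPU was allocated
hostname
nvidia-smi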
Using VScode Remotely with Slurm
- Confirm you have met the above requirements
Set up your Math account
- ssh to stat.math.mcgill.ca or jump.math.mcgill.ca depending on whether you are on campus/VPN or not.
- run the command: vscode-remote-setup
- Create your own vscode file, e.g. myvscode.sh (you can copy the contents below and customize)
#!/bin/bash
#
#SBATCH -p gpu_debug_nvidia    # partition; use only the debug queues for VScode
#SBATCH -c 2                   # number of cores
#SBATCH --mem=4G
#SBATCH --gpus=1               # make sure this number is within what the QOS allows
#SBATCH --propagate=NONE       # IMPORTANT for long jobs
#SBATCH -t 0-10:00             # time (D-HH:MM)
#SBATCH -o slurm.%N.%j.out     # STDOUT
#SBATCH -e slurm.%N.%j.err     # STDERR
#SBATCH --qos=gpu_debug_nvidia # this should match the partition given with -p
#SBATCH --account=PI-YEAR      # ask your TA/PI which account to use
#SBATCH --signal=B:TERM@60

### store relevant environment variables to a file in the home folder
env | awk -F= '$1~/^(SLURM|CUDA|NVIDIA_)/{print "export "$0}' > ~/.slurm-envvar.bash

module load dropbear           # module needed to access the Slurm node

cleanup() {
    echo "Caught signal - removing SLURM env file"
    rm -f ~/.slurm-envvar.bash
}
trap 'cleanup' SIGTERM

### start the dropbear SSH server
# Make sure you change PORT_CHANGE_ME to the port given when you ran vscode-remote-setup
dropbear \
    -r ~/.dropbear/server-key -F -E -w -s -p PORT_CHANGE_ME \
    -P ~/.dropbear/var/run/dropbear.pid
Make sure you update the following in the above file:
- PI-YEAR
- PORT_CHANGE_ME (use value from vscode-remote-setup)
Now you are ready to submit your VScode job to enable remote access.
module load slurm
- Load the Slurm module
sbatch myvscode.sh
- Submit your job
Take note of the above job id; you will need the NODE_NAME of the node your job is running on, which is what you will access with VScode.
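One way to look up the node (a sketch; replace JOBID with the job id printed by sbatch):

# print the node name (%N) for your job
squeue -j JOBID -o "%N"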
Setup your VScode machine
From the previous step you will need to know:
- McGill_USERNAME
- PORT_CHANGE_ME
- NODE_NAME
Add the following to your ~/.ssh/config on your client/home machine, modifying the above values in the config
Host jump
    User McGill_USERNAME
    HostName jump.math.mcgill.ca

Host math-slurm
    HostName NODE_NAME
    ProxyJump jump
    User McGill_USERNAME
    Port PORT_CHANGE_ME
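Before opening VScode you can test the config from a terminal (a sketch; this should print NODE_NAME after jumping through jump.math.mcgill.ca):

# run hostname on the GPU node via the jump host
ssh math-slurm hostname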
You can now use VScode with the Remote-SSH extension and select math-slurm as the remote host.
- Note: you might need to enter your SSH passphrase twice if you have not added your SSH key to your agent (see the sketch below)
- Note: you will have to accept the SSH host key the first time you connect after running vscode-remote-setup
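To avoid repeated passphrase prompts, you can add your key to a running agent (a sketch; the key path is an assumption, use your own):

# start an agent if one is not running, then load your key
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519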
If you run into problems please send an email to science.it@mcgill.ca with the subject [SLURM]