Calcul Canada
- Infos compte et expiration
- Connexion
- Job / Tâche
- Mode intéractif
- Annuler une tâche
- Vérifier continuellement le statut
- Installer des modules
- Utiliser Jupyter Notebook
Références:
- https://docs.computecanada.ca/wiki/Compute_Canada_Documentation
- https://docs.computecanada.ca/wiki/Cedar
- https://docs.computecanada.ca/wiki/Beluga
- https://docs.computecanada.ca/wiki/TensorFlow
- https://docs.computecanada.ca/wiki/AI_and_Machine_Learning
- https://docs.computecanada.ca/wiki/Handling_large_collections_of_files
- https://docs.computecanada.ca/wiki/Tutoriel_Apprentissage_machine/en#Checkpointing_a_long-running_job
- https://docs.computecanada.ca/wiki/Python
- https://docs.computecanada.ca/wiki/Jupyter
- SLURM SBATCH https://slurm.schedmd.com/sbatch.html
- SLURM Tutoriel https://slurm.schedmd.com/tutorials.html
Infos compte et expiration
- Se connecter au site https://ccdb.computecanada.ca/ avec vincent.lefalher@usherbrooke.ca et le mot de passe. L’information la plus pertinente ici est la date d’expiration.
Compte Calcul Canada de Vincent Le Falher (CCI: tvh-181, Nom d'utilisateur: vincelf)
Rôles activés
Identifiant de rôle de Calcul Canada (CCRI): tvh-181-01
Étudiant à la maîtrise, Département de géomatique appliquée, Un. de Sherbrooke, activé
expire le 1 mai 2020
Parrainé par
tpb-973-02 , Mickael Germain: Professeur, Département de géomatique appliquée, Un. de Sherbrooke
Projet avec allocation de ressources
Identifiant du projet (RAPI) Nom du groupe Statut Titre Allocations Membre? Responsable? Propriétaire?
tpb-973-ab def-germ2201-ab
Activé Apprentissage profond en télédétection 3 allocations actives - Hors CAR
Oui Non Non
Compte
Rôle
Personne Vincent Le Falher (CCI: tvh-181)
Nom d'utilisateur vincelf
Nom du groupe vincelf
UID 3095395
GID 3095395
Compte par défaut? Oui
Créé le 2020-01-23 18:55
Connexion
Prérequis
- Vérifier le status des systèmes:
https://status.computecanada.ca/
Il peut arriver qu’un serveur soit indisponible.
lefv2603@lefv2603-jetsonnano:~$ ssh vincelf@cedar.computecanada.ca
The Cedar login nodes are currently down for emergency maintenance. We appologize for the inconvenience.
Ouvrir une connexion
- Ouvrir un terminal
- faire un nssh sur le serveur désigné. Il y en a qui sont intéressant: cedar (WestGrid), beluga (ETS) et Graham (UofWaterloo)
ssh vincelf@cedar.computecanada.ca ssh vincelf@beluga.computecanada.ca ssh vincelf@graham.computecanada.ca
- Saisir le mot de passe qui a été fourni
Une fois connecté, il y a un espace dans home (~) et ~/projects (/home/vincelf/projects/def-germ2201-ab/vincelf). Les scripts peuvent résider dans le home(~) mais la commande doit être exécuté à partir de ~/projects.
Les scripts de tests se trouvent sur Cedar… qui est en cours de maintenance. Espérons qu’ils seront toujours là !
Job / Tâche
Scripts de tests
Référence: https://docs.computecanada.ca/wiki/PyTorch
- Créer un emplacement pour les scripts
[vincelf@beluga2 pytorch-test]$ pwd /home/vincelf/gae724/src/pytorch-test
- Assigner les variables d’environnements utilisées par le sbatch dans ~/.bashrc
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
# Uncomment the following line if you don't like systemctl's auto-paging feature:
# export SYSTEMD_PAGER=
export SLURM_ACCOUNT=def-germ2201-ab
export SBATCH_ACCOUNT=$SLURM_ACCOUNT
export SALLOC_ACCOUNT=$SLURM_ACCOUNT
# User specific aliases and functionsa
alias H='history'
alias Hg='history | grep '
- Scripts sbatch de test (adapté)
Note: Les nombres de cpu, gpu, mém et time semble élevé. Mais pour que la tâche puisse avoir une certaine priorité et être exécuté sans trop de délai, il vaut mieux en demander plus. C’est recommandé sur le site de CC.
–gres=gpu is not required for Beluga, as Béluga has only a single type of GPU, so no option can be provided. (https://docs.computecanada.ca/wiki/Using_GPUs_with_Slurm#On_B.C3.A9luga)
#!/bin/bash
## # of Nodes Node type CPU cores CPU memory # of GPUs NVIDIA GPU type PCIe bus topology
## 172 Béluga P100 GPU 40 191000M 4 V100-SXM2-16GB All GPUs associated with the same CPU socket
## one node on Beluga:
## cores usable memory
## 40 186 GiB (~4.6 GiB/core) +4 GPU
##SBATCH --nodes=1 # will use the 40 cpu, the 4 gpu and the full memory of 186 GiB (~4.6 GiB/core) on Beluga
##SBATCH --ntasks-per-node=32 # will create 32 tasks per node to execute the job
##SBATCH --ntasks=32
#SBATCH --gres=gpu:1 # Request GPU "generic resources"; Number of GPU(s) per node
#SBATCH --cpus-per-task=10 # Cores proportional to GPUs: 6 on Cedar, 16 on Graham, 10 on Beluga.; CPU cores/threads
#SBATCH --mem=48000M # Memory proportional to GPUs: 32000 Cedar, 64000 Graham, 48000M on Beluga.; memory per node; Requesting --mem=0 is interpreted by Slurm to mean "reserve all the available memory on each node assigned to the job."
#SBATCH --time=0-03:00
#SBATCH --output=%N-%j.out
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
module load python/3.6 cuda cudnn
SOURCEDIR=~/gae724/src/pytorch-test
# Prepare virtualenv
virtualenv --no-download $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate
pip install --no-index -r $SOURCEDIR/requirements.txt
nvidia-smi
python pytorch-test.py | tee mylog.txt
- Fichier requirements.txt pour python
torch
torchvision
torchtext
torchaudio
- script de test pytorch
import torch
x = torch.Tensor(5, 3)
print(x)
y = torch.rand(5, 3)
print(y)
# let us run the following only if CUDA is available
if torch.cuda.is_available():
x = x.cuda()
y = y.cuda()
print(x + y)
Exécution de la tâche (job), statut, résultat et sommaire
Référence: https://docs.computecanada.ca/wiki/Running_jobs
- Exécution de la tâche (job)
[vincelf@beluga2 pytorch-test]$ sbatch pytorch-test.sh
Submitted batch job 6881869
- Statut de la tâche
Note: Ne pas faire squeue sans le user, ni trop souvent La tâche peut être PD (PENDING) ou R (RUNNING)
[vincelf@beluga2 pytorch-test]$ squeue -u vincelf
JOBID USER ACCOUNT NAME ST TIME_LEFT NODES CPUS TRES_PER_N MIN_MEM NODELIST (REASON)
6881869 vincelf def-germ2201 pytorch-test.s PD 3:00:00 1 10 gpu:v100:1 48000M (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
[vincelf@beluga2 pytorch-test]$ squeue -u vincelf
JOBID USER ACCOUNT NAME ST TIME_LEFT NODES CPUS TRES_PER_N MIN_MEM NODELIST (REASON)
6881869 vincelf def-germ2201 pytorch-test.s R 2:59:36 1 10 gpu:v100:1 48000M blg5607 (None)
- Sommaîre de la tâche
[vincelf@beluga2 pytorch-test]$ seff 6881869
Job ID: 6881869
Cluster: beluga
User/Group: vincelf/vincelf
State: PENDING
Cores: 1
Efficiency not available for jobs in the PENDING state.
[vincelf@beluga2 pytorch-test]$ seff 6881869
Job ID: 6881869
Cluster: beluga
User/Group: vincelf/vincelf
State: RUNNING
Nodes: 1
Cores per node: 10
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:05:20 core-walltime
Job Wall-clock time: 00:00:32
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 46.88 GB (46.88 GB/node)
WARNING: Efficiency statistics may be misleading for RUNNING jobs.
[vincelf@beluga2 pytorch-test]$ seff 6881869
Job ID: 6881869
Cluster: beluga
User/Group: vincelf/vincelf
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 10
CPU Utilized: 00:00:37
CPU Efficiency: 7.55% of 00:08:10 core-walltime
Job Wall-clock time: 00:00:49
Memory Utilized: 1.46 GB
Memory Efficiency: 3.11% of 46.88 GB
- Résultat (output file)
[vincelf@beluga2 pytorch-test]$ ls -al
total 154
drwxr-x--- 2 vincelf vincelf 25600 Apr 16 22:54 .
drwxr-x--- 4 vincelf vincelf 25600 Apr 16 21:29 ..
-rw-r----- 1 vincelf vincelf 4117 Apr 16 22:48 blg5204-6881815.out
-rw-r----- 1 vincelf vincelf 2805 Apr 16 21:38 blg5207-6881291.out
-rw-r----- 1 vincelf vincelf 2805 Apr 16 21:59 blg5606-6881568.out
-rw-r----- 1 vincelf vincelf 4035 Apr 16 22:54 blg5607-6881869.out
-rw-r----- 1 vincelf vincelf 650 Apr 16 22:54 mylog.txt
-rw-r----- 1 vincelf vincelf 203 Apr 16 21:33 pytorch-test.py
-rwxr-x--- 1 vincelf vincelf 1323 Apr 16 22:46 pytorch-test.sh
-rw-r----- 1 vincelf vincelf 39 Apr 16 21:27 requirements.txt
[vincelf@beluga2 pytorch-test]$ cat mylog.txt
tensor([[3.0598e-37, 0.0000e+00, 3.0598e-37],
[0.0000e+00, 3.0598e-37, 0.0000e+00],
[1.8289e+28, 1.6532e+19, 4.0837e-40],
[1.3435e-30, 1.0995e+27, 4.4909e+21],
[1.7036e+19, 6.8388e-40, 1.8206e-34]])
tensor([[0.9127, 0.2009, 0.8643],
[0.4570, 0.4358, 0.2604],
[0.9498, 0.9427, 0.8709],
[0.0287, 0.1148, 0.4400],
[0.2843, 0.4051, 0.9094]])
tensor([[9.1267e-01, 2.0091e-01, 8.6429e-01],
[4.5696e-01, 4.3575e-01, 2.6037e-01],
[1.8289e+28, 1.6532e+19, 8.7085e-01],
[2.8674e-02, 1.0995e+27, 4.4909e+21],
[1.7036e+19, 4.0509e-01, 9.0939e-01]], device='cuda:0')
[vincelf@beluga2 pytorch-test]$ cat blg5607-6881869.out
Using base prefix '/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.6.3'
New python executable in /localscratch/vincelf.6881869.0/env/bin/python
Installing setuptools, pip, wheel...done.
Ignoring pip: markers 'python_version < "3"' don't match your environment
Collecting torch (from -r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 1))
Collecting torchvision (from -r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 2))
Collecting torchtext (from -r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 3))
Collecting torchaudio (from -r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 4))
Collecting six (from torchvision->-r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 2))
Collecting numpy (from torchvision->-r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 2))
Collecting pillow-simd>=4.1.1 (from torchvision->-r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 2))
Collecting tqdm (from torchtext->-r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 3))
Collecting requests (from torchtext->-r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 3))
Collecting certifi>=2017.4.17 (from requests->torchtext->-r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 3))
Collecting idna<3,>=2.5 (from requests->torchtext->-r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 3))
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (from requests->torchtext->-r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 3))
Collecting chardet<4,>=3.0.2 (from requests->torchtext->-r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 3))
Installing collected packages: torch, six, numpy, pillow-simd, torchvision, tqdm, certifi, idna, urllib3, chardet, requests, torchtext, torchaudio
Successfully installed certifi-2020.4.5.1 chardet-3.0.4 idna-2.9 numpy-1.18.1 pillow-simd-7.0.0.post3 requests-2.23.0 six-1.14.0 torch-1.4.0 torchaudio-0.2+7e15d2f torchtext-0.3.1 torchvision-0.5.0 tqdm-4.40.2 urllib3-1.25.8
Thu Apr 16 22:54:20 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:1C:00.0 Off | 0 |
| N/A 35C P0 38W / 300W | 0MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
tensor([[3.0598e-37, 0.0000e+00, 3.0598e-37],
[0.0000e+00, 3.0598e-37, 0.0000e+00],
[1.8289e+28, 1.6532e+19, 4.0837e-40],
[1.3435e-30, 1.0995e+27, 4.4909e+21],
[1.7036e+19, 6.8388e-40, 1.8206e-34]])
tensor([[0.9127, 0.2009, 0.8643],
[0.4570, 0.4358, 0.2604],
[0.9498, 0.9427, 0.8709],
[0.0287, 0.1148, 0.4400],
[0.2843, 0.4051, 0.9094]])
tensor([[9.1267e-01, 2.0091e-01, 8.6429e-01],
[4.5696e-01, 4.3575e-01, 2.6037e-01],
[1.8289e+28, 1.6532e+19, 8.7085e-01],
[2.8674e-02, 1.0995e+27, 4.4909e+21],
[1.7036e+19, 4.0509e-01, 9.0939e-01]], device='cuda:0')
Mode intéractif
- Exécuter cette commande
salloc --account=def-germ2201-ab --gres=gpu:1 --cpus-per-task=10 --mem=48000M --time=0-3:00
- puis exécuter chaque commande du script une à la fois, sauf les SBATCH
[vincelf@beluga4 DeepLabv3FineTuning-master-test]$ cat DeepLabv3FineTuning-master-test.sh
#!/bin/bash
## # of Nodes Node type CPU cores CPU memory # of GPUs NVIDIA GPU type PCIe bus topology
## 172 Béluga P100 GPU 40 191000M 4 V100-SXM2-16GB All GPUs associated with the same CPU socket
## one node on Beluga:
## cores usable memory
## 40 186 GiB (~4.6 GiB/core) +4 GPU
#SBATCH --nodes=1 # will use the 40 cpu, the 4 gpu and the full memory of 186 GiB (~4.6 GiB/core) on Beluga
#SBATCH --ntasks-per-node=8 # will create 32 tasks per node to execute the job
##SBATCH --ntasks=32
##SBATCH --gres=gpu:4 # Request GPU "generic resources"; Number of GPU(s) per node
##SBATCH --cpus-per-task=10 # Cores proportional to GPUs: 6 on Cedar, 16 on Graham, 10 on Beluga.; CPU cores/threads
##SBATCH --mem=48000M # Memory proportional to GPUs: 32000 Cedar, 64000 Graham, 48000M on Beluga.; memory per node; Requesting --mem=0 is interpreted by Slurm to mean "reserve all the available memory on each node assigned to the job."
#SBATCH --time=0-03:00
#SBATCH --output=%N-%j.out
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
module load python/3.6 cuda cudnn
SOURCEDIR=~/gae724/src/DeepLabv3FineTuning-master-test
TRAINDIR=~/DeepLabv3FineTuning-master
# Prepare virtualenv
virtualenv --no-download $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate
pip install --no-index -r $SOURCEDIR/requirements.txt
# References :
## https://expoundai.wordpress.com/2019/08/30/transfer-learning-for-segmentation-using-deeplabv3-in-pytorch/
## https://github.com/msminhas93/DeepLabv3FineTuning
python $TRAINDIR/main.py $TRAINDIR/CrackForest
Annuler une tâche
Reference: https://docs.computecanada.ca/wiki/Running_jobs#Cancelling_jobs
scancel <jobid>
scancel -u vincelf
Vérifier continuellement le statut
while true; do squeue -u vincelf; squeue -u vincelf --format %.15i -h | xargs -I{} seff {}; sleep 30; done
Installer des modules
Pour installer des modules manuellement il faut utiliser l’option –no-index sinon l’installation va essayer de récupérer le module sur l’internet, et les serveurs n’ont pas accès. Exemple: pip install --no-index matplotlib
. Avec le fichier requirements.txt c’est le même requis.
Utiliser Jupyter Notebook
Références:
- https://bioinformatics.computecanada.ca/question/jupyter-notebooks-on-compute-canada-systems/
- https://docs.computecanada.ca/wiki/Jupyter
- https://jakevdp.github.io/PythonDataScienceHandbook/01.05-ipython-and-shell-commands.html
Du côté de CC
Exécuter le script suivant à partir de l’instance de CC. Il faut noter l’installation du module de jupyter.
salloc --account=def-germ2201-ab --ntasks=1 --nodes=1 --cpus-per-task=1 --mem=4800M --time=0-3:00
module load python/3.6
# clean the python virtual pyenv36
rm -rf $SLURM_TMPDIR/pyenv36
# create the virtual env
virtualenv --no-download $SLURM_TMPDIR/pyenv36
# activate the python virtual env
source $SLURM_TMPDIR/pyenv36/bin/activate
# check the python and pip version (3.6)
python --version
pip --version
# install requirements
pip3 install --no-index torch
pip3 install --no-index scikit-learn
pip3 install --no-index scikit-image
pip3 install --no-index keras
pip3 install --no-index tensorflow_gpu
pip3 install --no-index jupyter
# wrapper script that launches Jupyter Notebook
echo -e '#!/bin/bash\nunset XDG_RUNTIME_DIR\njupyter notebook --ip $(hostname -f) --no-browser' > $VIRTUAL_ENV/bin/notebook.sh
chmod u+x $VIRTUAL_ENV/bin/notebook.sh
# starting jupiter notebok
srun $VIRTUAL_ENV/bin/notebook.sh
# url to connect to will be displayed.
# before launching the browser, create a tunnel
Voici le résultat des commandes:
(pyenv36) [vincelf@blg9324 ~]$ srun $VIRTUAL_ENV/bin/notebook.sh
slurmstepd: error: execve(): /localscratch/vincelf.8912482.0/pyenv36/bin/notebook.sh: No such file or directory
srun: error: blg9324: task 0: Exited with exit code 2
(pyenv36) [vincelf@blg9324 ~]$ echo -e '#!/bin/bash\nunset XDG_RUNTIME_DIR\njupyter notebook --ip $(hostname -f) --no-browser' > $VIRTUAL_ENV/bin/notebook.sh
(pyenv36) [vincelf@blg9324 ~]$ chmod u+x $VIRTUAL_ENV/bin/notebook.sh
(pyenv36) [vincelf@blg9324 ~]$ srun $VIRTUAL_ENV/bin/notebook.sh
[I 23:50:45.715 NotebookApp] Serving notebooks from local directory: /home/vincelf
[I 23:50:45.715 NotebookApp] The Jupyter Notebook is running at:
[I 23:50:45.715 NotebookApp] http://blg9324.int.ets1.calculquebec.ca:8888/?token=35b36532b735dd25c49d964e83d74b149478aea5f22ed84e
[I 23:50:45.715 NotebookApp] or http://127.0.0.1:8888/?token=35b36532b735dd25c49d964e83d74b149478aea5f22ed84e
[I 23:50:45.715 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 23:50:45.735 NotebookApp]
To access the notebook, open this file in a browser:
file:///home/vincelf/.local/share/jupyter/runtime/nbserver-109438-open.html
Or copy and paste one of these URLs:
http://blg9324.int.ets1.calculquebec.ca:8888/?token=35b36532b735dd25c49d964e83d74b149478aea5f22ed84e
or http://127.0.0.1:8888/?token=35b36532b735dd25c49d964e83d74b149478aea5f22ed84e
[I 23:54:35.851 NotebookApp] 302 GET /?token=35b36532b735dd25c49d964e83d74b149478aea5f22ed84e (10.74.73.2) 0.81ms
[I 00:09:46.276 NotebookApp] Kernel started: dae191ee-a2d2-408c-aeeb-08b630758740
Sur l’ordi perso
Ouvrir un terminal et exécuter la commande suivante pour ouvrir un tunnel entre la machine personnelle et le serveur de CC et le Jupyter notebook. Juste entrer le mot de passe sudo local puis ensuite le mot de passe de CC.
lefv2603@lefv2603-desktop:~$ sshuttle --dns -Nr vincelf@beluga.computecanada.ca
[local sudo] Password:
vincelf@beluga.computecanada.ca's password:
client: Connected.
Ouvrir l’URL dans un navigateur. Jupyter notebook apparait. Créer un nouveau notebook ou ouvrir un existant.