Compute Canada

Account info and expiry
Connection
1. Prerequisites
2. Opening a connection
Job
1. Test scripts
2. Running the job, status, result and summary
Interactive mode
Cancelling a job
Continuously checking the status
Installing modules
Using Jupyter Notebook
1. On the CC side
2. On the personal computer

References:

Account info and expiry

Log in to https://ccdb.computecanada.ca/ with vincent.lefalher@usherbrooke.ca and the password. The most relevant information here is the expiry date.

Compute Canada account of Vincent Le Falher (CCI: tvh-181, Username: vincelf)

Activated roles
Compute Canada Role Identifier (CCRI): tvh-181-01
Master's student, Département de géomatique appliquée, Université de Sherbrooke, activated
expires on May 1, 2020

Sponsored by
tpb-973-02, Mickael Germain: Professor, Département de géomatique appliquée, Université de Sherbrooke

Project with resource allocation
Project identifier (RAPI)	Group name	Status	Title	Allocations	Member?	Manager?	Owner?
tpb-973-ab	def-germ2201-ab	Activated	Deep learning in remote sensing	3 active allocations - non-RAC	Yes	No	No

Account
Role	
Person	Vincent Le Falher (CCI: tvh-181)
Username	vincelf
Group name	vincelf
UID	3095395
GID	3095395
Default account?	Yes
Created on	2020-01-23 18:55

Connection

Prerequisites

Check the system status:

https://status.computecanada.ca/

A server may occasionally be unavailable.

lefv2603@lefv2603-jetsonnano:~$ ssh vincelf@cedar.computecanada.ca
The Cedar login nodes are currently down for emergency maintenance. We appologize for the inconvenience.

Opening a connection

Open a terminal.
Run ssh against the chosen server. Several are of interest: Cedar (WestGrid), Béluga (ETS) and Graham (University of Waterloo).
```
ssh vincelf@cedar.computecanada.ca
ssh vincelf@beluga.computecanada.ca
ssh vincelf@graham.computecanada.ca
```
Enter the password that was provided.

Once connected, there is space in home (~) and in ~/projects (/home/vincelf/projects/def-germ2201-ab/vincelf). Scripts can live in home (~), but the command must be run from ~/projects.

The test scripts are on Cedar, which is currently under maintenance. Hopefully they will still be there!
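
The directory layout described above can be captured in a small helper. This is only a sketch: the function name `project_dir` is hypothetical, and the path pattern is inferred from the single example path in these notes rather than from any official specification.

```python
from pathlib import PurePosixPath


def project_dir(user: str, group: str) -> PurePosixPath:
    """Return the per-user project directory.

    Assumption: the pattern /home/<user>/projects/<group>/<user>,
    as seen in the example path in these notes.
    """
    return PurePosixPath("/home") / user / "projects" / group / user


# Example with the account and group names used in these notes.
print(project_dir("vincelf", "def-germ2201-ab"))
```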

Job

Test scripts

Reference: https://docs.computecanada.ca/wiki/PyTorch

Create a location for the scripts

[vincelf@beluga2 pytorch-test]$ pwd
/home/vincelf/gae724/src/pytorch-test

Set the environment variables used by sbatch in ~/.bashrc

# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
	. /etc/bashrc
fi

# Uncomment the following line if you don't like systemctl's auto-paging feature:
# export SYSTEMD_PAGER=
export SLURM_ACCOUNT=def-germ2201-ab
export SBATCH_ACCOUNT=$SLURM_ACCOUNT
export SALLOC_ACCOUNT=$SLURM_ACCOUNT

# User specific aliases and functions
alias H='history'
alias Hg='history | grep '

Test sbatch scripts (adapted)

Note: The CPU, GPU, memory and time requests may look high. However, so that the job gets reasonable priority and runs without too much delay, it is better to request more. This is recommended on the CC site.

The GPU type does not need to be specified in --gres on Béluga, as Béluga has only a single type of GPU. (https://docs.computecanada.ca/wiki/Using_GPUs_with_Slurm#On_B.C3.A9luga)

#!/bin/bash
## # of Nodes   Node type       CPU cores       CPU memory      # of GPUs       NVIDIA GPU type PCIe bus topology
## 172  Béluga P100 GPU 40      191000M 4       V100-SXM2-16GB  All GPUs associated with the same CPU socket
## one node on Beluga:
## cores        usable memory
## 40   186 GiB (~4.6 GiB/core) +4 GPU
##SBATCH --nodes=1         # will use the 40 cpu, the 4 gpu and the full memory of 186 GiB (~4.6 GiB/core) on Beluga 
##SBATCH --ntasks-per-node=32 # will create 32 tasks per node to execute the job
##SBATCH --ntasks=32 
#SBATCH --gres=gpu:1       # Request GPU "generic resources"; Number of GPU(s) per node
#SBATCH --cpus-per-task=10  # Cores proportional to GPUs: 6 on Cedar, 16 on Graham, 10 on Beluga.; CPU cores/threads
#SBATCH --mem=48000M       # Memory proportional to GPUs: 32000 Cedar, 64000 Graham, 48000M on Beluga.; memory per node; Requesting --mem=0 is interpreted by Slurm to mean "reserve all the available memory on each node assigned to the job."
#SBATCH --time=0-03:00
#SBATCH --output=%N-%j.out
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

module load python/3.6 cuda cudnn

SOURCEDIR=~/gae724/src/pytorch-test

# Prepare virtualenv
virtualenv --no-download $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate
pip install --no-index -r $SOURCEDIR/requirements.txt

nvidia-smi

python pytorch-test.py | tee mylog.txt
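
The script comments give per-GPU core and memory ratios for each cluster (6 cores / 32000M on Cedar, 16 / 64000M on Graham, 10 / 48000M on Béluga). As a minimal sketch, the resource lines of the header could be generated from those ratios. The names `PER_GPU` and `sbatch_header` are hypothetical, the numbers are copied from the comments above and not queried from the clusters, and scaling both values linearly with the GPU count is an assumption.

```python
# Per-GPU ratios copied from the comments in the sbatch script above.
# These are notes-derived values, not read from any Slurm or CC API.
PER_GPU = {
    "cedar": {"cpus": 6, "mem_mb": 32000},
    "graham": {"cpus": 16, "mem_mb": 64000},
    "beluga": {"cpus": 10, "mem_mb": 48000},
}


def sbatch_header(cluster: str, gpus: int = 1, time: str = "0-03:00") -> str:
    """Build #SBATCH lines with cores and memory proportional to the GPU count."""
    ratio = PER_GPU[cluster]
    lines = [
        "#!/bin/bash",
        f"#SBATCH --gres=gpu:{gpus}",
        f"#SBATCH --cpus-per-task={ratio['cpus'] * gpus}",
        f"#SBATCH --mem={ratio['mem_mb'] * gpus}M",
        f"#SBATCH --time={time}",
        "#SBATCH --output=%N-%j.out",
    ]
    return "\n".join(lines)


# Reproduces the resource requests of the Béluga test script.
print(sbatch_header("beluga"))
```

With the defaults, the Béluga header matches the requests used in the test script (one GPU, 10 cores, 48000M, three hours).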

requirements.txt file for Python

torch
torchvision
torchtext
torchaudio

PyTorch test script

import torch
x = torch.Tensor(5, 3)
print(x)
y = torch.rand(5, 3)
print(y)
# let us run the following only if CUDA is available
if torch.cuda.is_available():
  x = x.cuda()
  y = y.cuda()
  print(x + y)

Running the job, status, result and summary

Reference: https://docs.computecanada.ca/wiki/Running_jobs

Running the job

[vincelf@beluga2 pytorch-test]$ sbatch pytorch-test.sh
Submitted batch job 6881869

Job status

Note: Do not run squeue without the user filter, nor too often. The job can be PD (PENDING) or R (RUNNING).

[vincelf@beluga2 pytorch-test]$ squeue -u vincelf
          JOBID     USER      ACCOUNT           NAME  ST  TIME_LEFT NODES CPUS TRES_PER_N MIN_MEM NODELIST (REASON) 
        6881869  vincelf def-germ2201 pytorch-test.s  PD    3:00:00     1   10 gpu:v100:1  48000M  (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions) 
	
[vincelf@beluga2 pytorch-test]$ squeue -u vincelf
          JOBID     USER      ACCOUNT           NAME  ST  TIME_LEFT NODES CPUS TRES_PER_N MIN_MEM NODELIST (REASON) 
        6881869  vincelf def-germ2201 pytorch-test.s   R    2:59:36     1   10 gpu:v100:1  48000M blg5607 (None) 
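
To pull the state out of a squeue line like the ones above, a small parser can help. This is a sketch under stated assumptions: the field list mirrors the column header shown in this transcript (other sites may configure a different format), the job name is assumed to contain no spaces, and the names `FIELDS` and `parse_squeue_line` are hypothetical.

```python
from typing import Dict

# Column order taken from the squeue header shown in these notes.
FIELDS = ["jobid", "user", "account", "name", "state",
          "time_left", "nodes", "cpus", "tres_per_node", "min_mem"]


def parse_squeue_line(line: str) -> Dict[str, str]:
    """Split one squeue data line into named fields.

    The trailing "NODELIST (REASON)" part may contain spaces, so it is
    kept as a single string under the key "nodelist_reason".
    """
    parts = line.split(None, len(FIELDS))
    if len(parts) < len(FIELDS):
        raise ValueError(f"unexpected squeue line: {line!r}")
    job = dict(zip(FIELDS, parts))
    rest = parts[len(FIELDS)] if len(parts) > len(FIELDS) else ""
    job["nodelist_reason"] = rest.strip()
    return job


running = ("        6881869  vincelf def-germ2201 pytorch-test.s   R"
           "    2:59:36     1   10 gpu:v100:1  48000M blg5607 (None) ")
print(parse_squeue_line(running)["state"])
```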

Job summary

[vincelf@beluga2 pytorch-test]$ seff 6881869
Job ID: 6881869
Cluster: beluga
User/Group: vincelf/vincelf
State: PENDING
Cores: 1
Efficiency not available for jobs in the PENDING state.

[vincelf@beluga2 pytorch-test]$ seff 6881869
Job ID: 6881869
Cluster: beluga
User/Group: vincelf/vincelf
State: RUNNING
Nodes: 1
Cores per node: 10
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:05:20 core-walltime
Job Wall-clock time: 00:00:32
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 46.88 GB (46.88 GB/node)
WARNING: Efficiency statistics may be misleading for RUNNING jobs.

[vincelf@beluga2 pytorch-test]$ seff 6881869
Job ID: 6881869
Cluster: beluga
User/Group: vincelf/vincelf
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 10
CPU Utilized: 00:00:37
CPU Efficiency: 7.55% of 00:08:10 core-walltime
Job Wall-clock time: 00:00:49
Memory Utilized: 1.46 GB
Memory Efficiency: 3.11% of 46.88 GB
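
The efficiency figures in the completed-job summary can be checked by hand: the core-walltime is the wall-clock time multiplied by the number of cores (49 s × 10 = 490 s, i.e. 00:08:10), and the CPU efficiency is the CPU time divided by that. The sketch below is not part of seff; the helper names are hypothetical and it assumes times in HH:MM:SS form, as in the output above.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to seconds (no day prefix handled)."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_utilized: str, wall_clock: str, cores: int) -> float:
    """CPU time used as a percentage of wall-clock time times core count."""
    core_walltime = hms_to_seconds(wall_clock) * cores
    return 100.0 * hms_to_seconds(cpu_utilized) / core_walltime


# Values from the COMPLETED summary above.
print(round(cpu_efficiency("00:00:37", "00:00:49", 10), 2))  # 7.55
print(round(100.0 * 1.46 / 46.88, 2))                        # 3.11 (memory)
```

Both low percentages suggest this small test job used only a fraction of the 10 cores and 48000M it requested.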

Result (output file)

[vincelf@beluga2 pytorch-test]$ ls -al
total 154
drwxr-x--- 2 vincelf vincelf 25600 Apr 16 22:54 .
drwxr-x--- 4 vincelf vincelf 25600 Apr 16 21:29 ..
-rw-r----- 1 vincelf vincelf  4117 Apr 16 22:48 blg5204-6881815.out
-rw-r----- 1 vincelf vincelf  2805 Apr 16 21:38 blg5207-6881291.out
-rw-r----- 1 vincelf vincelf  2805 Apr 16 21:59 blg5606-6881568.out
-rw-r----- 1 vincelf vincelf  4035 Apr 16 22:54 blg5607-6881869.out
-rw-r----- 1 vincelf vincelf   650 Apr 16 22:54 mylog.txt
-rw-r----- 1 vincelf vincelf   203 Apr 16 21:33 pytorch-test.py
-rwxr-x--- 1 vincelf vincelf  1323 Apr 16 22:46 pytorch-test.sh
-rw-r----- 1 vincelf vincelf    39 Apr 16 21:27 requirements.txt
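
The .out files in this listing are named by the `--output=%N-%j.out` directive of the sbatch script, where `%N` is the node name and `%j` the job ID. As a small sketch, the two parts can be recovered from a file name; `parse_out_name` is a hypothetical helper and it assumes node names without hyphens, as in the listing above.

```python
import re
from typing import Optional, Tuple

# Matches names produced by --output=%N-%j.out, e.g. blg5607-6881869.out
OUT_RE = re.compile(r"^(?P<node>[^-]+)-(?P<jobid>\d+)\.out$")


def parse_out_name(name: str) -> Optional[Tuple[str, int]]:
    """Return (node, job_id) for a Slurm output file name, or None."""
    match = OUT_RE.match(name)
    if match is None:
        return None
    return match.group("node"), int(match.group("jobid"))


print(parse_out_name("blg5607-6881869.out"))
```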

[vincelf@beluga2 pytorch-test]$ cat mylog.txt 
tensor([[3.0598e-37, 0.0000e+00, 3.0598e-37],
        [0.0000e+00, 3.0598e-37, 0.0000e+00],
        [1.8289e+28, 1.6532e+19, 4.0837e-40],
        [1.3435e-30, 1.0995e+27, 4.4909e+21],
        [1.7036e+19, 6.8388e-40, 1.8206e-34]])
tensor([[0.9127, 0.2009, 0.8643],
        [0.4570, 0.4358, 0.2604],
        [0.9498, 0.9427, 0.8709],
        [0.0287, 0.1148, 0.4400],
        [0.2843, 0.4051, 0.9094]])
tensor([[9.1267e-01, 2.0091e-01, 8.6429e-01],
        [4.5696e-01, 4.3575e-01, 2.6037e-01],
        [1.8289e+28, 1.6532e+19, 8.7085e-01],
        [2.8674e-02, 1.0995e+27, 4.4909e+21],
        [1.7036e+19, 4.0509e-01, 9.0939e-01]], device='cuda:0')
	
[vincelf@beluga2 pytorch-test]$ cat blg5607-6881869.out
Using base prefix '/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.6.3'
New python executable in /localscratch/vincelf.6881869.0/env/bin/python
Installing setuptools, pip, wheel...done.
Ignoring pip: markers 'python_version < "3"' don't match your environment
Collecting torch (from -r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 1))
Collecting torchvision (from -r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 2))
Collecting torchtext (from -r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 3))
Collecting torchaudio (from -r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 4))
Collecting six (from torchvision->-r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 2))
Collecting numpy (from torchvision->-r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 2))
Collecting pillow-simd>=4.1.1 (from torchvision->-r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 2))
Collecting tqdm (from torchtext->-r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 3))
Collecting requests (from torchtext->-r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 3))
Collecting certifi>=2017.4.17 (from requests->torchtext->-r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 3))
Collecting idna<3,>=2.5 (from requests->torchtext->-r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 3))
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (from requests->torchtext->-r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 3))
Collecting chardet<4,>=3.0.2 (from requests->torchtext->-r /home/vincelf/gae724/src/pytorch-test/requirements.txt (line 3))
Installing collected packages: torch, six, numpy, pillow-simd, torchvision, tqdm, certifi, idna, urllib3, chardet, requests, torchtext, torchaudio
Successfully installed certifi-2020.4.5.1 chardet-3.0.4 idna-2.9 numpy-1.18.1 pillow-simd-7.0.0.post3 requests-2.23.0 six-1.14.0 torch-1.4.0 torchaudio-0.2+7e15d2f torchtext-0.3.1 torchvision-0.5.0 tqdm-4.40.2 urllib3-1.25.8
Thu Apr 16 22:54:20 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:1C:00.0 Off |                    0 |
| N/A   35C    P0    38W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
tensor([[3.0598e-37, 0.0000e+00, 3.0598e-37],
        [0.0000e+00, 3.0598e-37, 0.0000e+00],
        [1.8289e+28, 1.6532e+19, 4.0837e-40],
        [1.3435e-30, 1.0995e+27, 4.4909e+21],
        [1.7036e+19, 6.8388e-40, 1.8206e-34]])
tensor([[0.9127, 0.2009, 0.8643],
        [0.4570, 0.4358, 0.2604],
        [0.9498, 0.9427, 0.8709],
        [0.0287, 0.1148, 0.4400],
        [0.2843, 0.4051, 0.9094]])
tensor([[9.1267e-01, 2.0091e-01, 8.6429e-01],
        [4.5696e-01, 4.3575e-01, 2.6037e-01],
        [1.8289e+28, 1.6532e+19, 8.7085e-01],
        [2.8674e-02, 1.0995e+27, 4.4909e+21],
        [1.7036e+19, 4.0509e-01, 9.0939e-01]], device='cuda:0')

Interactive mode

Run this command

salloc --account=def-germ2201-ab --gres=gpu:1 --cpus-per-task=10 --mem=48000M --time=0-3:00

then run each command of the script one at a time, skipping the #SBATCH lines

[vincelf@beluga4 DeepLabv3FineTuning-master-test]$ cat DeepLabv3FineTuning-master-test.sh 
#!/bin/bash
## # of Nodes	Node type	CPU cores	CPU memory	# of GPUs	NVIDIA GPU type	PCIe bus topology
## 172	Béluga P100 GPU	40	191000M	4	V100-SXM2-16GB	All GPUs associated with the same CPU socket
## one node on Beluga:
## cores	usable memory
## 40	186 GiB (~4.6 GiB/core)	+4 GPU
#SBATCH --nodes=1         # will use the 40 cpu, the 4 gpu and the full memory of 186 GiB (~4.6 GiB/core) on Beluga 
#SBATCH --ntasks-per-node=8 # will create 8 tasks per node to execute the job
##SBATCH --ntasks=32 
##SBATCH --gres=gpu:4       # Request GPU "generic resources"; Number of GPU(s) per node
##SBATCH --cpus-per-task=10  # Cores proportional to GPUs: 6 on Cedar, 16 on Graham, 10 on Beluga.; CPU cores/threads
##SBATCH --mem=48000M       # Memory proportional to GPUs: 32000 Cedar, 64000 Graham, 48000M on Beluga.; memory per node; Requesting --mem=0 is interpreted by Slurm to mean "reserve all the available memory on each node assigned to the job."
#SBATCH --time=0-03:00
#SBATCH --output=%N-%j.out

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

module load python/3.6 cuda cudnn

SOURCEDIR=~/gae724/src/DeepLabv3FineTuning-master-test
TRAINDIR=~/DeepLabv3FineTuning-master

# Prepare virtualenv
virtualenv --no-download $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate
pip install --no-index -r $SOURCEDIR/requirements.txt

# References : 
## https://expoundai.wordpress.com/2019/08/30/transfer-learning-for-segmentation-using-deeplabv3-in-pytorch/
## https://github.com/msminhas93/DeepLabv3FineTuning
python $TRAINDIR/main.py $TRAINDIR/CrackForest

Cancelling a job

Reference: https://docs.computecanada.ca/wiki/Running_jobs#Cancelling_jobs

scancel <jobid>
scancel -u vincelf

Continuously checking the status

while true; do squeue -u vincelf; squeue -u vincelf --format %.15i -h | xargs -I{} seff {}; sleep 30; done
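
The same polling loop can be sketched in Python, which makes it easier to bound the number of rounds. This is an illustration, not a replacement for the one-liner: the function names are hypothetical, the commands are the ones used above (`squeue -u`, `squeue --format %.15i -h`, `seff`), and the `runner` and `sleep` parameters exist only so the loop can be exercised without a cluster.

```python
import subprocess
import time
from typing import Callable, List


def parse_job_ids(squeue_output: str) -> List[str]:
    """Extract job IDs from `squeue --format %.15i -h` output.

    That format prints one right-justified job ID per line.
    """
    return [line.strip() for line in squeue_output.splitlines() if line.strip()]


def run(cmd: List[str]) -> str:
    """Run a command and return its standard output as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def watch(user: str, interval: float = 30.0, rounds: int = 1,
          runner: Callable[[List[str]], str] = run,
          sleep: Callable[[float], None] = time.sleep) -> None:
    """Print squeue and per-job seff output, `rounds` times."""
    for i in range(rounds):
        print(runner(["squeue", "-u", user]))
        ids = parse_job_ids(runner(["squeue", "-u", user, "--format", "%.15i", "-h"]))
        for job_id in ids:
            print(runner(["seff", job_id]))
        if i + 1 < rounds:
            sleep(interval)
```

On a login node, `watch("vincelf", rounds=10)` would poll ten times at 30-second intervals and then stop, which keeps the load on the scheduler bounded.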

Installing modules

To install modules manually, use the --no-index option; otherwise the installation tries to fetch the module from the internet, which the servers cannot reach. Example: pip install --no-index matplotlib. The same requirement applies when installing from requirements.txt.
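
As a small sketch of that rule, a helper can build the pip command so that `--no-index` is never forgotten. The function name `pip_install_cmd` is hypothetical; it only assembles the argument list and does not run pip.

```python
from typing import List, Optional


def pip_install_cmd(*packages: str, requirements: Optional[str] = None) -> List[str]:
    """Build a `pip install` command that always includes --no-index."""
    cmd = ["pip", "install", "--no-index"]
    if requirements is not None:
        cmd += ["-r", requirements]
    cmd += list(packages)
    return cmd


print(" ".join(pip_install_cmd("matplotlib")))
print(" ".join(pip_install_cmd(requirements="requirements.txt")))
```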

Using Jupyter Notebook

References:

On the CC side

Run the following script from the CC instance. Note that it installs the jupyter module.

salloc --account=def-germ2201-ab --ntasks=1 --nodes=1 --cpus-per-task=1 --mem=4800M --time=0-3:00

module load python/3.6

# clean the python virtual pyenv36
rm -rf $SLURM_TMPDIR/pyenv36

# create the virtual env
virtualenv --no-download $SLURM_TMPDIR/pyenv36

# activate the python virtual env
source $SLURM_TMPDIR/pyenv36/bin/activate

# check the python and pip version (3.6)
python --version
pip --version

# install requirements
pip3 install --no-index torch
pip3 install --no-index scikit-learn
pip3 install --no-index scikit-image
pip3 install --no-index keras
pip3 install --no-index tensorflow_gpu
pip3 install --no-index jupyter

# wrapper script that launches Jupyter Notebook
echo -e '#!/bin/bash\nunset XDG_RUNTIME_DIR\njupyter notebook --ip $(hostname -f) --no-browser' > $VIRTUAL_ENV/bin/notebook.sh
chmod u+x $VIRTUAL_ENV/bin/notebook.sh

# start Jupyter Notebook
srun $VIRTUAL_ENV/bin/notebook.sh

# url to connect to will be displayed. 
# before launching the browser, create a tunnel

Here is the output of the commands:

(pyenv36) [vincelf@blg9324 ~]$ srun $VIRTUAL_ENV/bin/notebook.sh
slurmstepd: error: execve(): /localscratch/vincelf.8912482.0/pyenv36/bin/notebook.sh: No such file or directory
srun: error: blg9324: task 0: Exited with exit code 2
(pyenv36) [vincelf@blg9324 ~]$ echo -e '#!/bin/bash\nunset XDG_RUNTIME_DIR\njupyter notebook --ip $(hostname -f) --no-browser' > $VIRTUAL_ENV/bin/notebook.sh
(pyenv36) [vincelf@blg9324 ~]$ chmod u+x $VIRTUAL_ENV/bin/notebook.sh
(pyenv36) [vincelf@blg9324 ~]$ srun $VIRTUAL_ENV/bin/notebook.sh
[I 23:50:45.715 NotebookApp] Serving notebooks from local directory: /home/vincelf
[I 23:50:45.715 NotebookApp] The Jupyter Notebook is running at:
[I 23:50:45.715 NotebookApp] http://blg9324.int.ets1.calculquebec.ca:8888/?token=35b36532b735dd25c49d964e83d74b149478aea5f22ed84e
[I 23:50:45.715 NotebookApp]  or http://127.0.0.1:8888/?token=35b36532b735dd25c49d964e83d74b149478aea5f22ed84e
[I 23:50:45.715 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 23:50:45.735 NotebookApp] 
    
    To access the notebook, open this file in a browser:
        file:///home/vincelf/.local/share/jupyter/runtime/nbserver-109438-open.html
    Or copy and paste one of these URLs:
        http://blg9324.int.ets1.calculquebec.ca:8888/?token=35b36532b735dd25c49d964e83d74b149478aea5f22ed84e
     or http://127.0.0.1:8888/?token=35b36532b735dd25c49d964e83d74b149478aea5f22ed84e

[I 23:54:35.851 NotebookApp] 302 GET /?token=35b36532b735dd25c49d964e83d74b149478aea5f22ed84e (10.74.73.2) 0.81ms
[I 00:09:46.276 NotebookApp] Kernel started: dae191ee-a2d2-408c-aeeb-08b630758740
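
The URL and token needed in the browser can be picked out of a log like the one above. The sketch below uses only the Python standard library; the helper names are hypothetical, and preferring the compute-node URL over the 127.0.0.1 one is an assumption based on the notebook running on the remote node rather than on the personal machine.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

URL_RE = re.compile(r"https?://\S+")


def first_notebook_url(log_text: str) -> Optional[str]:
    """Return the first non-loopback http(s) URL found in the log."""
    for match in URL_RE.finditer(log_text):
        url = match.group(0)
        if urlparse(url).hostname not in ("127.0.0.1", "localhost"):
            return url
    return None


def token_of(url: str) -> Optional[str]:
    """Return the value of the `token` query parameter, if present."""
    values = parse_qs(urlparse(url).query).get("token")
    return values[0] if values else None


log = (
    "[I 23:50:45.715 NotebookApp] The Jupyter Notebook is running at:\n"
    "[I 23:50:45.715 NotebookApp] http://blg9324.int.ets1.calculquebec.ca:8888/"
    "?token=35b36532b735dd25c49d964e83d74b149478aea5f22ed84e\n"
)
url = first_notebook_url(log)
print(url)
print(token_of(url))
```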

On the personal computer

Open a terminal and run the following command to open a tunnel between the personal machine and the CC server running Jupyter Notebook. Enter the local sudo password first, then the CC password.

lefv2603@lefv2603-desktop:~$ sshuttle --dns -Nr vincelf@beluga.computecanada.ca
[local sudo] Password: 
vincelf@beluga.computecanada.ca's password: 
client: Connected.

Open the URL in a browser. Jupyter Notebook appears. Create a new notebook or open an existing one.