From 0cc3c3d975c81fc8a0f09f0e0c97805155f8c8ee Mon Sep 17 00:00:00 2001
From: David Anton <d.anton@tu-braunschweig.de>
Date: Wed, 8 Feb 2023 20:15:28 +0100
Subject: [PATCH] Add chapter on KI4ALL cluster.

---
 _toc.yml                            |   4 +
 content/03_cluster/0_intro.md       |  27 ++++++
 content/03_cluster/1_singularity.md | 124 ++++++++++++++++++++++++++++
 content/03_cluster/2_slurm.md       |  76 +++++++++++++++++
 4 files changed, 231 insertions(+)
 create mode 100644 content/03_cluster/0_intro.md
 create mode 100644 content/03_cluster/1_singularity.md
 create mode 100644 content/03_cluster/2_slurm.md

diff --git a/_toc.yml b/_toc.yml
index fd2315e..4a0c210 100644
--- a/_toc.yml
+++ b/_toc.yml
@@ -14,4 +14,8 @@ chapters:
   - file: content/02_docker/1_setup
   - file: content/02_docker/2_usage
   - file: content/02_docker/3_vscode
+- file: content/03_cluster/0_intro
+  sections:
+  - file: content/03_cluster/1_singularity
+  - file: content/03_cluster/2_slurm
 
diff --git a/content/03_cluster/0_intro.md b/content/03_cluster/0_intro.md
new file mode 100644
index 0000000..f86ff47
--- /dev/null
+++ b/content/03_cluster/0_intro.md
@@ -0,0 +1,27 @@
+# KI4ALL-Cluster
+
+The **KI4ALL-Cluster** provides a number of high-performance GPUs on which simulations can be run in the context of research and student projects. The resources on the cluster are managed by the [Slurm Workload Manager](https://slurm.schedmd.com/documentation.html) (short: *Slurm*). Simulations are run on the cluster in [SingularityCE](https://sylabs.io/singularity/) (short: *Singularity*) containers, which package up your software together with all required dependencies. This makes your software portable, reproducible and independent of the operating system.
+
+In the following sections, we give an introduction to Slurm and Singularity. However, you should also consult the linked documentation, as it covers details that go beyond this introduction.
+
+:::{note}
+Running code on the cluster also requires basic knowledge of using your *terminal*. There is another chapter in this *Knowledge Base* on this topic which might be helpful.
+:::
+
+## Log in to the cluster
+In order to run your simulation on the cluster, you must first log in to the cluster using *SSH* with the following command:
+
+```{code-block} bash
+$ ssh username@ki4alllogin.irmb.bau.tu-bs.de
+```
+
+You will then be prompted to enter your *password*. To log in, please use the **username** and **password** of your **TU Braunschweig account**, which you also use to log in to other TU Braunschweig services, such as Stud.IP. If you do not have an account for the cluster yet, please contact the supervisor of your student project.
+
+## Clone your git repository
+After you have successfully logged in to the cluster, you can clone your *git repository* to your home directory. It is useful to first create a new subdirectory and then clone the repository into it.
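+
+A minimal sketch of these steps could look as follows; the directory name and the repository URL are only placeholders for your own project:
+
+```{code-block} bash
+$ mkdir my-project
+$ cd my-project
+$ git clone https://git.example.com/my-repository.git
+```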
+
+The Singularity definition file should definitely be part of your repository. However, you should avoid storing the Singularity image file in your repository or letting git track it: depending on the dependencies of your software, the image can be several gigabytes in size. The difference between the Singularity definition file and the image file is explained in the following sections.
+
+:::{note}
+This *Knowledge Base* also provides a chapter introducing the *git version control system*.
+:::
\ No newline at end of file
diff --git a/content/03_cluster/1_singularity.md b/content/03_cluster/1_singularity.md
new file mode 100644
index 0000000..16deeec
--- /dev/null
+++ b/content/03_cluster/1_singularity.md
@@ -0,0 +1,124 @@
+# Singularity
+
+[SingularityCE](https://sylabs.io/singularity/) (short: *Singularity*) is a container platform that allows you to containerize your software. *Containers* package your software and make it reproducible, portable and independent of the operating system, since all software dependencies are packaged alongside your software in the container. [1]
+
+The container image is defined by the *Singularity definition file*, which acts as a blueprint specifying how to build the container. In the next step, a *Singularity image file* is built from this definition file. The image can finally be used on the cluster, for example, to run a simulation. The individual steps are explained in more detail below. For a comprehensive introduction to the Singularity container technology, we refer to [1].
+
+
+## Basics
+For this introduction, it might be helpful to understand the following basics:
+- The effective user inside a container is the same as the user who ran the container on the host system. This means that the user inside the container has the same access to files and devices as the user outside the container. Access is controlled by standard *POSIX permissions*. [2]
+- By default, when running a container, the *user's home directory*, the */tmp* directory and the *current working directory* are mounted into the container. The current working directory inside the container after it is started is the same as the current working directory on the host system from which the container was run. [1]
+
+
+## Definition file
+Here is an exemplary, minimal *Singularity definition file* (`example.def`):
+
+```{code-block} singularity
+Bootstrap: docker
+From: python:latest
+
+%files
+    path/to/requirements.txt /data/requirements.txt
+    path/to/app /data/app
+
+%post
+    # Environment variables (only available at build time)
+    export CONTAINER_HOME=/data
+
+    # Install Python requirements
+    pip install --upgrade pip
+    pip install -r $CONTAINER_HOME/requirements.txt
+
+    # Set permissions
+    chmod 777 $CONTAINER_HOME
+
+%runscript
+    exec python /data/app/simulation.py
+```
+
+The first two lines form the *header* and specify the base container image to start from. Usually, a custom container image is not defined completely from scratch but builds on top of an existing container image, which provides, for example, the operating system and certain software dependencies. In the above example, we build on a container image with a Linux operating system that also has the latest Python version pre-installed. This Python container image is pulled from [Docker Hub](https://hub.docker.com). There are also other container registries from which base images can be pulled. Please check the Singularity documentation [1] for more information.
+
+The rest of the definition file is structured in *sections*. A short description of all available sections follows:
+|Section|Description|
+|:---|:---|
+|`%setup`|The commands in this section are executed on the host system outside the container. Be careful with this section since the commands can potentially damage the host system!|
+|`%files`|This is the section to copy files (and also directories) from the host system into the container.|
+|`%app*` (e.g. `%appinstall`, `%apprun`)|Singularity allows you to install apps as internal modules based on the concept of the [Scientific Filesystem (SCIF)](https://sci-f.github.io). This feature is helpful if you have multiple apps with nearly the same dependencies.|
+|`%post`|This is the section where you can download and install dependencies, create new directories or set environment variables at build time.|
+|`%test`|Here you can validate the container at the end of the build process.|
+|`%environment`|This section allows you to define environment variables that are set at runtime. Note that these variables are not available at build time.|
+|`%startscript`|The contents of this section are written to a file within the container which is executed when the `start` command is invoked.|
+|`%runscript`|The contents of this section are written to a file within the container which is executed when the container image is run (either via the `run` command or by executing the container directly as a command).|
+|`%labels`|This section is used to add metadata to the container.|
+|`%help`|In this section, you can give help on how to use the container. The text written in this section will be displayed when the `run-help` command is invoked.|
+
+Note that the sections are executed in the above order during the build process. The order in which the sections are listed in the definition file does not matter. Furthermore, not every section necessarily has to be included in the definition file.
+
+In the above example, we build on a Python container image. First, in the `%files` section, we copy the *app* directory containing our code and the *requirements.txt* file from the host system into the container. In the `%post` section, we define the environment variable `CONTAINER_HOME`. Note that this environment variable is only available at build time. We then install the requirements in the container using the Python package manager *pip*. Last but not least, we change the permissions of the */data* directory so that it is readable, writable and executable for all users. In the spirit of reproducibility, this allows users with different UIDs and GIDs to execute the code in the container. Finally, the `%runscript` section defines that `simulation.py` is executed when the `run` command is invoked or the container is executed directly.
+
+The definition file can be written in any text or code editor and, by convention, has the file extension `.def`.
+
+
+## Building a container image
+It is strongly recommended to build the Singularity container image on the login node.
+
+The Singularity container image can be built from the Singularity definition file with the following command:
+```{code-block} bash
+$ singularity build --fakeroot example.sif path/to/example.def
+```
+
+Here, *example.sif* is the name of the image file to be built and *path/to/example.def* is the path to the definition file from which the image is built.
+
+The `--fakeroot` option enables the fakeroot feature (also referred to as *rootless mode*). This feature allows unprivileged users to run a container or a container build as a *fake root* user by leveraging user namespace UID/GID mapping. With this mapping, you remain your own user on the host system but act as the root user inside the container build. For more information on this feature, we refer to the Singularity documentation [1].
+
+Besides the `--fakeroot` option, there are many other build options available, see [1]. The most important build options include:
+|Option|Description|
+|:---|:---|
+|`--fakeroot`|Enables fakeroot feature (see above).|
+|`--force`|This option will delete and overwrite an existing Singularity image file with the same file name.|
+|`--bind`|Allows you to mount a directory or file during build time.|
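+
+For instance, to rebuild the image from the example above and overwrite an existing `example.sif`, the `--force` option can be combined with `--fakeroot` (a sketch using the file names from above):
+
+```{code-block} bash
+$ singularity build --fakeroot --force example.sif path/to/example.def
+```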
+
+The built Singularity container image is nothing more than a single file, also called the Singularity image file, with the file extension `.sif`.
+
+It is recommended to save the image file at the same directory level as the folder containing your code. You should avoid saving the image file inside your repository.
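+
+For illustration, the resulting layout of your project directory might look like this (the directory and file names are only examples):
+
+```{code-block} bash
+$ ls ~/my-project
+example.sif  my-repository
+```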
+
+
+## Running simulations inside the container
+
+In the above example, you can run your simulation inside the container with the following command:
+```{code-block} bash
+$ singularity run --nv example.sif
+```
+
+The same result can be achieved with the `exec` command:
+```{code-block} bash
+$ singularity exec --nv example.sif python /data/app/simulation.py
+```
+
+The `--nv` option enables *Nvidia GPU support* and is necessary to run simulations on Nvidia GPUs. Other important options for running or executing containers include:
+|Option|Description|
+|:---|:---|
+|`--nv`|Enables Nvidia GPU support (see above).|
+|`--bind`|With this option you can specify bind paths in the format `src:dest`, where `src` and `dest` are the paths outside and inside the container, respectively (see the example below).|
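+
+As an example, a results directory on the host could be mounted into the container at runtime; the paths below are only placeholders:
+
+```{code-block} bash
+$ singularity run --nv --bind /path/on/host/results:/data/results example.sif
+```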
+
+For a complete list of all options, we refer to the Singularity documentation [1]. 
+
+In addition to the two commands shown above, there are other ways to run a simulation inside a container, in particular by defining app modules based on the Scientific Filesystem (SCIF), see [1].
+
+:::{note}
+Care should be taken that the container is never executed directly on the login node. The above commands for running the simulation in the container should always be executed via a Slurm job script and never directly in the terminal.
+:::
+
+## Further help
+In addition to the documentation [1], you can get more help for each command and a list of all available options by entering the following command in the terminal, replacing `<command>` with the command in question:
+```{code-block} bash
+$ singularity help <command>
+```
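+
+For example, `singularity help build` shows the usage and all available options of the `build` command.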
+
+
+## References:
+||Reference|
+|:---:|:---|
+|[1]| *SingularityCE User Guide*, version 3.9, released 6 February 2023, accessed 8 February 2023, https://docs.sylabs.io/guides/main/user-guide.pdf|
+|[2]| *SingularityCE Admin Guide*, released 26 January 2023, https://docs.sylabs.io/guides/main/admin-guide.pdf|
\ No newline at end of file
diff --git a/content/03_cluster/2_slurm.md b/content/03_cluster/2_slurm.md
new file mode 100644
index 0000000..79943b3
--- /dev/null
+++ b/content/03_cluster/2_slurm.md
@@ -0,0 +1,76 @@
+# Slurm
+
+[Slurm Workload Manager](https://slurm.schedmd.com/documentation.html) (short: *Slurm*) is a cluster management and job scheduling system for large and small Linux clusters [1]. 
+
+In this section, we list the most important commands for submitting *Slurm jobs* on the cluster. In addition, we show a minimal example for a *Slurm job script*. For a complete documentation including all commands and covering more edge cases, we refer to [1].
+
+
+## Most important commands
+**Submit a job:**
+```{code-block} bash
+$ sbatch example.sh
+```
+The Slurm job is defined in a shell script (here: `example.sh`). In the next subsection, there is a minimal example for a Slurm job script.
+
+**Report the state of submitted jobs:**
+```{code-block} bash
+$ squeue
+```
+This command informs you, among other things, about the status of your job (e.g. *R = running*, *PD = pending*, *CA = cancelled*) and how long it has been running.
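+
+If many jobs are listed, you can restrict the output to your own jobs, for example (assuming your username is available in the environment variable `$USER`):
+
+```{code-block} bash
+$ squeue -u $USER
+```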
+
+**Cancel a job:**
+```{code-block} bash
+$ scancel <job-ID>
+```
+Each job has its own *job ID*, which you insert in place of `<job-ID>`. You get the ID of your job as terminal output when you submit the job.
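+
+For illustration, submitting and then cancelling a job might look like this; the job ID *12345* is only a placeholder:
+
+```{code-block} bash
+$ sbatch example.sh
+Submitted batch job 12345
+$ scancel 12345
+```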
+
+:::{note}
+In the Slurm documentation [1], you will find many more commands that might be helpful in some cases. In addition, some commands have additional filtering, sorting and formatting options which are also listed in the referred documentation.
+:::
+
+
+## Example Slurm job script
+Here is an exemplary Slurm job script:
+
+```{code-block} bash
+#!/bin/bash
+#SBATCH --partition=gpu
+#SBATCH --nodes=1
+#SBATCH --time=1:00:00
+#SBATCH --job-name=test_jax
+#SBATCH --ntasks-per-node=1
+#SBATCH --gres=gpu:1
+
+singularity exec --nv example.sif python example.py 
+```
+
+The first line, `#!/bin/bash`, is strictly necessary so that the shell interprets the Slurm job script correctly. The following lines define, among other things, how many resources Slurm should allocate for the job. The required resources and other job parameters are specified via options, each preceded by the `#SBATCH` directive.
+
+A list of the most important options follows:
+|Option|Description|
+|:---|:---|
+|`--partition`|Sets the partition (queue) to be used. On the KI4ALL-Cluster, only the *gpu* queue is available.|
+|`--nodes`|Sets the number of nodes to be used.|
+|`--time`|Sets the maximum time the job runs before being cancelled automatically.|
+|`--job-name`|Sets the job name. This name is also displayed when the status of the job is queried with the command `squeue` and can help with identifying your job.|
+|`--ntasks-per-node`|Sets the number of tasks to be run per node. This becomes interesting if parallel computations are to be performed on the node.|
+|`--gres`|Specifies a comma-delimited list of generic consumable resources. The format of each entry in the list is *name:count*. The name is that of the consumable resource (here: *gpu*). The count is the number of those resources (defaults to 1). When submitting Slurm jobs on the KI4ALL-Cluster, you can normally include this option unchanged in your script.|
+
+The last line defines what the job should do. In this case, the Python script `example.py` is to be executed in the Singularity container `example.sif`. An introduction to Singularity can be found in the previous section.
+
+:::{note}
+There are some other options that might be helpful for you in some cases. For a complete list of all options and their meaning, we refer to [1].
+:::
+
+
+## Further help
+In addition to the documentation [1], you can get more help for each Slurm command and a list of all available options by appending `--help` to the command in the terminal:
+```{code-block} bash
+$ <command> --help
+```
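+
+For example, `sbatch --help` lists all available options of the `sbatch` command.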
+
+
+## References:
+||Reference|
+|:---:|:---|
+|[1]| *Slurm Workload Manager Documentation*, last modified 5 October 2022, accessed 8 February 2023, https://slurm.schedmd.com/documentation.html|
\ No newline at end of file
-- 
GitLab