Part 3: Upgrading the platform#

To upgrade the platform, download and extract the Intel® Geti™ installation package on your Intel® Geti™ server.

Warning

The data folder should not change after installing the Intel® Geti™ platform. Otherwise, you will be unable to view data (project list, dataset, model) in the platform UI.

Warning

Upgrading from Intel® Geti™ 1.8.* to 2.0.x may take several hours. Please do not interrupt this process. Make sure to give it enough time to finish.

Warning

Models that were originally trained using version 1.5.x or earlier and have not been retrained in version 1.8.x will not be automatically migrated to version 2.0.x. To upgrade these models for use in version 2.0.x, they must be retrained in the 2.0.x environment.

tar -xf package.tar.gz

Go into the folder created by the previous command:

cd platform_<VERSION>

Now, you can choose between two modes of upgrading:

  • with the wizard - where the installer asks the user for all the data needed during the upgrade process

  • with a configuration file - where the user must fill in all the data prior to running the installer

The two upgrade modes are described below.

Upgrade with wizard#

To start upgrading the platform with the wizard, run the following command:

sudo ./platform_installer upgrade

You will be asked to provide the password if you are not logged in as a passwordless sudo user.

When running in the wizard mode, the installer will prompt for the following data:

  1. Grafana stack configuration

Do you want to install the Grafana stack (not recommended on setups meeting only the minimum HW requirements)? [y/N]

If the answer is ‘y’, the Grafana stack is installed on the platform. If the answer is ‘N’, the Grafana stack is not installed on the platform.

  2. Backup configuration

The installer checks whether there is enough space to perform the data folder backup. If there is not, it will ask whether the backup should be skipped:

You can either skip the backup (highly not recommended), or provide a path to a folder to copy the data.
Do you want to skip the backup? [y/N]

If the answer is ‘y’, the backup of the data folder will be skipped. If the answer is ‘N’, the backup will be performed, and the installer will prompt for the new backup location:

Path to folder to copy the backup data:

The installer checks whether there is enough space under the provided backup location (a simple manual check is shown after this list).

  3. Final confirmation

The installer displays the data it gathered and asks for confirmation. If it is given, it starts the upgrade process. If not, the upgrade is aborted.

  4. The installer takes the provided kubeconfig and starts the upgrade process.

Watch for warnings or error messages displayed during the process and act accordingly.
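
If you want to estimate the required space yourself before starting the upgrade, you can compare the size of the data folder with the free space at the candidate backup location. A minimal manual check, assuming /opt/geti/data is the data folder and /mnt/backup is the candidate backup location (both paths are only examples):

du -sh /opt/geti/data
df -h /mnt/backup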

Upgrade with configuration file#

If the --config-file parameter is provided, the installer starts the upgrade without user interaction.

The configuration file is a YAML file with all configuration data specified in advance. A template for the file can be found in the platform folder of the installation package and has the following initial content:

# Specifies whether the Grafana stack will be installed on the platform.
# Not recommended on setups meeting only the minimum HW requirements.
# Allowed values: true or false
# If set to true, the Grafana stack will be installed on the platform.
# If set to false, or entry is missing, the Grafana stack will not be installed on the platform.
grafana_enabled:
# Specifies whether the backup should be skipped, if the available space in the data folder is insufficient to create the backup.
# If the data folder available space is sufficient, skip_backup and backup_location are ignored.
# Allowed values: true or false
# If set to true, the backup will be skipped (highly not recommended).
# If set to false, the backup will not be skipped and the `backup_location` has to be provided.
skip_backup:
# Specifies the new backup location, if the available space in the data folder is insufficient and the backup is not skipped.
# Please refrain from removing the backup location until the upgrade process is finished.
backup_location:

Sample file for the upgrade:

grafana_enabled: false
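
For reference, a fuller sample that also covers the backup-related keys could look like the following; the backup path is only an illustration and must point to a location with enough free space:

grafana_enabled: false
skip_backup: false
backup_location: /mnt/backup/geti_upgrade

To start the upgrade with a configuration file, pass it to the installer via the --config-file parameter, for example:

sudo ./platform_installer upgrade --config-file my_config.yaml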

Upgrading the Intel® Geti™ platform offline#

To run the offline upgrade, first collect all the offline data, and then execute the offline upgrade.

K3s offline upgrade#

Follow the steps for air-gap upgrade to upgrade k3s to v1.29.3+k3s1.

When running the script from https://get.k3s.io again (referred to in the command below as <k3s_script_name>; point 2 of the steps above), provide the following environment variables:

sudo INSTALL_K3S_SKIP_DOWNLOAD=true INSTALL_K3S_VERSION=v1.29.3+k3s1 INSTALL_K3S_EXEC="--disable traefik --disable runtimes
       --service-node-port-range 30000-30050
       --kube-apiserver-arg=enable-admission-plugins=NodeRestriction,ServiceAccount
       --kube-apiserver-arg=request-timeout=5m0s
       --kube-apiserver-arg=audit-log-path=/var/lib/rancher/k3s/server/logs/audit.log
       --kube-apiserver-arg=tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
       --kube-apiserver-arg=audit-log-maxbackup=10
       --kube-controller-manager-arg=terminated-pod-gc-threshold=1000
       --kube-controller-manager-arg=leader-elect-lease-duration=30s
       --kube-controller-manager-arg=leader-elect-renew-deadline=20s
       --kube-controller-manager-arg=bind-address=127.0.0.1
       --kube-scheduler-arg=bind-address=127.0.0.1
       --kubelet-arg=streaming-connection-idle-timeout=5m
       --kubelet-arg=rotate-server-certificates=true
       --kubelet-arg=eviction-hard=imagefs.available<5Gi,memory.available<100Mi,nodefs.available<5Gi,nodefs.inodesFree<5%
       --kubelet-arg=eviction-minimum-reclaim=imagefs.available=1Gi,nodefs.available=1Gi
       --kubelet-arg=image-gc-low-threshold=0" ./<k3s_script_name>
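
After the script finishes, you can optionally confirm that the node reports the expected Kubernetes version, for example:

sudo k3s kubectl get nodes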

Requirements#

Make sure you have installed:

  • skopeo

  • the same version of Python on both machines - the one with access to the Internet and the one where the installer is executed

  • python3-venv

  • python3-pip (ensure that pip is the same version on both machines - the one with access to the Internet and the one where the installer is executed; a quick way to compare the versions is shown below)

  • the other prerequisites listed in the SW Requirements section of this document still apply
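
A quick way to compare the Python and pip versions is to print them on each machine and verify that the outputs match, for example:

python3 --version && pip3 --version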

For Ubuntu 20.10 or newer, run:

sudo apt install skopeo

For Ubuntu 20.04 or lower, skopeo must be built and installed manually:

  1. Install Go for Linux using these instructions.

  2. Install the following libraries:

sudo apt update -y

sudo apt install -y libgpgme-dev libassuan-dev libbtrfs-dev libdevmapper-dev pkg-config go-md2man python3-venv python3-pip

  3. If the GOPATH environment variable is not set, create a GOPATH directory (for example, create a go directory in your home directory and set the GOPATH variable to $HOME/go using the command below):

export GOPATH=$HOME/go

  4. Download skopeo:

cd $GOPATH && wget https://github.com/containers/skopeo/archive/refs/tags/v1.4.1.tar.gz && tar -xvzf v1.4.1.tar.gz

  5. Install skopeo:

cd skopeo-1.4.1/ && make bin/skopeo && sudo PATH=$PATH GOPATH=$GOPATH make install
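
Once the build completes, you can confirm that skopeo is available on the PATH and check its version:

skopeo --version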

Offline upgrade#

  1. Migrate the Intel® Geti™ installation package and extract it on your offline server:

tar -xf package.tar.gz

  2. Go into the folder created by the previous command:

cd platform_<VERSION>/installer

  3. Execute the ./prepare_offline.sh script with root privileges:

sudo ./prepare_offline.sh

It creates a list of the required dependencies that have to be installed to run the offline upgrade.

  4. Copy the platform_<VERSION> folder containing the list generated in the previous step to the machine with access to the Internet.

  5. Execute the download_offline.sh script on the machine with access to the Internet:

sudo ./download_offline.sh

The script is in the folder copied in the previous step and it downloads all required dependencies.

  6. Copy the platform_<VERSION> folder from the machine with access to the Internet to the machine from which the installation script will be run.

  7. Configure the OFFLINE_DEPENDENCIES_LOCATION environment variable, providing the absolute path to the prepared directory with dependencies (e.g., export OFFLINE_DEPENDENCIES_LOCATION=/home/user/platform_<VERSION>/installer/offline). Dependencies are kept by default in the platform_<VERSION>/installer/offline folder.

Now, you can choose between two modes of upgrading:

  • with the wizard - where the installer asks the user for all the data needed during the upgrade process

  • with a configuration file - where the user must fill in all the data prior to running the installer

To enter the wizard mode, execute:

sudo ./platform_installer upgrade-offline

To pass the parameters via a configuration file, execute:

sudo ./platform_installer upgrade-offline --config-file my_config.yaml

You will be asked to provide the password if you are not logged in as a passwordless sudo user. The input requested by the wizard, or provided in the configuration file, is the same as for the regular upgrade (described in the sections above).

NVIDIA Container Toolkit offline upgrade#

The NVIDIA Container Toolkit enables users to build and run GPU-accelerated containers. The toolkit includes a container runtime library and utilities to automatically configure containers to leverage NVIDIA GPUs. The steps below describe how to upgrade the container toolkit in an offline environment.

Step 1: Preparing the packages

  1. Configure the NVIDIA production repository:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

  2. Update the package list from the repository:

sudo apt-get update -y

  3. Download the NVIDIA Container Toolkit packages:

sudo apt-get install -y --allow-downgrades --download-only nvidia-container-toolkit=1.14.5-1 nvidia-container-toolkit-base=1.14.5-1

  4. Package the downloaded apt packages:

tar -cvzf nvidia-container-toolkit.tar.gz /var/cache/apt/archives/nvidia-container-toolkit*
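
If you want to double-check that the expected .deb files were downloaded into the apt cache before transferring the archive, you can list them, for example:

ls /var/cache/apt/archives/ | grep nvidia-container-toolkit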

Step 2: Offline installation

  1. Transfer nvidia-container-toolkit.tar.gz to your offline server.

  2. Extract the package archive:

sudo tar -xvzf nvidia-container-toolkit.tar.gz -C /

  3. Install the packages using apt:

sudo apt-get install -y nvidia-container-toolkit=1.14.5-1 nvidia-container-toolkit-base=1.14.5-1

  4. Configure the NVIDIA Container Runtime. Edit /etc/nvidia-container-runtime/config.toml and make sure the following settings are set as shown below:

accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false

  5. Restart the k3s instance:

sudo systemctl restart k3s
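
To verify the installed toolkit version after the restart, you can query the NVIDIA Container Toolkit CLI (the exact output format may differ between versions):

nvidia-ctk --version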

Troubleshooting#

  1. If you encounter the error ‘Due to the dpkg being busy it’s impossible to finish the installation’, stop your system update and then rerun the installer.

  2. If a user forgets his/her password on an instance that does not have an SMTP (mail) server configured, run the ami_update_password.sh script from the installation package to generate a new password for the user.

  3. If, in rare cases, your installation or upgrade fails with apt-get related errors found in the ansible.log, ensure that apt-get is configured properly by running:

sudo apt-get update

and then rerun `platform_installer`.

  4. To investigate problems during the installation, the following logs are available:

Location                          Contents
platform_logs/install.log         Installation, gathering data, executing external tools
platform_logs/k3s_install.log     Installation of k3s
platform_logs/k3s_uninstall.log   Uninstallation of k3s
platform_logs/ansible.log         Platform installation/upgrade/uninstallation
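
If you need to follow a long-running installation or upgrade step live, you can also tail the relevant log file while the installer is running, for example:

tail -f platform_logs/ansible.log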