Kubeflow Setup
Back | Tool Guide | Sharing Guide |
Kubeflow is an open-source Cloud Native platform for machine learning.
The Kubeflow project is dedicated to making deployments of machine learning workflows on Kubernetes simple, portable and scalable, providing a straightforward way to deploy systems for ML to diverse infrastructures. Kubeflow comes with several useful components, including JupyterHub, and has support for GPU-accelerated compute. Check out the official documentation at kubeflow.org.
Page Index
- Installing/Configuring Docker and Kubernetes (nvidia-distribution)
- Installing Kubeflow with Minikube (Single Node Only!)
- Installing Kubeflow on Kubernetes (Multiple Nodes)
- Customising the Spawner GUI
- Common Issues
- GitHub Rate Limit
- Kubernetes Dashboard
- Common Dataset Volume
- JupyterLab
- No Internet access for Notebook
- Culling Idle Notebook
- Using NGC Images
Installing Kubenetes
Prerequisites:
nvidia-docker
2.0 is installed and properly configured- User is added to the Docker group (allow Docker to run without sudo)
The official install guide for Kubernetes on NVIDIA GPUs (KONG) is available from NVIDIA.
However, we will not be following the full guide here.
To allow Kubeflow to work without issues, we have to set the default container runtime to NVIDIA’s.
sudo nano /etc/docker/daemon.json
Add the following line under the first level:
"default-runtime": "nvidia",
Your file should end up looking something like this:
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime",
"runtimeArgs": []
}
}
}
Do sudo pkill -SIGHUP dockerd
to restart the Docker daemon.
To test: docker run --rm nvidia/cuda nvidia-smi
Installing Kubeflow with Minikube (Single Node Only!)
Minikube is a tool that makes it easy to run Kubernetes locally on a single node. This is very useful if, for example, you can a 4 GPU machine (e.g. DGX Station) that you are sharing among a small team, or have many infrequent users that may need to use it (e.g. in a University).
Install Minikube
Run the following commands:
curl -Lo kubectl https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl && chmod +x kubectl && sudo cp kubectl /usr/local/bin/ && rm kubectl
curl -Lo minikube https://storage.googleapis.com/minikube/releases/v0.28.2/minikube-linux-amd64 && chmod +x minikube && sudo mv minikube /usr/local/bin/
export MINIKUBE_WANTREPORTERRORPROMPT=false
export MINIKUBE_HOME=$HOME
export CHANGE_MINIKUBE_NONE_USER=true
export KUBECONFIG=$HOME/.kube/config
sudo -E minikube start --vm-driver=none --feature-gates=DevicePlugins=true
In case you get an error about Docker version:
sudo minikube delete
sudo service docker stop
sudo apt install docker-ce=18.06.1~ce~3-0~ubuntu
sudo apt install nvidia-docker2=2.0.3+docker18.06.1-1 nvidia-container-runtime=2.0.0+docker18.06.1-1
And then create a new minikube cluster:
sudo -E minikube start --vm-driver=none --feature-gates=DevicePlugins=true
If all went well, you should see the following outout:
Starting local Kubernetes v1.10.0 cluster...
Starting VM...
Getting VM IP address...
Moving files into cluster...
Setting up certs...
Connecting to cluster...
Setting up kubeconfig...
Starting cluster components...
Kubectl is now configured to use the cluster.
===================
WARNING: IT IS RECOMMENDED NOT TO RUN THE NONE DRIVER ON PERSONAL WORKSTATIONS
The 'none' driver will run an insecure kubernetes apiserver as root that may leave the host vulnerable to CSRF attacks
Create the NVIDIA daemonset (driver)
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.10/nvidia-device-plugin.yml
Output: daemonset "nvidia-device-plugin-daemonset" created
Please wait for about 30 seconds and then check if the GPUs on your node can be accessed from minikube:
kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu'
NAME GPUs
minikube 4
Install ksonnet
You might want to check the latest ksonnet release on the GitHub releases page.
wget https://github.com/ksonnet/ksonnet/releases/download/v0.13.1/ks_0.13.1_linux_amd64.tar.gz
tar -xzf ks_0.13.1_linux_amd64.tar.gz
sudo cp ks_0.13.1_linux_amd64/ks /usr/local/bin/
Install Kubeflow using ksonnet
Environment Setup
# Create a namespace for kubeflow deployment
NAMESPACE=kubeflow
kubectl create namespace ${NAMESPACE}
# https://github.com/kubeflow/kubeflow/releases
VERSION=v0.2.7
# Initialize a ksonnet app. Set the namespace for its default environment.
APP_NAME=my-kubeflow
ks init ${APP_NAME}
cd ${APP_NAME}
ks env set default --namespace ${NAMESPACE}
Install Kubeflow components
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow
ks pkg install kubeflow/core@${VERSION}
Only core
is a must to install. The rest are extras:
- tf-serving: TensorFlow Serving
- argo: Workflows for Kubernetes
- katib: Hyperparameter Tuning on Kubeflow
- seldon: Deploy Models on Kubeflow
- mpi-job: Allreduce-style distributed training
- pytorch-job: PyTorch Operator
ks pkg install kubeflow/tf-serving@${VERSION}
ks pkg install kubeflow/argo@${VERSION}
ks pkg install kubeflow/katib@${VERSION}
ks pkg install kubeflow/seldon@${VERSION}
ks pkg install kubeflow/mpi-job@${VERSION}
ks pkg install kubeflow/pytorch-job@${VERSION}
ks pkg install kubeflow/examples@${VERSION}
Final Steps
# Create templates for core components
ks generate kubeflow-core kubeflow-core
# Enable collection of anonymous usage metrics
# Skip this step if you don't want to enable collection.
ks param set kubeflow-core reportUsage true
ks param set kubeflow-core usageId $(uuidgen)
ks param set kubeflow-core jupyterHubServiceType NodePort
#ks param set kubeflow-core jupyterNotebookPVCMount "null"
# Deploy Kubeflow
ks apply default -c kubeflow-core
After running the setup steps, use kubectl get pods -n kubeflow
to check if all the pods have been succesfully created.
To expose JupyterHub on your machine’s IP address:
PODNAME=`kubectl get pods --namespace=${NAMESPACE} --selector="app=tf-hub" --output=template --template=""`
kubectl expose pod $PODNAME --type=NodePort --name tf-service --namespace kubeflow
Installing Kubeflow on Kubernetes (Multiple Nodes)
TODO
Customising the Spawner GUI
The gist of it is edit the jupyterhub-config
ConfigMap in order to change the configuration file that is loaded by JupyterHub (jupyterhub_config.py
).
You may use the Kubernetes Dashboard. You will need to change the namespace to kubeflow
in order to see the correct set of resources.
Or kubectl
:
kubectl edit configmap jupyterhub-config -n kubeflow
To change the form itself, you need to edit the HTML5 content that is returned by KubeFormSpawner._options_form_default()
.
The rest of the file controls various other JupyterHub configuration.
Remarks
- If you wish to use nano instead of the default
vim
, you can set the editor by prependingKUBE_EDITOR="nano"
to the command above. - For making complex changes to the file, it is recommended to open the ConfigMap in a proper editor, such as VS Code.
- Sample
jupyterhub_config.py
. - KubeSpawner API Reference: KubeSpawner
Common Issues
Rate Limit from GitHub
prepend a github token to the command
GITHUB_TOKEN=xxxXXXxxx ks <command>
Kubernetes Dashboard
Find out more about Kubernetes Dashboard.
- Minikube
sudo minikube dashboard &
The dashboard will be available on localhost:30000
.
- Other
Please see Kubernetes Dashboard GitHub repository.
Common Dataset Volume
To create a shared volume for datasets:
wget https://raw.githubusercontent.com/NVAITC/workstation-setup-guide/master/kubeflow_files/dataset_volume.yaml
nano dataset_volume.yaml
and edit the volume size to your requirementskubectl create -f dataset_volume.yaml -n kubeflow
Modify the jupyterhub_config.py
file in jupyterhub-config
ConfigMap:
# INSERT UNDER
# if pvc_mount and pvc_mount != 'null':
....
....
volumes.append(
{
'name': 'volume-datasets',
'persistentVolumeClaim': {
'claimName': 'claim-datasets'
}
}
)
volume_mounts.append(
{
'mountPath': pvc_mount+'/datasets',
'name': 'volume-datasets'
}
)
....
....
Using JupyterLab instead
Replace ?tree
in the URL with lab
. (Yes, it’s that easy.)
No Internet access from Jupyter Notebooks
Apply the following config. You can do so via GUI (Kubernetes Dashboard)
apiVersion: v1
kind: ConfigMap
metadata:
name: kube-dns
namespace: kube-system
data:
upstreamNameservers: |
["8.8.8.8"]
Culling Idle Notebooks
Add the following to the JupyterHub configmap.
c.JupyterHub.services = [
{
'name': 'wget-cull-idle',
'admin': True,
'command': ['wget', 'https://raw.githubusercontent.com/jupyterhub/jupyterhub/master/examples/cull-idle/cull_idle_servers.py', '-N']
},
{
'name': 'cull-idle',
'admin': True,
'command': ['python', 'cull_idle_servers.py', '--timeout=3600']
}
]
NGC Images
Coming soon