k8s stands for Kubernetes. A few Kubernetes-related projects, cautionary tales and discussions are listed below.
talos and k8s
https://www.talos.dev/v1.1/introduction/getting-started/
https://www.siderolabs.com/platform/talos-os-for-kubernetes/ https://www.siderolabs.com/blog/launching-talos-systems/ https://kubernetespodcast.com/episode/159-talos/
alpine and k8s
https://dev.to/xphoniex/how-to-create-a-kubernetes-cluster-on-alpine-linux-kcg https://techviewleo.com/install-kubernetes-on-alpine-linux-with-k3s/
coreos
https://medium.com/@todd_78449/goodbye-alpine-linux-my-dear-friend-23ca0f30a6a8
containeros
flatcar
photonos
k3os
k3s, k0s, kind and microk8s
There are certified distributions that are not too resource hungry, which matters especially if you need to self-host clusters, for example kind (kind.sigs.k8s.io), k3s (https://k3s.io/), k0s (https://k0sproject.io/) and microk8s. Excellent comparisons can be found here: https://blog.flant.com/small-local-kubernetes-comparison/ https://blog.radwell.codes/2021/05/best-kubernetes-distribution-for-local-environments/
kubeflow
Kubeflow makes deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. The goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures.
gke autopilot
Polyaxon (https://polyaxon.com/) deployed on GKE can be used to queue jobs.
https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview
floydhub
neuraxle
clearml
Auto-magical CI/CD to streamline an ML workflow. Experiment manager, MLOps and data management.
https://github.com/allegroai/clearml
mlflow
Open source platform for the machine learning lifecycle; one part of it is an experiment manager. https://mlflow.org/
trains
One part is experiment tracking similar to MLflow, and one part (trains-agent) helps schedule jobs onto GPU servers.
kubernetes work queue
Use a Kubernetes work queue and just have users submit jobs to it.
Redis or RabbitMQ can be used to store the queue, and k8s can then be used to spawn jobs that consume it.
https://kubernetes.io/docs/tasks/job/fine-parallel-processing-work-queue/
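The work-queue pattern above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the standard-library queue.Queue stands in for Redis or RabbitMQ, each thread stands in for one pod of a Kubernetes Job, and run_workers and handle are hypothetical names, not part of any Kubernetes or Redis API.

```python
import queue
import threading


def run_workers(items, handle, num_workers=3):
    """Drain a shared work queue with several workers and collect results."""
    work = queue.Queue()  # stand-in for a Redis list or RabbitMQ queue
    for item in items:
        work.put(item)

    results = []
    lock = threading.Lock()

    def worker():
        # Each worker plays the role of one pod of a Kubernetes Job:
        # pull an item, process it, and exit once the queue is empty.
        while True:
            try:
                item = work.get_nowait()
            except queue.Empty:
                return
            out = handle(item)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Ten work items processed by three workers; completion order is not
# guaranteed, so the results are sorted before use.
squares = sorted(run_workers(range(10), lambda n: n * n))
```

In a real cluster the queue would live outside the workers, so that a pod that crashes does not take the remaining items with it.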
apache airflow
https://airflow.apache.org/docs/stable/integration.html#gcp
minikube
minikube quickly sets up a local Kubernetes cluster on macOS, Linux, and Windows. It needs a container or virtual machine manager, such as Hyperkit, Hyper-V, KVM or Podman. It runs on Linux arm64 and, with tweaks, on Alpine.
In the case of kind, k3d, and minikube, you can go for one Linux VM (for a basic cluster). As an alternative to minikube, you can run Talos directly in Docker or QEMU with “talosctl”. More importantly, minikube can cache images for offline use: https://minikube.sigs.k8s.io/docs/handbook/offline/
dkube
dkube is a commercial end-to-end MLOps platform.
nvidia deepops
https://github.com/NVIDIA/deepops
podman
https://docs.podman.io/en/latest/ podman can be installed on alpine and talos.
portainer and rancher
Rancher (https://rancher.com/) or Portainer (https://www.portainer.io/) can be used for easier management and/or dashboard functionality. For example, you can create a deployment through the UI by following a wizard that also offers configuration you might want to use (e.g. resource limits), and then later retrieve the YAML manifest. They also make working with pre-made Helm charts (packages) easier.
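To give an idea of what such a retrieved manifest contains, here is a minimal sketch that builds a Deployment with resource limits as a Python dict and prints it as JSON (kubectl accepts JSON as well as YAML). The function deployment_manifest, the name "web", the image tag and the limit values are placeholders chosen for illustration, not output from Rancher or Portainer.

```python
import json


def deployment_manifest(name, image, replicas=1,
                        cpu_limit="500m", memory_limit="256Mi"):
    """Build a minimal apps/v1 Deployment with resource limits."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels on the pod template.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "resources": {
                                "limits": {
                                    "cpu": cpu_limit,
                                    "memory": memory_limit,
                                },
                            },
                        }
                    ]
                },
            },
        },
    }


# Placeholder values: two replicas of an nginx container.
manifest = deployment_manifest("web", "nginx:1.25", replicas=2)
print(json.dumps(manifest, indent=2))
```

Saved to a file, the printed JSON could be applied with kubectl apply -f, which is roughly what the UI wizard does on your behalf.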
hn discussions
https://news.ycombinator.com/item?id=31795160
https://news.ycombinator.com/item?id=31845746 https://news.ycombinator.com/item?id=27919105
is slurm still relevant in a k8s age?
https://www.determined.ai/blog/slurm-lacking-deep-learning
https://www.run.ai/guides/slurm
https://www.dkube.io/blog/2021-04-06/
failure stories
https://news.ycombinator.com/item?id=32596816
https://news.ycombinator.com/item?id=20163500