k8s stands for Kubernetes. A few Kubernetes-related projects, cautionary tales, and discussions are listed below.

talos and k8s

https://www.talos.dev/v1.1/introduction/getting-started/

https://www.siderolabs.com/platform/talos-os-for-kubernetes/
https://www.siderolabs.com/blog/launching-talos-systems/
https://kubernetespodcast.com/episode/159-talos/

alpine and k8s

https://dev.to/xphoniex/how-to-create-a-kubernetes-cluster-on-alpine-linux-kcg
https://techviewleo.com/install-kubernetes-on-alpine-linux-with-k3s/

coreos

https://medium.com/@todd_78449/goodbye-alpine-linux-my-dear-friend-23ca0f30a6a8

containeros

flatcar

photonos

k3os

k3s, k0s, kind and microk8s

There are certified distributions that are not too resource-hungry, which matters if you need to self-host clusters: for example kind (kind.sigs.k8s.io), k3s (https://k3s.io/), k0s (https://k0sproject.io/) and microk8s. Excellent comparisons can be found here:

https://blog.flant.com/small-local-kubernetes-comparison/
https://blog.radwell.codes/2021/05/best-kubernetes-distribution-for-local-environments/

kubeflow

Kubeflow makes deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. The goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures.
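
As a rough illustration (not taken from the Kubeflow docs), a minimal two-step pipeline using the kfp SDK's v1-style ContainerOp API might look like the sketch below; the pipeline name, images, and commands are placeholders.

```python
import kfp
from kfp import dsl

# Placeholder step: runs a container image with a trivial command.
def step(name, message):
    return dsl.ContainerOp(
        name=name,
        image="python:3.10-slim",
        command=["python", "-c", f"print('{message}')"],
    )

@dsl.pipeline(name="demo-pipeline", description="Minimal two-step sketch")
def demo_pipeline():
    preprocess = step("preprocess", "preprocessing data")
    train = step("train", "training model")
    train.after(preprocess)  # run training only after preprocessing finishes

if __name__ == "__main__":
    # Compile to a package that can be uploaded to a Kubeflow Pipelines instance.
    kfp.compiler.Compiler().compile(demo_pipeline, "demo_pipeline.yaml")
```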

gke autopilot

Polyaxon (https://polyaxon.com/) deployed on GKE can be used to queue and schedule experiments.

https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview

floydhub

neuraxle

clearml

Auto-magical CI/CD to streamline ML workflows: experiment manager, MLOps, and data management.

https://github.com/allegroai/clearml

https://clear.ml/docs
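
A minimal sketch of the experiment-manager side, assuming the clearml package is installed and configured (clearml-init); the project, task, and queue names below are made up.

```python
from clearml import Task

# Project, task, and queue names are placeholders.
task = Task.init(project_name="demo", task_name="baseline")
task.connect({"lr": 0.01, "epochs": 5})        # log hyperparameters
logger = task.get_logger()
for epoch in range(5):
    logger.report_scalar(title="loss", series="train",
                         value=1.0 / (epoch + 1), iteration=epoch)
# Instead of running locally, the task can be handed off to a clearml-agent queue:
# task.execute_remotely(queue_name="default")
```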

mlflow

Open-source platform for the machine learning lifecycle; one part of it is an experiment tracker. https://mlflow.org/
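
A minimal tracking sketch with the mlflow package; the experiment name, parameters, and artifact path are placeholders.

```python
import mlflow

mlflow.set_experiment("demo")                  # experiment name is a placeholder
with mlflow.start_run():
    mlflow.log_param("lr", 0.01)
    for step in range(5):
        mlflow.log_metric("loss", 1.0 / (step + 1), step=step)
    # mlflow.log_artifact("model.pkl")         # would upload a local file as an artifact
```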

trains

Trains is the predecessor of ClearML: one part experiment tracking similar to MLflow, and one part (trains-agent) that helps schedule jobs onto GPU servers.

kubernetes work queue

Use a Kubernetes work queue and have workers pull jobs from it (see the worker sketch below).

Redis or RabbitMQ can be used to store the queue, and Kubernetes Jobs can then spawn worker Pods that consume from it.

https://kubernetes.io/docs/tasks/job/fine-parallel-processing-work-queue/
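
A sketch of such a worker, assuming a Redis Service reachable as "redis" and a list key named "jobs" (both placeholders); a Kubernetes Job with parallelism greater than one would run several copies of it.

```python
# worker.py: each Pod of a Kubernetes Job runs this loop and exits once the queue is empty.
import os
import redis

# Assumes a Redis Service named "redis" in the same namespace; override via REDIS_HOST.
r = redis.Redis(host=os.environ.get("REDIS_HOST", "redis"), port=6379)

while True:
    item = r.lpop("jobs")      # "jobs" is a placeholder list key
    if item is None:
        break                  # queue drained; let the Job complete
    payload = item.decode()
    print(f"processing {payload}")
    # ... do the actual work for this item here ...
```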

apache airflow

https://airflow.apache.org/docs/stable/integration.html#gcp
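
A minimal Airflow 2.x DAG sketch (not GCP-specific); the dag_id, schedule, and commands are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# dag_id, schedule, and commands are placeholders.
with DAG(
    dag_id="train_model",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    prepare = BashOperator(task_id="prepare", bash_command="echo preparing data")
    train = BashOperator(task_id="train", bash_command="echo training model")
    prepare >> train               # train runs only after prepare succeeds
```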

minikube

minikube quickly sets up a local Kubernetes cluster on macOS, Linux, and Windows. It needs a container or virtual machine manager such as HyperKit, Hyper-V, KVM, or Podman. It runs on Linux arm64 and, with some tweaks, on Alpine.

In the case of kind, k3d, and minikube, a single Linux VM is enough for a basic cluster. Alternatively to minikube, you can also run Talos directly in Docker or QEMU with "talosctl". More importantly, minikube can cache images for offline use: https://minikube.sigs.k8s.io/docs/handbook/offline/
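
Once a local cluster is up (minikube, kind, k3d, ...), a quick sanity check with the official Kubernetes Python client could look like this; it assumes `minikube start` (or equivalent) has already written a kubeconfig.

```python
from kubernetes import client, config

config.load_kube_config()          # reads the current context from ~/.kube/config
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    print(node.metadata.name, node.status.node_info.kubelet_version)

for pod in v1.list_pod_for_all_namespaces().items:
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)
```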

dkube

dkube (https://www.dkube.io/) is a commercial end-to-end MLOps platform.

nvidia deepops

https://github.com/NVIDIA/deepops

podman

https://www.techtarget.com/searchitoperations/feature/Podman-A-worthy-alternative-to-Docker-for-containers

https://docs.podman.io/en/latest/

Podman can be installed on Alpine and Talos.

portainer and rancher

Rancher (https://rancher.com/) or Portainer (https://www.portainer.io/) provide easier management and/or dashboard functionality. For example, you can create a deployment through the UI by following a wizard that also offers configuration you might want to use (e.g. resource limits) and later retrieve the YAML manifest. They also make working with pre-made Helm charts (packages) easier.
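
For comparison, the kind of Deployment such a wizard produces (including resource limits) can also be created with the official Kubernetes Python client; all names, images, and limits below are placeholders.

```python
from kubernetes import client, config

config.load_kube_config()

# All names, images, and limits are placeholders.
container = client.V1Container(
    name="web",
    image="nginx:1.25",
    resources=client.V1ResourceRequirements(
        requests={"cpu": "100m", "memory": "128Mi"},
        limits={"cpu": "500m", "memory": "256Mi"},
    ),
)
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="demo-web"),
    spec=client.V1DeploymentSpec(
        replicas=2,
        selector=client.V1LabelSelector(match_labels={"app": "demo-web"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "demo-web"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```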

hn discussions

https://news.ycombinator.com/item?id=31795160

https://news.ycombinator.com/item?id=31845746
https://news.ycombinator.com/item?id=27919105

is slurm still relevant in a k8s age?

https://www.determined.ai/blog/slurm-lacking-deep-learning

https://www.run.ai/guides/slurm

https://www.dkube.io/blog/2021-04-06/

failure stories

https://k8s.af/

https://news.ycombinator.com/item?id=32596816

https://news.ycombinator.com/item?id=20163500

https://news.ycombinator.com/item?id=18953647

https://news.ycombinator.com/item?id=26106080