How we optimized our Machine Learning Training Infrastructure Costs

At Mercari US, we use Polyaxon and Kubeflow Pipelines to run our machine learning pipelines. Both Kubeflow and Polyaxon are deployed on a Kubernetes cluster dedicated to Machine Learning.

Several Machine Learning teams use this cluster and it has a sporadic workload. This often leads to a significant increase in the number of nodes for a short period of time followed by a downscaling of nodes with the help of Cluster Autoscaler.

‍

What is Cluster Autoscaler?

Cluster autoscaler is responsible for adjusting the size of a Kubernetes cluster by increasing or decreasing the number of nodes.

Nodes are increased when there are pending pods due to insufficient resources, and decreased when nodes are underutilized and pods can be placed on other nodes.

Instead of using a fixed number of nodes with a Kubernetes, cluster autoscaler dynamically scales the number of nodes depending on the resource requests for workloads. This leads to a significant decrease in infrastructure costs when the workloads are sporadic, such as machine learning training.

‍

Challenges of Rapid Downscaling

We noticed that we were able to upscale our ML cluster quickly, but weren’t able to downscale as quickly, leading to wastage of resources and increasing our infrastructure costs.

The MLPlatform Team at Mercari US, uses a managed Google Kubernetes Engine(GKE) cluster with several managed components including the cluster autoscaler. By investigating the logs of our cluster autoscaler, we found the following types of pods were preventing our cluster autoscaler from working:

Managed kube-system pods without Pod Disruption Budgets (PDB)
Pods not backed by a controller object (not created using Deployment/Stateful Set etc)
Pods with local storage (using hostpath or empty dir)

‍

Solution

The solution was to use PDBs to allow ClusterAutoscaler to evict pods, isolate some deployments into a separate smaller “control-plane” node pool and use Gatekeeper Assign CRDs to select the “control-plane” node pool.

‍

PDB

For some kube-system deployments we could simply add a PDB.

For example, kube-dns deployments have several replicas. The number of replicas is configured by another deployment kube-dns-autoscaler.

Due to redundancy it is safe to add a PDB, which allows the Cluster Autoscaler to downscale nodes by removing replicas.

apiVersion: policy/v1

kind: PodDisruptionBudget

metadata:

name: kube-dns-pdb

namespace: kube-system

spec:

minAvailable: 50%

selector:

matchLabels:

k8s-app: kube-dns

‍

However there are several deployments which only run a single replica. For example the deployment metric-server collects metrics from kubelet and exposes it to Kubernetes API Server. This is then used by Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA) to scale up various other deployments on the cluster.

If a single replica of metric-server is running on the cluster and a PDB is set on it, during downscaling of the cluster, the replica could be evicted by Cluster Autoscaler. Until a new metric-server is spun up and metrics from kubelets are scrapped, there will be no metrics available in Kubernetes API Server. If traffic increased during this period, it is possible, HPA or VPA would not be able to scale up due to the absence of metrics, possibly leading to downtime.

It is not always possible to run multiple copies of a deployment to have redundancy. Running multiple replicas of a controller could lead to undesirable effects if requests are handled by both replicas. In instances like this, adding a PDB, could make the controller unavailable.

‍

Control Plane Node Pool

There were several managed deployments in the kube-system namespace which had a single replica, so we isolated them into a separate smaller and cheaper node pool with minimal resources, called control-plane node pool. The control-plane node pool was then labeled purpose:control-plane, so we could use node selectors to select this particular node pool.

The control-plane node pool consisted of workloads which ran forever. Scaling this node pool was no longer a concern due to almost static workloads. We moved some of the other system workloads, which didn’t cause any auto scaling issues to this node pool, to further increase overall system stability.

‍

Gatekeeper

While we were able to move some of our workloads to the control-plane node pool. For some of our other applications we were unable to set node selectors, since the options to do so weren’t exposed to end-users.

For example, enabling AI Platform Pipelines on GKE creates a cloudsqlproxy deployment with local storage(empty dir) which blocks cluster autoscaler. However since the deployment is managed by GKE, there is no option to set the node selector on this deployment, other than overwriting the deployment manifest manually. This would only last until an update is made to AI Platform Pipelines, which would not only update the application but also revert the node selector patch.

Thus, we needed a way to add node selectors to a pod without modifying the deployment. A mutating webhook configured to validate and mutate pods was one of the ways to do it. Once a mutating webhook is deployed, even if changes were made to the managed applications/deployments, the mutating webhook would patch the new pod and add the required node selector.

Instead of creating a custom mutating webhook we decided to use Gatekeeper through Policy Controller. Gatekeeper is a policy management tool mainly used for enforcing policies in a multi-tenant cluster, however it has minimal support for mutating workloads as well. Since our patches were a set of static rules, Gatekeeper’s Assign CRD was perfect for our use case.

The following diagram shows how pods are mutated by Gatekeeper’s Mutating Webhook.

‍

Deployments Create Requests are created by Managed Applications(Helm Charts/GKE managed applications) without exposing configuration options for setting node selectors. The deployment controller then creates replica sets, which in turn creates Pod Create Requests. The Pod Create Requests are mutated by Gatekeeper Mutating Webhook to add a node selector. The kube-scheduler then schedules these workloads onto the control-plane node pool.

The following manifest shows configuration for Assign CRD deployed for adding a node selector for single replica, GKE managed deployments in the Kubesystem Namespace.

‍

# Target: Deployments with one replica in Kubesystem Namespace

# Reason: Preventing Cluster Autoscaler,

# since the pod is a part of Kubesystem Namespace

apiVersion: mutations.gatekeeper.sh/v1alpha1

kind: Assign

metadata:

name: control-plane-mutator-for-kubesystem

spec:

applyTo:

- groups: [""]

kinds: ["Pod"]

versions: ["v1"]

match:

namespaces: ["kube-system"]

labelSelector:

matchExpressions:

- key: k8s-app

operator: In

values:

- glbc

- event-exporter

- konnectivity-agent-autoscaler

- kube-state-metrics

- kube-dns-autoscaler

- metrics-server

location: "spec.nodeSelector.purpose"

parameters:

assign:

value: control-plane

‍

Conclusion

In large enterprises, Platform Teams are often tasked with managing Infrastructure costs. Efficient use of compute resources, such as Specialized large servers used for Machine Learning Training, is therefore essential.

At Mercari US, by using a combination of PDBs, small “control-plane” node pools and mutating webhooks we were able to improve cluster downscaling. Earlier we had several pods with local storage and Kubesystem pods preventing downscaling of some of our large nodes, as a result the daily infrastructure costs for the GKE cluster were high. After applying these techniques we were able to reduce the cost significantly. In the future, we will be applying these techniques to some of our non ML Clusters as well.

‍

About the Author

Abhishek Vilas Munagekar is a software engineer on the Machine Learning Platform Team at Mercari US. Abhishek joined Mercari Japan in 2018 as a Machine Learning Engineer (MLE), working on Counterfeit Item Detection using Machine Learning. He has experience applying machine learning techniques for image and video processing applications and has recently moved over to focus on the development of Machine Learning Platform for building and serving Machine Learning solutions at Mercari.

‍