Continuous delivery and automation pipelines in machine learning with Polyaxon and Kubeflow Pipelines

Machine Learning applications are becoming popular in the tech industry. However, the process for developing, deploying, and continuously improving them is more complex than for traditional software. Continuous Delivery for Machine Learning (CD4ML) brings Continuous Delivery principles and practices to Machine Learning services.

The paper “Hidden Technical Debt in Machine Learning Systems”, by Google researchers, highlights that in real-world Machine Learning (ML) systems, only a small fraction of the system is actual ML code.

As the diagram in that paper shows, the rest of the system is composed of data collection, analysis, serving infrastructure, monitoring, and so on.

As the number of ML services at Mercari kept increasing, more and more effort went into operating the existing systems. This was preventing us from starting new things, so it was time to develop a continuous delivery (CD) system for ML services and minimize that effort. In this post, we describe that system (CD4ML). Its purpose is to automate the continuous training of models and their canary release. This canary release differs a bit from the usual definition: its purpose is to confirm not only that the system is OK but also that the model is OK, so it tends to take longer.

At Mercari, we use Polyaxon to explore Machine Learning models on Kubernetes. All the existing services therefore already had their infrastructure set up on Polyaxon, and we had only just started building an end-to-end workflow with Kubeflow.

We wanted to make as few changes as possible and to keep the new CD4ML system compatible with Mercari’s existing ML training infrastructure. So, we kept our Polyaxon training in place and integrated it with Kubeflow: Kubeflow triggers the Polyaxon components one by one and also controls scheduling.

Machine learning and deep learning assets are generally very large; at Mercari, they range from a few hundred megabytes to around 15 gigabytes. Depending on its size, we either include the asset in the Docker image or, if it is too large for that, store it in a Persistent Volume and attach the volume to the pods.

This post assumes that the assets are light enough to be embedded in the Docker image.

Situation of model update before CD4ML

Before CD4ML, we used to update a model in the following way.

Start a Polyaxon job/experiment to fetch data, train the model and store model resources in Cloud Storage.
Do an offline evaluation of the model and proceed to the next step if it satisfies the criteria.
Trigger Cloud Build and push Docker images to Container Registry.
Release the new model and start online AB testing.

This was time-consuming and involved a lot of repeated work every time we needed to update a model, so we decided to automate these steps.

We had the following requirements:

ML models in the production environment are continuously trained.
Offline evaluation of the models is done automatically and models that satisfy the predefined acceptable criteria will proceed to the next step.
The trained model is released in a canary manner automatically.

At this point, we had all the pieces, but they were scattered, and we wanted a platform to connect them.

We used Kubeflow to automate the manual process of updating a model, and we connected the steps above by passing variables to the downstream Kubeflow components.
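The chaining idea can be sketched in plain Python, with no Kubeflow or Polyaxon calls. Each function below is a hypothetical stand-in for one pipeline component, and the bucket paths, image name, and threshold are invented for illustration. The point is only that each step returns a value the next one consumes, and that the evaluation result gates the image build.

```python
from typing import Optional

def fetch_data(run_id: str) -> str:
    # Stand-in for the Polyaxon data job; returns where the data was stored.
    return f"gs://example-bucket/{run_id}/data"

def train(data_path: str) -> str:
    # Stand-in for the Polyaxon training experiment; returns the model location.
    return data_path.rsplit("/", 1)[0] + "/model"

def evaluate(score: float, threshold: float = 0.8) -> bool:
    # Stand-in for offline evaluation; the score would come from held-out data.
    return score >= threshold

def build_image(model_path: str) -> str:
    # Stand-in for the image build; tags the image with the run ID.
    return "gcr.io/example-project/server:" + model_path.split("/")[-2]

def run_pipeline(run_id: str, score: float) -> Optional[str]:
    data_path = fetch_data(run_id)
    model_path = train(data_path)
    if not evaluate(score):
        return None  # stop the workflow; nothing is built or released
    return build_image(model_path)
```

In the real pipeline, each function corresponds to a containerized component and the returned strings travel between components as pipeline variables.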

Kubeflow Pipelines + Polyaxon

Before moving on, I’d like to provide a quick overview of Kubeflow Pipelines & Polyaxon.

What is Kubeflow Pipelines?

Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers.

What is Polyaxon?

Polyaxon makes it faster, easier, and more efficient to develop machine learning and deep learning applications.

Polyaxon supports scalable and parallel hyperparameter tuning jobs on top of Kubernetes.
It allows developers to write YAML files to ensure experiment reproducibility.

Fetching data

Here is a clearer picture of what happens behind the first component:

The first component triggers a Polyaxon job that fetches raw data from various data warehouses, image storage, and so on. The job then stores the processed, training-ready data in GCS and passes a variable holding the location of that data to the next component. To make this easy, we developed a custom Kubeflow Pipelines component that authorizes a user and submits a Polyaxon job from Kubeflow Pipelines.

The output produced by Kubeflow components can be utilized in the downstream components in the following way.

component.outputs["var_name"]

Here, component is the component which writes the var_name output.
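On the producing side, a component built with the KFP v1 dsl.ContainerOp usually exposes an output by writing the value to a file inside its container, and the op declares that file under file_outputs. A minimal sketch of such a container-side script follows, assuming a hypothetical output file name and bucket path:

```python
import os
import tempfile

def write_output(path: str, value: str) -> None:
    """Write a single output value to the file the pipeline will read."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(value)

def read_output(path: str) -> str:
    """Read the value back, as a consumer of the file would."""
    with open(path) as f:
        return f.read().strip()

# Hypothetical location of the processed training data.
data_location = "gs://example-bucket/processed/2020-01-01"

# A real component would use a fixed path such as /tmp/data_location.txt;
# a temporary directory keeps this sketch runnable anywhere.
output_file = os.path.join(tempfile.mkdtemp(), "outputs", "data_location.txt")
write_output(output_file, data_location)
```

With file_outputs mapping "data_location" to that file path on the op, a downstream component would receive the value through component.outputs["data_location"].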

Training

The next component uses the variable from the preceding component to fetch the data and train on it. It stores the trained model resources at a specific location in GCS and passes that location to the succeeding step.

In this way, we keep adding components to the Kubeflow pipeline, each of which triggers a Polyaxon job or experiment, and we connect them by passing data between the Kubeflow components.

Offline evaluation

In the same way, an offline evaluation component evaluates the newly trained model and, depending on the result, the workflow either proceeds to the next steps or stops here.

How we use conditional statements to decide the direction of the workflow

We can use dsl.Condition to evaluate a conditional statement and decide on the next steps.

with dsl.Condition(is_model_valid == 1, name="Conditional component"):
    ...  # downstream components (e.g. the image build) are defined here
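The flag compared in the condition has to be produced by the evaluation component. A minimal sketch of that decision follows, assuming a hypothetical accuracy metric and hypothetical acceptance criteria: the new model must clear an absolute floor and must not regress against the production model by more than a small tolerance.

```python
def is_model_valid(new_accuracy: float,
                   prod_accuracy: float,
                   min_accuracy: float = 0.80,
                   max_regression: float = 0.01) -> int:
    """Return 1 if the new model passes offline evaluation, else 0."""
    clears_floor = new_accuracy >= min_accuracy
    no_regression = new_accuracy >= prod_accuracy - max_regression
    return 1 if (clears_floor and no_regression) else 0
```

The thresholds here are placeholders; the actual criteria are whatever the service owner defines as acceptable for the model.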

Building and pushing docker images to Container Registry

Let’s assume offline evaluation is successful; all we have left to do is to build a Docker image and push it to GCR.

Kaniko is a tool to build container images from a Dockerfile, inside a container or Kubernetes cluster.

Here is an example of a dsl.ContainerOp that uses kaniko to build an image from a Dockerfile and push it to a registry.

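import kfp.dsl

# container_registry_loc and work_dir are defined elsewhere in the pipeline.
kfp.dsl.ContainerOp(
    name="Docker image build",
    image="gcr.io/kaniko-project/executor:latest",
    arguments=[
        "--dockerfile", "docker/Dockerfile.server",
        "--destination", container_registry_loc,
        # Single braces so that work_dir is interpolated into the f-string.
        "--context", f"dir://{work_dir}",
    ],
)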
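One detail worth checking when adapting this snippet: inside a Python f-string, doubled braces are escapes that produce literal braces, so only single braces interpolate a variable. Below is a small sketch of a helper that assembles the same kaniko arguments; the helper name, image reference, and build context are hypothetical.

```python
from typing import List

def kaniko_args(dockerfile: str, destination: str, work_dir: str) -> List[str]:
    """Build the argument list for the kaniko executor image."""
    return [
        "--dockerfile", dockerfile,
        "--destination", destination,
        # Single braces interpolate work_dir; f"dir://{{work_dir}}" would
        # yield the literal text "dir://{work_dir}".
        "--context", f"dir://{work_dir}",
    ]

args = kaniko_args(
    "docker/Dockerfile.server",
    "gcr.io/example-project/server:v1",  # hypothetical image reference
    "/workspace",                        # hypothetical build context
)
```

Keeping the argument construction in a plain function also makes it easy to unit-test without starting a pipeline.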

Overall workflow and recurring run

This is how the overall workflow functions. The part inside the dotted line is configured as a recurring run in Kubeflow.

Canary pipeline

At Mercari, we use Spinnaker to deploy to our Kubernetes clusters. To trigger pipelines programmatically, we configured Spinnaker to subscribe to a Pub/Sub topic that receives an event whenever a new Docker image is pushed to Container Registry.

As a result, this pipeline automatically deploys to the canary environment whenever a new image appears in the registry.
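For reference, Container Registry publishes its notifications to a Pub/Sub topic named gcr, and each message carries a small JSON payload with an action and the image reference. Spinnaker consumes these events itself; the sketch below only illustrates the kind of filtering involved, using hypothetical project and image names and a hand-built message in the Pub/Sub push format (base64-encoded data).

```python
import base64
import json
from typing import Optional

def image_to_deploy(pubsub_message: dict, repository: str) -> Optional[str]:
    """Return the pushed image tag if the event should trigger a deploy."""
    payload = json.loads(base64.b64decode(pubsub_message["data"]))
    tag = payload.get("tag")
    # Only newly pushed, tagged images of the watched repository qualify.
    if payload.get("action") == "INSERT" and tag and tag.startswith(repository + ":"):
        return tag
    return None

# Hand-built example event; the project and image names are hypothetical.
event = {
    "data": base64.b64encode(json.dumps({
        "action": "INSERT",
        "digest": "gcr.io/example-project/server@sha256:abc123",
        "tag": "gcr.io/example-project/server:v1",
    }).encode()).decode()
}
```

Events for other repositories, or events without a tag, are ignored, so only a fresh push of the serving image would start the canary deployment.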

After releasing to canary, we route each request to either canary or production based on the experiment group the user is in. To do this, the upstream service evaluates a feature flag and sends the corresponding header to the destination service, and Istio routes the traffic based on a header match defined in a VirtualService.

The following is an example of a VirtualService that we use for this routing.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: virtualservice
  namespace: namespace-name
spec:
  hosts:
  - service-name.namespace-name.svc.cluster.local
  http:
  - match:
    - headers:
        version:
          exact: "canary"
    route:
    - destination:
        host: dest-canary-service.dest-namespace.svc.cluster.local
        port:
          number: 80
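The upstream side of this routing can be sketched as follows. The bucketing scheme (hashing the user ID into 100 buckets) and the canary percentage are assumptions for illustration, not necessarily how the feature flag is evaluated at Mercari. Only the version: canary header and the canary host come from the VirtualService above.

```python
import hashlib

def in_canary_group(user_id: str, canary_percent: int) -> bool:
    """Deterministically assign a user to the canary group."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100 < canary_percent

def routing_headers(user_id: str, canary_percent: int = 10) -> dict:
    """Headers the upstream service attaches to its request."""
    if in_canary_group(user_id, canary_percent):
        return {"version": "canary"}  # matched by the VirtualService rule
    return {}  # no header: traffic follows the default route

def destination(headers: dict) -> str:
    """Mirror of the header match: exact "canary" goes to the canary service."""
    if headers.get("version") == "canary":
        return "dest-canary-service.dest-namespace.svc.cluster.local"
    return "default route"
```

Hashing the user ID keeps a given user in the same group across requests, which matters when the canary is also being used to judge model quality over time.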

Future plans

The current version of CD4ML covers automated ML training and deployment to the canary environment.

In the future, we intend to enhance the project with more functionality. One such feature is:

Automated online A/B testing: We plan to automate the online A/B testing and roll out the winning model to all of our users automatically.

If building a large-scale distributed system interests you, consider applying for a role at Mercari. https://www.mercari.com/careers/

Acknowledgements

Special thanks to Kosuke Arase and Shotaro Kohama for their valuable support and making this project a fabulous success!
