Automating the end-to-end lifecycle of Machine Learning applications
Machine Learning applications are becoming increasingly popular in the tech industry. However, the process for developing, deploying, and continuously improving them is more complex than for traditional software. Continuous Delivery for Machine Learning (CD4ML) brings Continuous Delivery principles and practices to Machine Learning services.
In “Hidden Technical Debt in Machine Learning Systems”, a paper by Google, the authors highlight that in real-world Machine Learning (ML) systems, only a small fraction of the system is actual ML code.
As the number of ML services at Mercari kept increasing, more and more effort went into operating the existing systems. This was preventing us from starting new things, so it was time to develop a continuous delivery (CD) system for ML services and minimize that effort. In this post, we describe that system, CD4ML. Its purpose is to automate the continuous training of models and their canary release. Our canary release differs a little from the usual definition: the purpose is to confirm not only that the system is healthy but also that the model performs well, so it tends to take longer.
At Mercari, we use Polyaxon for exploration of Machine Learning models on Kubernetes. All the existing services therefore already had their training infrastructure set up on Polyaxon, and we had only just started building an end-to-end workflow with Kubeflow.
We wanted to make as few changes as possible and keep the new CD4ML system compatible with Mercari’s existing ML training infrastructure. So, we kept our Polyaxon training in place and integrated it with Kubeflow by triggering the Polyaxon jobs one by one from Kubeflow. Scheduling is also controlled from Kubeflow.
Machine Learning and Deep Learning assets are generally very large. At Mercari, they range from a few hundred megabytes to around 15 gigabytes. Depending on the size of the asset, we either include it in the Docker image or, if it is too large for that, store it in a Persistent Volume and attach the volume to the pods.
For now, this post assumes that the assets are small enough to be embedded inside the Docker image.
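The size-based choice described above can be sketched as a small helper. This is only an illustration: the 2 GiB cutoff and the names `EMBED_LIMIT_BYTES` and `choose_asset_placement` are assumptions, since the post does not state where the line between "embed" and "Persistent Volume" is drawn.

```python
# Hypothetical cutoff: the post only says that "huge" assets go to a
# Persistent Volume, not at which size the switch happens.
EMBED_LIMIT_BYTES = 2 * 1024**3  # 2 GiB, assumed for illustration


def choose_asset_placement(asset_size_bytes: int,
                           embed_limit_bytes: int = EMBED_LIMIT_BYTES) -> str:
    """Return where a model asset of the given size would be stored."""
    if asset_size_bytes < 0:
        raise ValueError("asset size cannot be negative")
    if asset_size_bytes <= embed_limit_bytes:
        # Small enough to bake into the Docker image at build time.
        return "docker_image"
    # Too large for an image layer: keep it on a volume mounted into the pods.
    return "persistent_volume"
```

With this assumed cutoff, a few-hundred-megabyte asset would be embedded in the image, while a 15-gigabyte asset would go to a Persistent Volume.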
Before CD4ML, we used to update a model in the following way.
This was time-consuming, and the same work had to be repeated every time we needed to update a model. So, we decided to automate these repeated steps.
So, we had the following requirements:
At this point we had all the pieces, but they were scattered, and we wanted a platform to connect them.
We used Kubeflow to automate the manual process of updating a model, and connected the steps above by passing variables to the downstream Kubeflow components.
Before moving on, I’d like to provide a quick overview of Kubeflow Pipelines & Polyaxon.
Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers.
Polyaxon makes it faster, easier, and more efficient to develop machine learning and deep learning applications.
Here is a clearer picture of what happens behind the first component:
The job of the very first component is to trigger a Polyaxon job that fetches raw data from various sources, such as data warehouses or image storage. The job then stores the processed, training-ready data in GCS and passes a variable holding the location of that data to the next component. We developed a custom Kubeflow Pipelines component that authorizes a user and submits a Polyaxon job from Kubeflow Pipelines easily.
The output produced by Kubeflow components can be utilized in the downstream components in the following way.
Here, component is the component which writes the var_name output.
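To make the mechanism more concrete, here is a minimal plain-Python sketch of both sides of such a hand-off. It assumes the v1 Kubeflow Pipelines SDK, where a `dsl.ContainerOp` can declare `file_outputs={'var_name': <path>}` and a downstream step refers to the value as `component.outputs['var_name']`. The function names and the GCS path are hypothetical, and the sketch runs without Kubeflow.

```python
import tempfile
from pathlib import Path


def preprocess_step(output_file: Path) -> None:
    """Upstream step: write the location of the processed data to a file."""
    # Hypothetical GCS location; the real bucket layout is not given here.
    data_location = "gs://example-bucket/processed/latest/"
    output_file.parent.mkdir(parents=True, exist_ok=True)
    output_file.write_text(data_location)


def training_step(data_location: str) -> str:
    """Downstream step: receives the upstream output as a plain string."""
    return f"training on data from {data_location}"


def run_locally() -> str:
    """Chain the two steps by hand, the way the pipeline would."""
    with tempfile.TemporaryDirectory() as tmp:
        var_file = Path(tmp) / "var_name.txt"
        preprocess_step(var_file)
        # In a real pipeline, Kubeflow reads this file and substitutes its
        # content wherever a downstream component uses the var_name output.
        return training_step(var_file.read_text())
```

The upstream container only has to write a small file. Everything else, including ordering the steps, follows from the downstream component referencing that output.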
The next component uses the variable from the preceding component to fetch the data and train on it. It stores the trained model resources at a specific location in GCS and passes that location to the succeeding step.
In this way, we keep adding components to the Kubeflow pipeline, each of which triggers a Polyaxon job or experiment, and connect them by passing data between the Kubeflow components.
In the same way, an offline evaluation component evaluates the newly trained model and, depending on the evaluation result, we either proceed to the next steps or stop the workflow here.
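One way such an evaluation step could reduce its result to a single flag is sketched below. The comparison rule is an assumption: the new model must reach at least the baseline value on every baseline metric, and higher is taken to be better. The metric names are made up. The function returns 1 or 0 so that a pipeline condition can test the value directly.

```python
from typing import Dict


def is_model_valid(new_metrics: Dict[str, float],
                   baseline_metrics: Dict[str, float],
                   min_gain: float = 0.0) -> int:
    """Return 1 if the new model passes offline evaluation, otherwise 0.

    Assumes every metric is "higher is better"; metrics missing from the
    new model's results count as a failure.
    """
    for name, baseline_value in baseline_metrics.items():
        if name not in new_metrics:
            return 0
        if new_metrics[name] < baseline_value + min_gain:
            return 0
    return 1
```

For example, `is_model_valid({"f1": 0.90}, {"f1": 0.85})` yields 1, while a model whose `f1` falls below the baseline yields 0 and the workflow stops.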
We can use dsl.Condition to evaluate a condition and decide on the next steps:
with dsl.Condition(is_model_valid == 1, name="Conditional component"):
Let’s assume offline evaluation is successful; all we have left to do is to build a Docker image and push it to GCR.
Kaniko is a tool to build container images from a Dockerfile, inside a container or Kubernetes cluster.
Here is an example of a dsl.ContainerOp that uses kaniko to build an image from a Dockerfile and push it to a registry.
name="Docker image build",
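A build step like this mostly comes down to the executor image and the arguments handed to it. The sketch below assembles those arguments using kaniko's `--dockerfile`, `--context`, and `--destination` flags. The helper name, the bucket, and the image names are hypothetical, and the resulting list would be passed as the container arguments of the build step.

```python
from typing import List

# Public kaniko executor image; pinning a specific tag is advisable in practice.
KANIKO_IMAGE = "gcr.io/kaniko-project/executor:latest"


def kaniko_arguments(context: str,
                     destination: str,
                     dockerfile: str = "Dockerfile") -> List[str]:
    """Build the argument list for one kaniko executor run.

    context:     build context, e.g. a tarball in GCS (gs://...)
    destination: full image reference to push, e.g. gcr.io/<project>/<image>:<tag>
    dockerfile:  path to the Dockerfile inside the build context
    """
    if not context or not destination:
        raise ValueError("context and destination are required")
    return [
        f"--dockerfile={dockerfile}",
        f"--context={context}",
        f"--destination={destination}",
    ]


# Hypothetical values for illustration only.
example_args = kaniko_arguments(
    context="gs://example-bucket/build/context.tar.gz",
    destination="gcr.io/example-project/ml-service:20200101",
)
```

Because kaniko runs entirely in user space, the build step needs no Docker daemon or privileged pod, which is what makes it suitable as an ordinary pipeline step.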
This is how the overall workflow functions. The part inside the dotted line is configured as a recurring run in Kubeflow.
At Mercari, we use Spinnaker for deploying to our Kubernetes clusters. To trigger pipelines programmatically, we configured Spinnaker to subscribe to the Pub/Sub topic that receives an event whenever a new Docker image is pushed to Container Registry.
Therefore, this pipeline automatically deploys to the canary environment in the event of a new image in the registry.
After releasing to canary, we route traffic to either canary or production based on the experiment group the user is in. To do this, we evaluate the feature flag in the upstream service, send a header to the destination service, and route traffic on a header match using an Istio VirtualService.
The following is an example of a VirtualService, which we use for routing. The feature flag is evaluated in the upstream services and based on the evaluation, the corresponding header is sent.
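As a rough sketch of header-based routing of this kind, the code below builds a VirtualService manifest as a Python dictionary and mimics how its rules are evaluated. The header name, header value, service name, and subset names are assumptions, not Mercari's actual configuration. The subsets would also need to be defined in a matching DestinationRule, which is omitted here.

```python
import json
from typing import Dict

HEADER_NAME = "x-experiment-group"  # hypothetical header set by the upstream service
CANARY_VALUE = "canary"             # hypothetical value for the canary group


def build_virtual_service(service: str) -> dict:
    """Build a VirtualService that sends flagged requests to the canary subset."""
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "VirtualService",
        "metadata": {"name": service},
        "spec": {
            "hosts": [service],
            "http": [
                {   # Rule 1: requests carrying the canary header.
                    "match": [{"headers": {HEADER_NAME: {"exact": CANARY_VALUE}}}],
                    "route": [{"destination": {"host": service, "subset": "canary"}}],
                },
                {   # Rule 2: everything else falls through to production.
                    "route": [{"destination": {"host": service, "subset": "production"}}],
                },
            ],
        },
    }


def resolve_subset(manifest: dict, headers: Dict[str, str]) -> str:
    """Mimic first-match-wins rule evaluation (exact header matches only)."""
    for rule in manifest["spec"]["http"]:
        matches = rule.get("match")
        if not matches or any(
            all(headers.get(name) == cond.get("exact")
                for name, cond in m.get("headers", {}).items())
            for m in matches
        ):
            return rule["route"][0]["destination"]["subset"]
    raise LookupError("no route matched")


manifest = build_virtual_service("ml-service")  # hypothetical service name
manifest_json = json.dumps(manifest, indent=2)  # JSON manifests can be applied as-is
```

The order of the rules matters: the header match comes first, and the rule without a match acts as the default route, so users outside the experiment group never reach the canary.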
The current version of CD4ML includes automated ML training and deploying to the canary environment.
In the future, we intend to enhance the project with more functionality. One such feature is:
If building a large-scale distributed system interests you, consider applying for a role at Mercari. https://www.mercari.com/careers/