8 June 2021

Training ML models in the cloud with AWS Sagemaker

At some point, when a startup outgrows its first office (which also happens to be the apartment of one of the founders), it becomes challenging to continue training Machine Learning (ML) models on your laptop, because:

  • Training becomes slow as the dataset grows. Also, you cannot close the laptop while training.
  • As the ML team grows, people need a shared view of the models’ performance.

What’s the next step? Typically it’s a cloud-based platform. The largest players here are AWS Sagemaker, Azure ML and Google AI Platform. These systems are being developed very actively, driven by high demand. As a result, they have exploded in the number of features and configurations for various use cases. The number of features grows faster than the documentation, which is constantly getting outdated. On top of that, the documentation is often fragmented: you have to check multiple tutorials, each of which uses slightly (or not so slightly) different examples. Lastly, tutorials typically walk you through a complete but toy example (usually with CIFAR data) that is not representative of a real-life system.

Below, I’ve put together a reference guide for setting up an ML training pipeline with AWS Sagemaker. It is especially useful when you want to migrate an existing local pipeline.

I chose Sagemaker for a couple of reasons:

  • We already use other AWS services such as S3 and IAM. This way we use a familiar platform and simplify legal aspects for data storage and processing.
  • Sagemaker lets you set up an ML training pipeline with few changes to the existing code.

This guide covers the following topics:

  • Running Sagemaker training jobs with Tensorflow in Docker images. Loading training data from S3.
  • Dependency management. Real-life software usually has dependencies on external libraries. This aspect is largely overlooked in tutorials from vendors.
  • Circle-CI integration configuration.

Below you can see the overall pipeline architecture that we are going to implement:

[Diagram: Sagemaker pipeline architecture]

What’s not (yet) covered:

  • Monitoring and debugging. I will cover this in my next post.
  • Hyperparameter tuning / Sagemaker experiments. Here is a video to learn more about these topics: https://www.youtube.com/watch?v=Tnv6HsT1r4I
  • Distributed training
  • Model serving from AWS
  • Managing training data through Sagemaker

Prerequisites

  • Python 3. The AWS Sagemaker SDK supports Python 3.7 starting from v1.56.2.
  • Tensorflow ≥ 2.1.
  • Python dependencies are pip-installable (either from pypi or github repositories).
  • Training/validation/… datasets are stored on S3.

AWS Resources

⚠️ When creating S3 buckets, make sure they are all created in the same AWS region, and use the same region when submitting Sagemaker jobs. Locally, you can configure the AWS default region with the aws configure command.

Packaging your training pipeline

Sagemaker offers two ways to spawn training jobs: using a pre-built image or shipping your own Docker image. We are going for the latter for the following reasons:

  • Not being dependent on the library versions supported by pre-built images. Pre-built Sagemaker images typically lag behind the latest library versions. With Docker you can ship any version you want, for example the latest Tensorflow dev build.
  • Building a custom Docker image makes it easy to include both public and private dependencies.

This guide assumes the following directory structure of your ML project [1]:

[Screenshot: project directory layout]
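
The original screenshot is not reproduced here. Based on the files described below and the paths used later in the guide (e.g. ./MY_ML_PACKAGE/sagemaker/...), the layout looks roughly like this; the exact placement of train.py and requirements.txt is my assumption:

    MY_ML_PACKAGE/
    ├── my_ml_package/              # your Python package: models, data loading, ...
    │   └── train.py
    └── sagemaker/
        ├── prepare.sh
        ├── build_and_push_image.sh
        ├── Dockerfile
        ├── entrypoint_train.py
        ├── run_sagemaker.py
        └── requirements.txt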

Below is the description of the purpose of each file that we’ll need:

  • prepare.sh: shell script to prepare the build environment – download private dependencies.
  • build_and_push_image.sh: shell script to build a docker image and store it on AWS. Can be called locally or from a CI service. Not needed if you use Circle-CI (see Building an image from Circle-CI below).
  • Dockerfile: standard file for Docker that defines how to assemble an image and run our code.
  • entrypoint_train.py: This is the entry script that is called by Sagemaker in the Docker container. It reads arguments passed to Sagemaker and handles any errors.
  • train.py: python script that loads the data and trains the model. Called by entrypoint_train.py.
  • run_sagemaker.py: Python script to start the training on AWS Sagemaker. Can be called locally or from a CI service.
  • requirements.txt: requirements file for packages needed specifically for training.

Let’s now look at the files and their purpose in detail. The repository with all the files from this guide is also available at: https://github.com/fairtiq/sagemaker-templates/tree/master/sagemaker-docker-basic

Building a docker image

First, the prepare.sh script prepares the environment – it makes sure our entry point is executable and downloads private dependencies:

[Code listing: prepare.sh]
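
The original script isn’t reproduced here; a minimal sketch of what it does, assuming a single private dependency cloned from GitHub into a private-repo directory (adjust the repository URL and paths to your setup):

    #!/bin/bash
    set -e

    # Make sure Sagemaker can execute our entry point inside the container
    chmod +x entrypoint_train.py

    # Download private dependencies next to the Dockerfile so that
    # docker build can copy and install them
    rm -rf private-repo
    git clone git@github.com:COMPANY/PRIVATE-REPO.git private-repo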

Next is the build_and_push_image.sh script, which builds a docker image and pushes it to AWS. You need to customize/set up the following variables:

  • REPO_NAME - the name of the repository to create/add on ECR (Elastic Container Registry). It’s an AWS service that stores your docker images.
  • Replace git@github.com:COMPANY/PRIVATE-REPO.git with the address of the repository containing your private dependency. Add more if needed.
  • Make sure to run aws configure to set up the AWS region you are working from. The region has to match the S3 bucket region where your training/validation data is stored.
  • Change the image tag if needed; the script tags the image as latest by default.

[Code listing: build_and_push_image.sh]
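
For reference, here is a rough sketch of such a script; the repository name, tag and the docker build context are assumptions, so adapt them to your layout:

    #!/bin/bash
    set -e

    REPO_NAME=my-ml-training            # name of the ECR repository
    IMAGE_TAG=latest                    # change the tag if needed
    REGION=$(aws configure get region)
    ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
    ECR_URL="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com"

    # Prepare the build context: entry point permissions, private dependencies
    ./prepare.sh

    # Create the ECR repository if it does not exist yet
    aws ecr describe-repositories --repository-names "${REPO_NAME}" >/dev/null 2>&1 || \
        aws ecr create-repository --repository-name "${REPO_NAME}"

    # Log in to ECR, then build, tag and push the image
    aws ecr get-login-password --region "${REGION}" | \
        docker login --username AWS --password-stdin "${ECR_URL}"
    docker build -t "${REPO_NAME}:${IMAGE_TAG}" .    # adjust the build context to your layout
    docker tag "${REPO_NAME}:${IMAGE_TAG}" "${ECR_URL}/${REPO_NAME}:${IMAGE_TAG}"
    docker push "${ECR_URL}/${REPO_NAME}:${IMAGE_TAG}"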
 

Whenever you push an image to the ECR with the same tag, the previous image will become untagged, but remain in the repository. If you want to clean untagged images automatically, you can set up a lifecycle policy as described here: https://aws.amazon.com/blogs/compute/clean-up-your-container-images-with-amazon-ecr-lifecycle-policies/

Building an image on Circle-CI

Circle-CI is a Continuous Integration cloud service. As we use it at FAIRTIQ, I have created a job to build and push the image to ECR with it. Circle-CI provides a convenient aws-ecr orb that already has the build-and-push-image command we need.

[Code listing: Circle-CI config for building and pushing the image]
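
Since the original config isn’t shown here, a sketch of such a workflow might look like the following. Treat it as an approximation: parameter names differ between aws-ecr orb versions, the context name is a placeholder, and the handling of private dependencies (prepare.sh) is omitted.

    version: 2.1

    orbs:
      # Pin to a concrete orb version; parameter names vary between versions
      aws-ecr: circleci/aws-ecr@x.y.z

    workflows:
      build-training-image:
        jobs:
          - aws-ecr/build-and-push-image:
              # AWS_ECR_ACCOUNT_URL, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
              # and AWS_REGION come from the attached context
              context: my-aws-context
              repo: REPO_NAME
              # one image per branch / pull request
              tag: ${CIRCLE_BRANCH}
              path: MY_ML_PACKAGE/sagemaker
              dockerfile: Dockerfile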

The config reads the branch name you are working on via the $CIRCLE_BRANCH variable and uses it as an image tag, so that you can have separate images for each branch/pull request.

Make sure to provide the following context variables to the job: AWS_ECR_ACCOUNT_URL, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_REGION. Also replace the REPO_NAME and MY_ML_PACKAGE placeholders with your values.

The AWS_ACCESS_KEY_ID should belong to a user with permissions to push images to ECR. I have created a user circleci-aws-ecr-push and attached the existing policy called AmazonEC2ContainerRegistryPowerUser.

Dockerfile

Both build_and_push_image.sh and the Circle-CI job call the docker build command with a Dockerfile as an argument. The Dockerfile describes how to build the container and what to package in it. Below is the file I am using for our ML system, with comments explaining the purpose of the operations:

[Code listing: Dockerfile]
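
The Dockerfile itself isn’t reproduced here; a sketch along the lines of the description that follows could look like this. The base image, system packages and paths are assumptions, and it presumes the Sagemaker training toolkit is among the python dependencies so that SAGEMAKER_PROGRAM is honored:

    # Base image with Tensorflow preinstalled (pick the version you need)
    FROM tensorflow/tensorflow:2.1.0

    # System libraries needed by the dependencies
    RUN apt-get update && \
        apt-get install -y --no-install-recommends git && \
        rm -rf /var/lib/apt/lists/*

    # Private dependencies downloaded by prepare.sh
    COPY private-repo /tmp/private-repo
    RUN pip install /tmp/private-repo

    # Public dependencies
    COPY requirements.txt /tmp/requirements.txt
    RUN pip install -r /tmp/requirements.txt

    # Copy the ML package and the entry point into the image
    COPY . /opt/ml/code
    WORKDIR /opt/ml/code

    # Tell the Sagemaker training toolkit which script to run
    ENV SAGEMAKER_PROGRAM entrypoint_train.py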

The process here is very straightforward. We first install the necessary system libs with apt-get, copy and install the private and public python dependencies, copy the ML package into /opt/ml/code (though it can be any path, really), and set the special SAGEMAKER_PROGRAM variable that tells Sagemaker which file is the entry point.

⚠️ Tensorflow images on DockerHub use Python 3.6, so if you need 3.7 – you’ll need to build an image from another base.

Entrypoint

We have a separate entrypoint_train.py file that calls train.py, so that the latter can also be easily called locally. The purpose of the entrypoint file is two-fold:

  • Load hyperparameters from the json file: with Docker, Sagemaker stores hyperparameters in a json file at a specific path in the file system.
  • Handle errors during the execution and store them for analysis.

You can basically use the file as is; you only need to replace the name of your ML package in the import at the top.

[Code listing: entrypoint_train.py]
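
The original file isn’t reproduced here; a minimal sketch following the description above. The paths are the standard Sagemaker container paths, while my_ml_package, the train_model signature and the train/validation channel names are assumptions that should match your train.py and run_sagemaker.py:

    import json
    import os
    import sys
    import traceback

    # Replace my_ml_package with the name of your ML package
    from my_ml_package.train import train_model

    # Standard Sagemaker paths inside the training container
    HYPERPARAMS_PATH = "/opt/ml/input/config/hyperparameters.json"
    FAILURE_PATH = "/opt/ml/output/failure"


    def main():
        # Sagemaker writes the hyperparameters passed to the Estimator into this
        # json file; note that all values arrive as strings
        with open(HYPERPARAMS_PATH) as f:
            hyperparameters = json.load(f)

        train_model(
            train_dir=os.environ["SM_CHANNEL_TRAIN"],
            validation_dir=os.environ["SM_CHANNEL_VALIDATION"],
            model_dir=os.environ["SM_MODEL_DIR"],
            hyperparameters=hyperparameters,
        )


    if __name__ == "__main__":
        try:
            main()
        except Exception as exc:
            # Store the error so it shows up as the job's FailureReason
            trace = traceback.format_exc()
            with open(FAILURE_PATH, "w") as f:
                f.write(f"Exception during training: {exc}\n{trace}")
            print(trace, file=sys.stderr)
            # A non-zero exit code marks the training job as failed
            sys.exit(1)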

Train.py

As you saw from the entrypoint file, train.py should provide the train_model() function that accepts the model, data directories, and hyperparameters as arguments. Below is an example of such a file:

[Code listing: train.py]
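
Again, the actual file isn’t shown here; a sketch of what such a script could look like. The toy data format (features.npy / labels.npy), the model and the argument names are placeholders for your real code:

    import argparse
    import os

    import numpy as np
    import tensorflow as tf


    def _load_dataset(data_dir):
        """Toy data loading: assumes features.npy / labels.npy files in the directory."""
        x = np.load(os.path.join(data_dir, "features.npy"))
        y = np.load(os.path.join(data_dir, "labels.npy"))
        return x, y


    def train_model(train_dir, validation_dir, model_dir, hyperparameters):
        """Load the data, train the model and save it to model_dir."""
        x_train, y_train = _load_dataset(train_dir)
        x_val, y_val = _load_dataset(validation_dir)

        # Replace with your real model definition
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")

        # Hyperparameters arrive as strings from Sagemaker, hence the casts
        model.fit(
            x_train, y_train,
            validation_data=(x_val, y_val),
            epochs=int(hyperparameters.get("epochs", 10)),
            batch_size=int(hyperparameters.get("batch_size", 32)),
        )

        # Everything written to model_dir (SM_MODEL_DIR) is uploaded to S3 by Sagemaker
        model.save(os.path.join(model_dir, "model"))


    if __name__ == "__main__":
        # Locally the directories are passed on the command line;
        # inside Sagemaker they come from the SM_* environment variables
        parser = argparse.ArgumentParser()
        parser.add_argument("--train-dir", default=os.environ.get("SM_CHANNEL_TRAIN"))
        parser.add_argument("--validation-dir", default=os.environ.get("SM_CHANNEL_VALIDATION"))
        parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR"))
        parser.add_argument("--epochs", type=int, default=10)
        parser.add_argument("--batch-size", type=int, default=32)
        args = parser.parse_args()

        train_model(
            args.train_dir,
            args.validation_dir,
            args.model_dir,
            hyperparameters={"epochs": args.epochs, "batch_size": args.batch_size},
        )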

This file can be run both locally, by providing the necessary arguments, and inside Sagemaker, by reading some arguments from environment variables. That’s why we separated it from the Sagemaker-specific entrypoint file. The SM_CHANNEL_XXXX variables specify the location of the datasets that you need during training. These variables are set by Sagemaker when you submit a training job.

See this document for more details on the environment variables in Sagemaker: https://sagemaker.readthedocs.io/en/stable/using_tf.html

With all the files ready, you can go ahead and build and push the image by running:

sh ./MY_ML_PACKAGE/sagemaker/build_and_push_image.sh

You need to run this script each time you modify your ML package (including its dependencies), so I would advise setting up an automated job on a CI service of your choice.

Launching training on Sagemaker

In the next sections, we look at how to finally start Sagemaker training jobs with the Docker image we built.

Setting up Sagemaker execution role in AWS IAM

The starting point is to create the necessary role to run training jobs on Sagemaker.

The easiest way is to create it as a by-product of a notebook instance [2]: in the AWS Console, go to Sagemaker → Notebook Instances, and click on “Create notebook instance”. Go down to the “Permissions and Encryption” section, select “Create new role”, then follow the workflow. AWS will create a role named “AmazonSageMaker-ExecutionRole-YYYYMMDDT…”. You can cancel the notebook instance creation after the role is created.

Python script to submit a job

To submit a job to AWS Sagemaker, we create an Estimator object and call the fit method on it:

[Code listing: run_sagemaker.py]
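
The original script isn’t reproduced here; a sketch using the v1.x Sagemaker Python SDK (which still uses the train_* argument names). The role ARN, ECR URL, bucket names, instance type, tag and hyperparameter values are placeholders:

    import sagemaker
    from sagemaker.estimator import Estimator

    # Execution role created in the previous step
    SAGEMAKER_ROLE = "arn:aws:iam::123456789012:role/service-role/AmazonSageMaker-ExecutionRole-XXXX"
    AWS_ECR_ACCOUNT_URL = "123456789012.dkr.ecr.eu-west-1.amazonaws.com"
    REPO_NAME = "my-ml-training"
    IMAGE_TAG = "latest"

    estimator = Estimator(
        image_name=f"{AWS_ECR_ACCOUNT_URL}/{REPO_NAME}:{IMAGE_TAG}",
        role=SAGEMAKER_ROLE,
        train_instance_count=1,
        train_instance_type="ml.m5.xlarge",
        output_path="s3://MY_BUCKET/sagemaker/model",
        hyperparameters={"epochs": 10, "batch_size": 32},
        base_job_name=REPO_NAME,
        sagemaker_session=sagemaker.Session(),
    )

    # Each key becomes an SM_CHANNEL_<NAME> variable inside the container;
    # the values must point to S3 directories, not individual files
    estimator.fit(
        inputs={
            "train": "s3://MY_BUCKET/data/train/",
            "validation": "s3://MY_BUCKET/data/validation/",
        }
    )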

This is the run_sagemaker.py script. Let’s break it down:

  • To import the Sagemaker lib, you need to pip install sagemaker first.
  • SAGEMAKER_ROLE is the role we created in the previous step.
  • AWS_ECR_ACCOUNT_URL - the URL of the form {company_id}.dkr.ecr.{region}.amazonaws.com. You can see it on your Amazon ECR repositories page.
  • train_instance_type – the Sagemaker instance type you want to do the training on [3]. Full list: https://aws.amazon.com/sagemaker/pricing/instance-types/.
  • output_path - where Sagemaker will store training artifacts (such as the model file and debug output).
  • hyperparameters – any parameters that you’d like to pass to your training job. They will be stored in a json file and made available inside the container as discussed before.
  • inputs argument in fit(): each key specifies a dataset directory that the training job needs. These will be set as the SM_CHANNEL_XXXX variables described earlier. The values have to be directories, not files, otherwise Sagemaker will throw an exception. Read more about fit() arguments here: https://sagemaker.readthedocs.io/en/stable/using_tf.html#call-the-fit-method.

Setting up a role to launch training jobs

The Sagemaker execution role we defined above is assumed by Sagemaker after starting the training job: it is used to download the docker image, access and store data on S3, etc. [4] But to call the run_sagemaker.py script, we also need a user with another set of permissions.

Here is the basic policy that needs to be attached to the user running the script:

[Code listing: IAM policy for the submitting user]
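
The exact policy isn’t reproduced here; an approximation of what it needs to allow would look roughly like the following. The statements and the wildcard resources are assumptions – tighten them to your account, log group and execution role ARN:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "SubmitAndInspectTrainingJobs",
          "Effect": "Allow",
          "Action": [
            "sagemaker:CreateTrainingJob",
            "sagemaker:DescribeTrainingJob",
            "sagemaker:StopTrainingJob"
          ],
          "Resource": "*"
        },
        {
          "Sid": "StreamTrainingLogs",
          "Effect": "Allow",
          "Action": [
            "logs:DescribeLogStreams",
            "logs:GetLogEvents"
          ],
          "Resource": "arn:aws:logs:*:*:log-group:/aws/sagemaker/TrainingJobs*"
        },
        {
          "Sid": "PassExecutionRoleToSagemaker",
          "Effect": "Allow",
          "Action": "iam:PassRole",
          "Resource": "arn:aws:iam::123456789012:role/service-role/AmazonSageMaker-ExecutionRole-*"
        }
      ]
    }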

Launching training from Circle-CI

As before, our goal is to automate the tasks as much as possible, so we also want to run the training script from a CI service.

The run_sagemaker.py script needs a couple of adjustments:

[Code listing: run_sagemaker.py adjustments for CI]
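
The adjustments aren’t shown here. Given that the post later mentions detached mode and per-branch image tags, the typical changes would be along these lines (my assumptions, not the original diff):

    import os

    # Use the image built by the Circle-CI job for the current branch
    IMAGE_TAG = os.environ.get("CIRCLE_BRANCH", "latest")

    # ... Estimator definition exactly as before ...

    # Submit the job in detached mode: don't stream logs and
    # don't block the CI job until training finishes
    estimator.fit(
        inputs={
            "train": "s3://MY_BUCKET/data/train/",
            "validation": "s3://MY_BUCKET/data/validation/",
        },
        wait=False,
    )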

And here is our Circle CI job definition:

[Code listing: Circle-CI job for launching training]
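
The job definition isn’t reproduced here; a minimal sketch of what it could look like. The docker image, job names, paths and context name are assumptions:

    version: 2.1

    jobs:
      submit-sagemaker-training:
        docker:
          - image: circleci/python:3.7
        steps:
          - checkout
          - run:
              name: Install dependencies
              command: pip install sagemaker
          - run:
              name: Submit the Sagemaker training job
              command: python MY_ML_PACKAGE/sagemaker/run_sagemaker.py

    workflows:
      train-model:
        jobs:
          - submit-sagemaker-training:
              # provides the AWS credentials and region for boto3 / sagemaker
              context: my-aws-context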
 

Putting it all together

Running the run_sagemaker.py script will create a training job with the name REPO_NAME-YYYY-MM-DD-HH-mm-ss-SSS. You should be able to see the status and the meta information about the job under the Training / Training jobs link in the AWS Sagemaker console.

The output should look like the following:

[Example training job output]

When running in detached mode (via CI), you can see the same logs in AWS CloudWatch under the /aws/sagemaker/TrainingJobs log group. After the job is completed, the model will be stored on S3 at s3://MY_BUCKET/sagemaker/model/JOB_NAME/output/model.tar.gz.

The setup is quite involved, and I may have forgotten something along the way, so let me know if something doesn’t work for you.

Bonus

Curious what machine learning problems we are solving at FAIRTIQ? Check out my presentation from Applied ML Days on transport mode detection: https://www.youtube.com/watch?v=P6VXv55UHoM

Notes


    1. I used this guide from Amazon as a starting point: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/tensorflow_bring_your_own/tensorflow_bring_your_own.ipynb. It also describes how to set up inference, which we don’t cover here.
    2. I have also tried to create the role myself with SageMakerFullAccess and S3FullAccess permissions, but this role didn’t work later when starting training jobs. I didn’t find any differences in permissions between my role and the one created through a workflow.
    3. To run this script in local mode, use train_instance_type="local" and remove sagemaker_session argument. This will emulate the whole Sagemaker pipeline, but will run the Docker image on your local machine instead.
    4. Read more here: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_training_job. Search for the RoleArn argument.