At some point, when a startup outgrows its first office (which also happens to be the apartment of one of the founders), it becomes challenging to continue training Machine Learning (ML) models on your laptop, because:
- Training becomes slow as the dataset grows.
- You cannot close the laptop while training.

What’s the next step? Typically it’s a cloud-based platform. The largest players here are AWS Sagemaker, Azure ML and Google AI Platform. These systems are under very active development, driven by high demand. As a result, they have exploded in the number of features and configuration options for various use cases. The number of features grows faster than the documentation, which is constantly getting outdated. On top of that, the documentation is often fragmented: you have to check multiple tutorials, all of which use slightly (or not so slightly) different examples. Lastly, tutorials typically walk you through a complete but toy example (usually with CIFAR data) that is not relevant for a real-life system.
Below, I’ve put together a reference guide for setting up an ML training pipeline with AWS Sagemaker. It is especially useful when you want to migrate an existing local pipeline.
I chose Sagemaker for a couple of reasons:
This guide covers the following topics:
Below you can see the overall pipeline architecture that we are going to implement:
What’s not (yet) covered:
⚠️ When creating S3 buckets, make sure they are all created in the same AWS region, and use the same region when submitting Sagemaker jobs. Locally, you can configure the AWS default region with the aws configure command.
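For example, to set the default region non-interactively (the region name here is only an example):

```sh
# Set the default region used by the AWS CLI and SDKs (example value)
aws configure set region eu-central-1

# Check which region is currently configured
aws configure get region
```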
Sagemaker offers two ways to spawn training jobs: using a pre-built image or shipping your own Docker image. We are going for the latter for the following reasons:
This guide assumes the following directory structure of your ML project1:
Below is a description of the purpose of each file that we’ll need:

- prepare.sh: shell script to prepare the build environment – download private dependencies.
- build_and_push_image.sh: shell script to build a Docker image and store it on AWS. Can be called locally or from a CI service. Not needed if you use Circle-CI (see “Building an image from Circle-CI” below).
- Dockerfile: standard file for Docker that defines how to assemble an image and run our code.
- entrypoint_train.py: the entry script that is called by Sagemaker in the Docker container. It reads the arguments passed to Sagemaker and handles any errors.
- train.py: Python script that loads the data and trains the model. Called by entrypoint_train.py.
- run_sagemaker.py: Python script to start the training on AWS Sagemaker. Can be called locally or from a CI service.
- requirements.txt: requirements file for packages needed specifically for training.

Let’s now look at the files and their purpose in detail. The repository with all the files from this guide is also available at: https://github.com/fairtiq/sagemaker-templates/tree/master/sagemaker-docker-basic
First, the prepare.sh script prepares the environment – it makes sure our entry point is executable and downloads private dependencies:
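A minimal sketch of such a script, assuming the private dependency lives in a git repository (the URL and paths are placeholders):

```sh
#!/bin/sh
# Sketch of prepare.sh – adjust paths to your project layout.
set -e

# Sagemaker calls the entrypoint directly, so it must be executable
chmod +x entrypoint_train.py

# Download the private dependency so it can be copied into the Docker image
rm -rf PRIVATE-REPO
git clone git@github.com:COMPANY/PRIVATE-REPO.git PRIVATE-REPO
```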
Next is the build_and_push_image.sh script that builds a Docker image and pushes it to AWS (a sketch is shown below). You need to customize/set up the following variables:

- REPO_NAME – the name of the repository to create/use on ECR (Elastic Container Registry), the AWS service that stores your Docker images.
- Replace git@github.com:COMPANY/PRIVATE-REPO.git with the address of the repository containing your private dependency. Add more if needed.
- Run aws configure to set up the AWS region you are working from. The region has to match the region of the S3 bucket where your training/validation data is stored.
- The image is pushed with the latest tag (line 18).

Whenever you push an image to ECR with the same tag, the previous image becomes untagged but remains in the repository. If you want to clean up untagged images automatically, you can set up a lifecycle policy as described here: https://aws.amazon.com/blogs/compute/clean-up-your-container-images-with-amazon-ecr-lifecycle-policies/
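Here is a minimal sketch of what such a script could look like (repository name, tag and the AWS CLI v2 login command are assumptions; the original script may differ):

```sh
#!/bin/sh
# Sketch of build_and_push_image.sh – names and paths are placeholders.
set -e

REPO_NAME=my-ml-training
IMAGE_TAG=latest
REGION=$(aws configure get region)
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
ECR_URL="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com"

# Prepare the build context (executable entrypoint, private dependencies)
sh ./prepare.sh

# Create the ECR repository if it does not exist yet
aws ecr describe-repositories --repository-names "${REPO_NAME}" >/dev/null 2>&1 || \
    aws ecr create-repository --repository-name "${REPO_NAME}"

# Log docker in to ECR (AWS CLI v2 syntax)
aws ecr get-login-password --region "${REGION}" | \
    docker login --username AWS --password-stdin "${ECR_URL}"

# Build, tag and push the image
docker build -t "${REPO_NAME}:${IMAGE_TAG}" .
docker tag "${REPO_NAME}:${IMAGE_TAG}" "${ECR_URL}/${REPO_NAME}:${IMAGE_TAG}"
docker push "${ECR_URL}/${REPO_NAME}:${IMAGE_TAG}"
```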
Circle-CI is a Continuous Integration cloud service. As we use it at FAIRTIQ, I have created a job to build and push the image to ECR with it. Circle-CI provides a convenient aws-ecr orb that already has the build-and-push-image command that we need.
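A minimal sketch of such a config, assuming the 6.x series of the aws-ecr orb (the orb version, parameter names and the context name are assumptions; check the orb documentation for your version):

```yaml
version: 2.1

orbs:
  aws-ecr: circleci/aws-ecr@6.15.3

workflows:
  build-and-push:
    jobs:
      # The orb job builds the image and pushes it to ECR in one step
      - aws-ecr/build-and-push-image:
          repo: REPO_NAME
          tag: "${CIRCLE_BRANCH}"
          path: MY_ML_PACKAGE/sagemaker
          dockerfile: Dockerfile
          context: aws-credentials   # context name is a placeholder
```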
The config reads the name of the branch you are working on via the $CIRCLE_BRANCH variable and uses it as the image tag, so that you can have separate images for each branch/pull request.
Make sure to provide the following context variables to the job: AWS_ECR_ACCOUNT_URL, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_REGION. Also replace the REPO_NAME and MY_ML_PACKAGE placeholders with your values.
The AWS_ACCESS_KEY_ID should belong to a user with permissions to push images to ECR. I have created a user circleci-aws-ecr-push and attached the existing policy called AmazonEC2ContainerRegistryPowerUser.
Both build_and_push_image.sh and the Circle-CI job call the docker build command with a Dockerfile as an argument. The Dockerfile describes how to build the container and what to package in it. Below is the file I am using for our ML system, with comments explaining the purpose of each operation:
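A sketch along these lines (the base image, system packages and paths are assumptions, not the exact file used at FAIRTIQ):

```dockerfile
# Sketch of the Dockerfile – base image, packages and paths are assumptions.
FROM tensorflow/tensorflow:2.3.1

# System libraries needed during the build (adjust to your dependencies)
RUN apt-get update && apt-get install -y --no-install-recommends git \
    && rm -rf /var/lib/apt/lists/*

# Public python dependencies; the sagemaker-training toolkit is assumed to be
# listed here so the container knows how to launch SAGEMAKER_PROGRAM
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# Private dependency downloaded by prepare.sh
COPY PRIVATE-REPO /tmp/PRIVATE-REPO
RUN pip install --no-cache-dir /tmp/PRIVATE-REPO

# Copy the ML code into the location Sagemaker expects
COPY . /opt/ml/code
WORKDIR /opt/ml/code

# Tell Sagemaker which script to run inside the container
ENV SAGEMAKER_PROGRAM entrypoint_train.py
```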
The process here is very straightforward. We first install the necessary system libs with apt-get, copy and install the private and public Python dependencies, copy your ML package into /opt/ml/code (but it can be anything, really), and set the special SAGEMAKER_PROGRAM variable for Sagemaker to specify the entrypoint file.
⚠️ Tensorflow images on DockerHub use Python 3.6, so if you need 3.7 – you’ll need to build an image from another base.
We have a separate entrypoint_train.py file that calls train.py, so that the latter can also be easily called locally. The purpose of the entrypoint file is two-fold:
You can basically use the file as is; you only need to replace the name of your ML package on line 8.
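A minimal sketch of such an entrypoint, following the standard Sagemaker container layout (the package name, the train_model() signature and the channel names are assumptions):

```python
# Sketch of entrypoint_train.py – package name and signature are assumptions.
import json
import os
import sys
import traceback

# Replace with the name of your ML package
from my_ml_package.sagemaker.train import train_model

PREFIX = "/opt/ml"
HYPERPARAMS_PATH = os.path.join(PREFIX, "input/config/hyperparameters.json")
MODEL_DIR = os.path.join(PREFIX, "model")
FAILURE_PATH = os.path.join(PREFIX, "output/failure")


def main():
    try:
        # Hyperparameters passed to the Estimator end up in this json file (as strings)
        with open(HYPERPARAMS_PATH) as f:
            hyperparameters = json.load(f)

        # Data channels passed to fit() are exposed as SM_CHANNEL_* variables
        train_dir = os.environ["SM_CHANNEL_TRAIN"]
        validation_dir = os.environ.get("SM_CHANNEL_VALIDATION")

        train_model(
            model_dir=MODEL_DIR,
            train_dir=train_dir,
            validation_dir=validation_dir,
            hyperparameters=hyperparameters,
        )
    except Exception:
        # Writing to the failure file makes the error visible in the Sagemaker console
        trace = traceback.format_exc()
        os.makedirs(os.path.dirname(FAILURE_PATH), exist_ok=True)
        with open(FAILURE_PATH, "w") as f:
            f.write(trace)
        print(trace, file=sys.stderr)
        # A non-zero exit code marks the training job as failed
        sys.exit(1)


if __name__ == "__main__":
    main()
```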
Train.py

As you saw from the entrypoint file, train.py should provide a train_model() function that accepts the model directory, the data directories, and hyperparameters as arguments. Below is an example of such a file:
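A minimal sketch along those lines (the argument names, hyperparameters and the model itself are placeholders):

```python
# Sketch of train.py – argument names and the model are placeholders.
import argparse
import os


def train_model(model_dir, train_dir, validation_dir, hyperparameters):
    """Load the data, train the model and save the artifacts to model_dir."""
    # Hyperparameters arrive from Sagemaker as strings, so cast them explicitly
    epochs = int(hyperparameters.get("epochs", 10))
    learning_rate = float(hyperparameters.get("learning_rate", 1e-3))
    print(f"Training for {epochs} epochs with learning rate {learning_rate}")

    # ... load data from train_dir / validation_dir, build and fit your model here ...

    # Anything written to model_dir ends up in model.tar.gz on S3
    os.makedirs(model_dir, exist_ok=True)
    # model.save(os.path.join(model_dir, "model.h5"))


if __name__ == "__main__":
    # When run locally, the directories are passed on the command line;
    # inside Sagemaker they default to the SM_* environment variables.
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--validation", default=os.environ.get("SM_CHANNEL_VALIDATION"))
    parser.add_argument("--epochs", default="10")
    parser.add_argument("--learning-rate", default="1e-3")
    args = parser.parse_args()

    train_model(
        model_dir=args.model_dir,
        train_dir=args.train,
        validation_dir=args.validation,
        hyperparameters={"epochs": args.epochs, "learning_rate": args.learning_rate},
    )
```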
This file can be run both locally, by providing the necessary arguments, and inside Sagemaker, by reading some arguments from environment variables. That’s why we separated it from the Sagemaker-specific entrypoint file. The SM_CHANNEL_XXXX variables specify the locations of the datasets that you need during training. These variables are set by Sagemaker when you submit a training job.
See this document for more details on the environment variables in Sagemaker: https://sagemaker.readthedocs.io/en/stable/using_tf.html
With all the files ready, you can go ahead and build and push the image by running:
sh ./MY_ML_PACKAGE/sagemaker/build_and_push_image.sh
You need to run this script each time you modify your ML package (including dependencies), so I would advise setting up an automated job on a CI service of your choice.
In the next sections, we look at how to finally start Sagemaker training jobs with the Docker image we built.
The starting point is to create the necessary role to run training jobs on Sagemaker.
The easiest way is to create it as a by-product of the notebook instance2: in AWS Console, go to Sagemaker → Notebook Instances, and click on “Create notebook instance“. Go down to the “Permissions and Encryption” section, and select “Create new role“, then follow the workflow. AWS will create a role named “AmazonSageMaker-ExecutionRole-YYYYMMDDT…“. You can cancel the notebook instance creation after the role is created.
To submit a job to AWS Sagemaker, we create an Estimator object and call its fit method:
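A minimal sketch of such a script, using the sagemaker SDK v1 argument names referenced in this guide (bucket names, the image tag, instance type and hyperparameters are placeholders; newer sagemaker versions renamed e.g. train_instance_type to instance_type and image_name to image_uri):

```python
# Sketch of run_sagemaker.py – account, bucket and parameter values are placeholders.
import sagemaker
from sagemaker.estimator import Estimator

SAGEMAKER_ROLE = "AmazonSageMaker-ExecutionRole-XXXXXXXXXXXX"
AWS_ECR_ACCOUNT_URL = "123456789012.dkr.ecr.eu-central-1.amazonaws.com"
REPO_NAME = "my-ml-training"
IMAGE_TAG = "latest"

session = sagemaker.Session()

estimator = Estimator(
    image_name=f"{AWS_ECR_ACCOUNT_URL}/{REPO_NAME}:{IMAGE_TAG}",
    role=SAGEMAKER_ROLE,
    train_instance_count=1,
    train_instance_type="ml.m5.xlarge",
    output_path="s3://MY_BUCKET/sagemaker/model",
    hyperparameters={"epochs": 10, "learning_rate": 1e-3},
    sagemaker_session=session,
    base_job_name=REPO_NAME,
)

# Each key becomes an SM_CHANNEL_<KEY> variable inside the container;
# the values must point to S3 directories, not individual files.
estimator.fit(
    inputs={
        "train": "s3://MY_BUCKET/sagemaker/data/train/",
        "validation": "s3://MY_BUCKET/sagemaker/data/validation/",
    }
)
```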
Let’s break down the run_sagemaker.py script:

- You need to pip install sagemaker first.
- SAGEMAKER_ROLE is the role we created in the previous step.
- AWS_ECR_ACCOUNT_URL – URL of the form {company_id}.dkr.ecr.{region}.amazonaws.com. You can see it on your Amazon ECR repositories page.
- train_instance_type – the Sagemaker instance type you want to do the training on3. Full list: https://aws.amazon.com/sagemaker/pricing/instance-types/.
- output_path – where Sagemaker will store training artifacts (such as the model file and debug output).
- hyperparameters – any parameters that you’d like to pass to your training job. They will be stored in a json file inside the Docker container, as discussed before.
- The inputs argument in fit(): each key specifies a dataset directory that the training job needs. These will be set as the SM_CHANNEL_XXXX variables described earlier. The values have to be directories, not files, otherwise Sagemaker will throw an exception. Read more about fit() arguments here: https://sagemaker.readthedocs.io/en/stable/using_tf.html#call-the-fit-method.

The Sagemaker execution role we defined above is assumed by Sagemaker after starting the training job: it is used to download the Docker image, access and store data on S3, etc.4 But to call the run_sagemaker.py script we also need a user with another set of permissions.
Here is the basic policy that needs to be attached to the user running the script:
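A minimal sketch of such a policy, assuming the job is submitted in attached mode so the user also needs to read the training logs (the exact actions and resource ARNs are assumptions; tighten them for your account):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SubmitTrainingJobs",
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateTrainingJob",
        "sagemaker:DescribeTrainingJob",
        "sagemaker:StopTrainingJob"
      ],
      "Resource": "*"
    },
    {
      "Sid": "PassExecutionRoleToSagemaker",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::ACCOUNT_ID:role/service-role/AmazonSageMaker-ExecutionRole-*"
    },
    {
      "Sid": "ReadTrainingLogs",
      "Effect": "Allow",
      "Action": [
        "logs:DescribeLogStreams",
        "logs:GetLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:log-group:/aws/sagemaker/TrainingJobs*"
    }
  ]
}
```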
As before, our goal is to automate the tasks as much as possible, so we also want to run the training script from a CI service.
The run_sagemaker.py script needs a couple of adjustments:
And here is our Circle CI job definition:
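A minimal sketch of such a job (the Docker image, paths and context name are placeholders):

```yaml
version: 2.1

jobs:
  train-on-sagemaker:
    docker:
      - image: cimg/python:3.7
    steps:
      - checkout
      # The sagemaker SDK is needed to submit the training job
      - run: pip install sagemaker
      - run: python MY_ML_PACKAGE/sagemaker/run_sagemaker.py

workflows:
  train:
    jobs:
      - train-on-sagemaker:
          context: aws-credentials   # context name is a placeholder
```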
Running the run_sagemaker.py script will create a training job with the name REPO_NAME-YYYY-MM-DD-HH-mm-ss-SSS. You should be able to see the status and the meta information about the job under the Training/Training jobs link in the AWS Sagemaker console.
The output should look like the following:
When running in detached mode (via CI), you can see the same logs in AWS CloudWatch under the /aws/sagemaker/TrainingJobs log group. After the job is completed, the model will be stored on S3 at s3://MY_BUCKET/sagemaker/model/JOB_NAME/output/model.tar.gz.
The setup is quite involved, and I may have forgotten something along the way, so let me know if something doesn’t work for you.
Curious what machine learning problems we are solving at FAIRTIQ? Check out my presentation from Applied ML Days on transport mode detection: https://www.youtube.com/watch?v=P6VXv55UHoM
train_instance_type="local"
and remove sagemaker_session
argument. This will emulate the whole Sagemaker pipeline, but will run the Docker image on your local machine instead.
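A minimal sketch of the local-mode variant (it requires a local Docker daemon and pip install 'sagemaker[local]'; the image name, role ARN and paths are placeholders):

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_name="REPO_NAME:latest",                 # a locally built image also works
    role="arn:aws:iam::123456789012:role/dummy",   # placeholder; no AWS instance is launched
    train_instance_count=1,
    train_instance_type="local",                   # run the container on this machine
    output_path="file://./sagemaker-output",
    hyperparameters={"epochs": 1},
)

# file:// inputs keep the data on the local machine
estimator.fit(inputs={"train": "file://./data/train/"})
```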