Sagemaker: Monitoring and debugging

This post is about setting up monitoring and debugging for Tensorflow model training on AWS Sagemaker. The variables are based on the Sagemaker Basic Setup guide from this blog.

Here, we cover the following topics:

  • Sagemaker metrics
  • Sagemaker Debugger
  • Tensorboard with Sagemaker

Sagemaker metrics

When you train models on Sagemaker, all the logs that your program emits are stored in AWS CloudWatch, AWS's cloud logging service. Sagemaker allows us to define regular expressions that extract numeric metrics from these logs and visualise them.

The metrics are defined via the metric_definitions argument to the Estimator object. Here is how we can define the training and validation loss metrics for Tensorflow models:

    tf_estimator = Estimator(
        ...,  # other arguments as before
        metric_definitions=[
            {"Name": "train:loss", "Regex": "loss: (.*?) "},
            {"Name": "val:loss", "Regex": "val_loss: (.*?) "},
        ],
    )

When you run a job with the new metric definitions, the job details page shows both concise metric graphs and a View algorithm metrics link. The link navigates to the CloudWatch metrics page, where you can inspect the metric values in detail. Note that the link always opens the graph for the last 3 hours, so if you're viewing a job that completed yesterday or last week, you need to manually navigate to the right time period.
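To see why these patterns work, here is a quick sketch of how such a regex pulls numbers out of a typical Keras progress line (the log line below is an illustrative example, not real output from this project):

```python
import re

# A log line that Keras typically prints during training (illustrative example)
log_line = "1875/1875 - 3s - loss: 0.2412 - accuracy: 0.9301 - val_loss: 0.1281 "

# The same patterns as in the metric definitions above
metric_patterns = {
    "train:loss": r"loss: (.*?) ",
    "val:loss": r"val_loss: (.*?) ",
}

for name, pattern in metric_patterns.items():
    match = re.search(pattern, log_line)
    if match:
        # The non-greedy (.*?) captures everything up to the next space
        print(f"{name} = {float(match.group(1))}")
# → train:loss = 0.2412
# → val:loss = 0.1281
```

Note the trailing space in each pattern: it is what terminates the non-greedy capture, so make sure your logs actually print a space after the number.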

Sagemaker Debugger

Sagemaker Debugger allows you to save the internal training state at periodic intervals, which you can then analyse in two ways:

  • Interactively by viewing the data in Jupyter notebooks
  • In real-time by using debugger rules

To enable the Sagemaker debugger, we need to amend two files: the script that launches the training job and the training script itself.

    from sagemaker.debugger import CollectionConfig, DebuggerHookConfig

    s3_output_location = f"s3://{BUCKET_NAME}/sagemaker/{REPO_NAME}"

    hook_config = DebuggerHookConfig(
        s3_output_path=f"{s3_output_location}/debug",  # any S3 prefix works
        collection_configs=[CollectionConfig(name="losses")],
    )
    tf_estimator = Estimator(
        ...,  # other arguments as before
        debugger_hook_config=hook_config,
    )

As a minimal configuration, we provide the S3 location where the debug data will be stored, list the collections to save, and pass the hook config to the Estimator object.

The full list of built-in collections is available here:

More about available configuration options:

    import smdebug.tensorflow as smd


    def train_model(args):
        # Sagemaker writes a JSON debugger config into the container;
        # load it here and pass the hook to Keras as a callback
        debug_hook = smd.KerasHook.create_from_json_file()
        ...
        model.fit(..., callbacks=[debug_hook])


Remember that we are using a custom Docker image to run training jobs. In this case, Sagemaker creates a JSON config for the debugger and places it inside our container. We can then use the create_from_json_file function to load this config and pass the resulting hook as a callback to our Tensorflow model's fit method. Use KerasHook if your model uses the Keras API with Tensorflow 2.x.

More about different hook classes here:

Interactive data analysis via Jupyter Notebooks

I haven't worked with this functionality myself, but there are some relevant notebooks from AWS:

Real-time analysis via debugger rules

Sagemaker also has a concept of debugger rules. Rules analyse the debugger output in real time during job execution and watch for common training problems such as overfitting, vanishing gradients, and class imbalance. I haven't experimented much with them.
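As a sketch of how rules can be attached (I haven't used this much myself; the rule names below come from the SDK's built-in rule_configs module), you pass a list of Rule objects to the Estimator:

```python
from sagemaker.debugger import Rule, rule_configs

# Built-in rules that watch the debugger output for common training problems
rules = [
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.class_imbalance()),
]

tf_estimator = Estimator(
    ...,  # other arguments as before
    rules=rules,
)
```

Each rule runs as a separate processing job alongside training, and its status shows up on the job details page.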

The available rules also differ depending on the model type. Here is the full list:

More about the relationship between the debugger hook and rules here:

Tensorboard with Sagemaker

Tensorboard is a tool to visualise and understand your Tensorflow models. To store Tensorboard output with Sagemaker, we need to add a couple more lines to the script that launches the training job:


    from sagemaker.debugger import TensorBoardOutputConfig

    tensorboard_output_config = TensorBoardOutputConfig(
        s3_output_path=f"{s3_output_location}/tensorboard",
        container_local_output_path="/opt/ml/output/tensorboard",
    )
    tf_estimator = Estimator(
        ...,  # other arguments as before
        tensorboard_output_config=tensorboard_output_config,
    )

First, we configure the paths where Tensorboard data is stored on S3 and locally within our Docker container (the local path usually doesn't need to change). Second, we pass the tensorboard_output_config argument to the Estimator constructor.

⚠️ A debugger hook is required for Tensorboard on Sagemaker: it specifies the collections to store, as described above. If you specify only the Tensorboard config, Sagemaker will not throw any errors or warnings, but nothing will be written to S3. So when you run your first job with such a config, check that the specified S3 directory is getting filled with data.

To run the Tensorboard on your machine:

  • aws configure if you haven't done so already.
  • export AWS_REGION=XXX
  • export AWS_LOG_LEVEL=3 (otherwise the S3 library gets too talkative)
  • pip install tensorboard
  • tensorboard --logdir "s3://BUCKET/sagemaker/REPO_NAME/tensorboard/JOB_NAME/tensorboard-output"

You can then access Tensorboard UI on http://localhost:6006.
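The logdir above follows a fixed pattern: Sagemaker appends the job name and a tensorboard-output suffix to the S3 output path you configured. A small sketch of how the pieces fit together (the bucket, repo, and job names are made-up examples):

```python
# Made-up example values; substitute your own
BUCKET_NAME = "my-bucket"
REPO_NAME = "my-repo"
JOB_NAME = "my-repo-2020-04-27-18-32-45-942"

# The s3_output_path configured in TensorBoardOutputConfig;
# Sagemaker appends "<job name>/tensorboard-output" to it per job
s3_output_path = f"s3://{BUCKET_NAME}/sagemaker/{REPO_NAME}/tensorboard"
logdir = f"{s3_output_path}/{JOB_NAME}/tensorboard-output"
print(logdir)
# → s3://my-bucket/sagemaker/my-repo/tensorboard/my-repo-2020-04-27-18-32-45-942/tensorboard-output
```

This makes it easy to script the Tensorboard launch for any finished job: only JOB_NAME changes between runs.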

Run Tensorboard on Sagemaker Notebooks

To run Tensorboard in a Jupyter notebook on Sagemaker:

    %load_ext tensorboard
    logs_base_dir = f"s3://{BUCKET_NAME}/sagemaker/tensorboard/{REPO_NAME}-2020-04-27-18-32-45-942/tensorboard-output/"
    %tensorboard --logdir {logs_base_dir}

If the magic command (%) times out, start Tensorboard from the command line:

    /home/ec2-user/anaconda3/envs/tensorflow2_p36/bin/tensorboard --logdir LOG_DIR

And then connect from the browser using a special proxy port: https://NOTEBOOK_URL/proxy/6006/

More details here:

Putting it all together

All the files from this post can be found at: