A “cold start” is the time it takes to spin up a new instance of a model server. Fast cold starts are essential for useful autoscaling, especially scaling to zero.

While Baseten has platform-level features that speed up cold starts, a lot of the possible optimizations are model-specific. This guide provides techniques for making cold starts faster for a given model.

Use caching to reduce cold start time

Everything that happens in the load() function in the Truss counts toward cold start time. This generally includes downloading weights for one or more models from a source like Hugging Face, which is often the longest-running step.
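For example, here’s a minimal sketch of a Truss model.py (the text-generation model is illustrative); everything inside load() runs on every cold start, so the weight download there is the time to shrink:

```python
# model/model.py in a Truss (sketch; model choice is illustrative)
from transformers import pipeline


class Model:
    def __init__(self, **kwargs):
        self._pipeline = None

    def load(self):
        # Runs once per replica, on every cold start.
        # Downloading weights from Hugging Face is typically
        # the slowest part of this function.
        self._pipeline = pipeline("text-generation", model="gpt2")

    def predict(self, model_input):
        # Runs on every request; not part of the cold start.
        return self._pipeline(model_input["prompt"])
```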

Caching model weights can dramatically improve cold start times. Learn how to cache model weights in Truss with this guide:

Truss guide: Caching model weights

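Under the hood, Truss exposes this through the model_cache key in config.yaml. Here’s a minimal sketch; the repo and file patterns below are illustrative, and the linked guide has the current syntax:

```yaml
# config.yaml (sketch): cache these weights so load() doesn't
# re-download them on every cold start
model_cache:
  - repo_id: stabilityai/stable-diffusion-2-1
    allow_patterns:
      - "*.safetensors"
      - "*.json"
```

With the weights cached, load() reads them from local storage instead of pulling them over the network on every cold start.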

Use wake to hide cold start time

Every deployment has a wake endpoint that activates the model when it’s scaled to zero. By calling it proactively, you can hide the cold start time from the end user.

Imagine you have an app where the user can enter a prompt and get an image from Stable Diffusion. The app has inconsistent traffic, so you have a minimum replica count of zero. Here’s what happens when the model is scaled to zero and a user arrives:

  1. The user loads the app
  2. The user enters input and the app calls the model endpoint
  3. Baseten spins up an instance and loads the model (the time this takes is the cold start)
  4. Model inference runs
  5. After waiting through both the cold start and inference, the user receives the image they requested

But you can use the wake endpoint to hide the cold start time from the user. Instead:

  1. The user loads the app
  2. The app calls the wake endpoint for the scaled-to-zero model
  3. Baseten spins up an instance and loads the model (the time this takes is the cold start)
  4. Meanwhile, the user enters input and the app calls the model endpoint
  5. Model inference runs
  6. The user receives the image they requested
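
In practice, step 2 can be a single fire-and-forget HTTP call when the app loads. Here’s a sketch in Python; the wake URL below is a placeholder, so copy the exact endpoint for your deployment from the model dashboard or Baseten’s API reference:

```python
import os

import requests

# Placeholder: get the real wake endpoint for your deployment from
# the model dashboard or Baseten's API reference.
WAKE_URL = "https://model-<model_id>.api.baseten.co/production/wake"


def wake_model() -> None:
    """Call when the user loads the app, before they submit a prompt."""
    response = requests.post(
        WAKE_URL,
        headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
        timeout=10,
    )
    response.raise_for_status()
```

By the time the user has typed a prompt, the replica is already loading (or loaded), so their request only waits on inference.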

Wake is also useful when you have predictable traffic, such as spinning the model up at the start of business hours. It can also be triggered manually from the model dashboard when needed, such as before a demo.