Configuring concurrency is one of the major knobs available for getting the most performance out of your model. In this doc, we’ll cover the options that are available to you.

Configuring concurrency

At a very high level, “concurrency” on Baseten refers to how many requests a single replica can process at the same time. There’s no universal best value for concurrency; it depends on your model and the metrics that you are optimizing for (such as throughput or latency).

In Baseten and Truss, there are two levers for managing concurrency:

  • Concurrency target: set in the Baseten UI, governs the maximum number of requests that will be sent to a single model replica at the same time.
  • Predict concurrency: set in the Truss config, governs how many requests can go through the predict function on your Truss at once after they’ve made it to the model container.

Concurrency target

The concurrency target is set in the Baseten UI and governs the maximum number of requests that will be sent to a single model replica. Once the concurrency target is exceeded across all active replicas, the autoscaler will add more replicas (unless the max replica count is reached).

Let’s dive into a concrete example. Let’s say that we have:

  • A model deployment with exactly 1 replica.
  • A concurrency target of 2 requests.
  • 5 incoming requests.

In this situation, the first 2 requests will be sent to the model container, while the other 3 are placed in a queue. As the requests in the container are completed, requests are sent in from the queue.

However, if the model deployment’s autoscaling settings were to allow for more than one replica, this situation would trigger another replica to be created as there are requests in the queue.
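
The single-replica scenario above can be sketched in a few lines of Python. This is only an illustration of the queueing behavior described here, not Baseten's actual routing code; the function names `dispatch` and `complete_one` are hypothetical.

```python
from collections import deque


def dispatch(num_requests: int, concurrency_target: int):
    """Split incoming request IDs into those sent to the replica and those queued."""
    requests = list(range(1, num_requests + 1))
    in_flight = requests[:concurrency_target]       # sent to the model container
    queued = deque(requests[concurrency_target:])   # wait outside the container
    return in_flight, queued


def complete_one(in_flight: list, queued: deque) -> int:
    """Finish the oldest in-flight request and admit the next queued one, if any."""
    finished = in_flight.pop(0)
    if queued:
        in_flight.append(queued.popleft())
    return finished


in_flight, queued = dispatch(num_requests=5, concurrency_target=2)
print(in_flight, list(queued))  # [1, 2] [3, 4, 5]

complete_one(in_flight, queued)
print(in_flight, list(queued))  # [2, 3] [4, 5]
```

With autoscaling enabled, the non-empty queue in the first step is the condition that would prompt the autoscaler to add a replica.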

Predict concurrency

Predict concurrency operates within the model container and governs how many requests will go through the Truss’ predict function concurrently.

A Truss can implement three functions to process a request:

  • preprocess: processes model input before inference. For example, in a Truss for Whisper, this function might download the audio file for transcription from a URL in the request body.
  • predict: performs model inference. This is the only function that blocks the GPU.
  • postprocess: processes model output after inference, such as uploading the results of a text-to-image model like Stable Diffusion to S3.
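
As a rough sketch, the three functions might sit on a Truss model class like this. The method bodies are placeholders invented for illustration (the "model" is just a function that uppercases text), so treat this as the shape of the class rather than a working inference pipeline.

```python
class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # Placeholder: a real Truss would load model weights here.
        self._model = lambda text: text.upper()

    def preprocess(self, model_input: dict) -> dict:
        # CPU- or IO-bound preparation, e.g. downloading or normalizing input.
        return {"text": model_input["text"].strip()}

    def predict(self, model_input: dict) -> dict:
        # Inference: the step that predict concurrency limits.
        return {"output": self._model(model_input["text"])}

    def postprocess(self, model_output: dict) -> dict:
        # CPU- or IO-bound follow-up, e.g. uploading or annotating results.
        return {**model_output, "length": len(model_output["output"])}


model = Model()
model.load()
result = model.postprocess(model.predict(model.preprocess({"text": "  hello  "})))
print(result)  # {'output': 'HELLO', 'length': 5}
```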

The predict concurrency setting lets you limit access to the GPU-blocking predict function while still handling pre- and post-processing steps with higher concurrency.

Predict concurrency is set in the Truss’ config.yaml file:

model_name: "My model with concurrency limits"
runtime:
    predict_concurrency: 2 # the default is 1

To better understand this, let’s extend our previous example by zooming in on the model pod:

  • A model deployment with exactly 1 replica.
  • A concurrency target of 2 requests.
  • New: a predict concurrency of 1 request.
  • 5 incoming requests.

Here’s what happens:

  1. Two requests enter the model container.
  2. Both requests begin pre-processing immediately.
  3. When one request finishes pre-processing, it is let into the GPU to run inference. The other request will be queued if it finishes pre-processing before the first request finishes inference.
  4. After the first request finishes inference, it moves to post-processing and the second request begins inference on the GPU.
  5. After the second request finishes inference, it can immediately move to post-processing whether or not the first request is still in post-processing.

This shows how predict concurrency protects the GPU resources in the model container while still allowing for high concurrency in the CPU-bound pre- and post-processing steps.
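
The walkthrough above can be approximated with a small asyncio simulation. It assumes a semaphore as a stand-in for the predict concurrency limit; this is not a description of how Truss implements the setting internally, and the names `handle` and `simulate` and the sleep durations are made up for the example.

```python
import asyncio

PREDICT_CONCURRENCY = 1  # mirrors the predict concurrency in the example


async def handle(request_id, predict_slots, stats, log):
    # Pre-processing: no limit, so both requests start it right away.
    log.append((request_id, "preprocess"))
    await asyncio.sleep(0.01)

    # Inference: at most PREDICT_CONCURRENCY requests hold a slot at once.
    async with predict_slots:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        log.append((request_id, "predict"))
        await asyncio.sleep(0.02)
        stats["active"] -= 1

    # Post-processing: unrestricted again.
    log.append((request_id, "postprocess"))
    await asyncio.sleep(0.01)


async def simulate(num_requests=2):
    predict_slots = asyncio.Semaphore(PREDICT_CONCURRENCY)
    stats = {"active": 0, "peak": 0}
    log = []
    await asyncio.gather(
        *(handle(i, predict_slots, stats, log) for i in range(1, num_requests + 1))
    )
    return stats["peak"], log


peak, log = asyncio.run(simulate())
print(peak)  # 1
```

The recorded peak never exceeds the predict concurrency, while the first two log entries show both requests in pre-processing at the same time.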

Concurrency target must be greater than or equal to predict concurrency, or your maximum predict concurrency will never be reached.
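
One way to see why: at most "concurrency target" requests are inside the container at any moment, so the number that can be in predict at once is capped by the smaller of the two settings. The helper below is a hypothetical illustration of that arithmetic, not part of Truss or Baseten.

```python
def effective_predict_concurrency(concurrency_target: int, predict_concurrency: int) -> int:
    """Upper bound on requests that can be inside predict at the same time."""
    return min(concurrency_target, predict_concurrency)


print(effective_predict_concurrency(concurrency_target=2, predict_concurrency=1))  # 1
print(effective_predict_concurrency(concurrency_target=2, predict_concurrency=4))  # 2
```

In the second call, a predict concurrency of 4 is never reached because only 2 requests are ever admitted to the replica.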