Configuring concurrency is one of the major knobs available for getting the most performance out of your model. In this doc, we’ll cover the options that are available to you.

What is concurrency, and why configure it?

At a very high level, “concurrency” in this context refers to how many requests a single replica can process at the same time. There is no single right answer for what this number ought to be: the specifics of your model and the metrics you are optimizing for (throughput? latency?) matter a lot when determining it.

In Baseten & Truss, there are two notions of concurrency:

  • Concurrency Target — the maximum number of requests that will be sent to a single model replica at the same time
  • Predict Concurrency — once requests have made it onto the model container, this governs how many of them can go through the predict function on your Truss at once

Concurrency Target

The concurrency target is set in the Baseten UI and, to reiterate, governs the maximum number of requests that will be sent to a single model replica.

An important note about this setting is that it is also part of the autoscaling behavior: if all replicas have hit their Concurrency Target, Baseten’s autoscaling is triggered.

Let’s dive into a concrete example:

Let’s say that there is a single replica of a model and the concurrency target is 2. If 5 requests come in, the first 2 are sent to the replica and the other 3 are queued. As the requests on the container complete, the queued requests are sent on to the model container.

Remember that if all replicas have hit their concurrency target, this will trigger autoscaling. So in this specific example, the queuing of requests 3-5 will trigger another replica to come up, if the model has not hit its max replicas yet.
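To make the dispatch arithmetic concrete, here is a minimal sketch in Python. It is purely illustrative, not Baseten’s actual scheduler or autoscaler, and the MAX_REPLICAS value is made up for the example:

CONCURRENCY_TARGET = 2
MAX_REPLICAS = 4  # hypothetical ceiling for this example

def dispatch(num_requests: int, num_replicas: int) -> None:
    # Purely illustrative: split incoming requests between replica capacity
    # and the queue, then check whether a scale-up would be triggered.
    capacity = num_replicas * CONCURRENCY_TARGET
    in_flight = min(num_requests, capacity)
    queued = num_requests - in_flight
    print(f"{in_flight} request(s) sent to the replica(s), {queued} queued")

    # Every replica is at its Concurrency Target and work is still queued,
    # so another replica would be brought up (if max replicas allows it).
    if queued > 0 and num_replicas < MAX_REPLICAS:
        print("all replicas at their Concurrency Target -> scale up")

dispatch(num_requests=5, num_replicas=1)  # 2 sent, 3 queued, scale up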

Predict Concurrency

Alright, so we’ve talked about the Concurrency Target setting that governs how many requests will be sent to a model at once. Predict Concurrency is a bit different: it operates at the level of the model container and governs how many requests will go through the predict function concurrently.

To get a sense for why this matters, let’s recap the structure of a Truss:

model.py
class Model:

    def __init__(self):
        # Runs once when the model server starts up.
        ...

    def preprocess(self, request):
        # Prework on the incoming request before inference.
        ...

    def predict(self, request):
        # The inference itself, typically running on the GPU.
        ...

    def postprocess(self, response):
        # Postwork on the response before it is returned to the user.
        ...

In this Truss model, there are three functions that run, in order, to serve a request (a filled-in sketch follows the list):

  • preprocess — this function performs any prework / modifications on the request before the predict function runs. For instance, if you are running an image classification model and need to download images from S3, this is a good place to do it.
  • predict — this function is where the actual inference happens and is typically where the code that runs on the GPU lives.
  • postprocess — this function performs any postwork / modifications on the response before it is returned to the user. For instance, if you are running a text-to-image model, this is a good place to implement the logic for uploading the generated image to S3.
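To make this concrete, here is a hypothetical filled-in version of the skeleton above for an image classification model that pulls its input from S3. The bucket/key request fields, the stand-in model, the boto3 dependency, and the response fields are all assumptions for illustration, not Truss or Baseten APIs:

import boto3  # assumed dependency for the S3 download in preprocess

class Model:
    def __init__(self):
        # A real model would be initialized here; a trivial stand-in
        # keeps the sketch self-contained.
        self._model = lambda image_bytes: {"label": "cat", "bytes": len(image_bytes)}
        self._s3 = boto3.client("s3")

    def preprocess(self, request):
        # I/O-bound prework: download the image referenced by the request.
        obj = self._s3.get_object(Bucket=request["bucket"], Key=request["key"])
        return {"image_bytes": obj["Body"].read()}

    def predict(self, inputs):
        # In a real model this is the GPU-bound step, the one that
        # predict_concurrency limits.
        return self._model(inputs["image_bytes"])

    def postprocess(self, response):
        # I/O-bound postwork: annotate the response before returning it.
        response["model_name"] = "image-classifier-sketch"
        return response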

Given what these three functions do, you might want a different level of concurrency for the predict function than for the others. The most common need here is to limit access to the GPU, since multiple requests running on the GPU at the same time could cause serious performance degradation.

Unlike Concurrency Target, which is configured in the Baseten UI, Predict Concurrency is configured as part of the Truss config (in the config.yaml file).

config.yaml
model_name: "My model with concurrency limits"
...
runtime:
    predict_concurrency: 2 # the default is 1
...

To better understand this, let’s zoom in on the model pod:

Let’s say predict concurrency is 1.

  1. Two requests come in to the pod.
  2. Both requests will begin preprocessing immediately (let’s say, downloading images from S3).
  3. Once the first request finishes preprocessing, it will begin running on the GPU. The second request will then remain queued until the first request finishes running on the GPU in predict.
  4. After the first request finishes predict, the second request begins processing on the GPU.
  5. Once the second request finishes predict, it begins postprocessing immediately, even if the first request is not done postprocessing.

To reiterate, predict concurrency is a great tool when you want to protect the GPU on your model pod while still allowing high concurrency for the preprocessing and postprocessing steps.
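Conceptually, you can think of predict concurrency as a semaphore wrapped around the predict step. The sketch below is not Truss’s actual implementation; it just reproduces the queuing behavior described above, with sleeps standing in for the I/O and GPU work:

import asyncio

PREDICT_CONCURRENCY = 1  # mirrors predict_concurrency in config.yaml

async def handle_request(request_id: int, predict_slot: asyncio.Semaphore) -> None:
    # Preprocessing runs for every request immediately (e.g. downloading from S3).
    await asyncio.sleep(0.1)
    print(f"request {request_id}: preprocessing done")

    # Only PREDICT_CONCURRENCY requests may run predict at once.
    async with predict_slot:
        print(f"request {request_id}: predict started")
        await asyncio.sleep(0.5)  # stand-in for GPU inference
        print(f"request {request_id}: predict finished")

    # Postprocessing is again unrestricted (e.g. uploading results to S3).
    await asyncio.sleep(0.1)
    print(f"request {request_id}: postprocessing done")

async def main() -> None:
    predict_slot = asyncio.Semaphore(PREDICT_CONCURRENCY)
    # Two requests arrive together: both preprocess right away, but the second
    # waits for the first to leave predict before starting its own predict.
    await asyncio.gather(handle_request(1, predict_slot), handle_request(2, predict_slot))

asyncio.run(main())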

Remember that to actually achieve your desired predict concurrency, the Concurrency Target must be at least that high, so that enough requests make it onto the model container in the first place.