Overview

WebSockets provide a persistent, full-duplex communication channel between clients and server-side models or chains. Full duplex means that chunks of data can be sent client→server and server→client simultaneously and repeatedly.

This guide covers how to implement WebSocket-based interactions for Truss models and Chains/Chainlets.

Unlike traditional request-response models, WebSockets allow continuous data exchange without reopening connections. This is useful for real-time applications, streaming responses, and maintaining lightweight interactions. Example applications could be real-time audio transcription, AI phone calls or agents with turn-based interactions.

WebSocket Usage in Chains/Chainlets

For Chains, WebSockets are wrapped in a reduced API object, WebSocketProtocol. All processing happens in the run_remote method as usual, but inputs and outputs (or “return values”) are exchanged through the WebSocket object using its async send_* and receive_* methods (there are variants for text, bytes, and JSON).
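
For orientation, that interface corresponds roughly to the following sketch. This is illustrative only, derived from the description above rather than from the library source; exact signatures may differ.

import typing

class WebSocketProtocol(typing.Protocol):
    # Sketch only: mirrors fastapi.WebSocket minus connection acceptance,
    # which the Chainlet runtime has already performed for you.
    async def receive_text(self) -> str: ...
    async def receive_bytes(self) -> bytes: ...
    async def receive_json(self) -> typing.Any: ...
    async def send_text(self, data: str) -> None: ...
    async def send_bytes(self, data: bytes) -> None: ...
    async def send_json(self, data: typing.Any) -> None: ...
    async def close(self, code: int = 1000, reason: typing.Optional[str] = None) -> None: ...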

Implementation Example

import fastapi
import truss_chains as chains

class Dependency(chains.ChainletBase):
    async def run_remote(self, name: str) -> str:
        return f"Hello from dependency, {name}."

@chains.mark_entrypoint
class Head(chains.ChainletBase):
    def __init__(self, dependency: Dependency = chains.depends(Dependency)):
        self._dependency = dependency

    async def run_remote(self, websocket: chains.WebSocketProtocol) -> None:
        try:
            while True:
                message = await websocket.receive_text()
                if message == "dep":
                    response = await self._dependency.run_remote("Head")
                else:
                    response = f"You said: {message}"
                await websocket.send_text(response)
        except fastapi.WebSocketDisconnect:
            print("Disconnected.")

Key Points

  • ⚠️ Development environments are not yet supported. You must promote your model to the production environment, e.g. truss push --promote.
  • The chain is available at this endpoint: wss://chain-<id>.api.baseten.co/v1/websocket.
  • WebSocket interactions in Chains must follow WebSocketProtocol (it is essentially the same as fastapi.WebSocket, except that you cannot accept the connection, because inside the Chainlet the connection has already been accepted).
  • No other arguments are allowed in run_remote() when using WebSockets.
  • The return type must be None (if you return data to the client, send it through the WebSocket itself).
  • WebSockets can only be used in the entrypoint, not in dependencies.
  • Unlike for Truss models, you do not need to explicitly set is_websocket_endpoint.

Invocation

Using websocat (get it), you can call the chain like this:

websocat -H="Authorization: Api-Key $BASETEN_API_KEY" \
   wss://chain-{CHAIN_ID}.api.baseten.co/v1/websocket
Hello  # Your input.
You said: Hello  # Direct response from Head Chainlet.
dep # Your next input.
Hello from dependency, Head.  # Response using Dependency Chainlet.
# ctrl+c to close connection.
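
Alternatively, here is a minimal Python client sketch that performs the same exchange. It assumes the third-party websockets package is installed and that CHAIN_ID and BASETEN_API_KEY are filled in as above; recent versions of the library take headers via additional_headers (older versions use extra_headers).

import asyncio
import os

import websockets  # third-party client library, assumed installed

async def main():
    url = "wss://chain-{CHAIN_ID}.api.baseten.co/v1/websocket"  # fill in your chain ID
    headers = {"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"}
    async with websockets.connect(url, additional_headers=headers) as ws:
        await ws.send("Hello")
        print(await ws.recv())  # -> "You said: Hello"
        await ws.send("dep")
        print(await ws.recv())  # -> "Hello from dependency, Head."

asyncio.run(main())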

WebSocket Usage in Truss Models

In Truss models, WebSockets replace the conventional request/response flow: a single websocket method handles all processing, and input/output communication goes through the WebSocket object (not arguments and return values). There are no separate preprocess, predict, and postprocess methods anymore, but you can still implement load.

Example Truss model.py

import fastapi

class Model:
    async def websocket(self, websocket: fastapi.WebSocket):
        try:
            while True:
                message = await websocket.receive_text()
                await websocket.send_text(f"WS obtained: {message}")
        except fastapi.WebSocketDisconnect:
            pass

Key Points

  • ⚠️ You must set runtime.is_websocket_endpoint=true in config.yaml when deploying a Truss model with a WebSocket.
...
runtime:
  is_websocket_endpoint: true
  • ⚠️ Development environments are not yet supported. You must promote your model to the production environment, e.g. truss push --promote.
  • The model is available at this endpoint: wss://model-<id>.api.baseten.co/v1/websocket.
  • Continuous message exchange occurs in a loop until client disconnection. You can also decide to close the connection server-side when a certain condition is reached (see the sketch after this list).
  • WebSockets enable bidirectional streaming, avoiding the need for multiple HTTP requests (or return values).
  • You must not implement any of the traditional methods predict, preprocess, postprocess.
  • The WebSocket object passed to the websocket method has already accepted the connection, so you must not call websocket.accept() on it. You may, however, close the connection at the end of your processing. If you don’t close it explicitly, it will be closed automatically after your websocket method exits.
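
For example, if the model (rather than the client) should end the session, a variant of the example above could look like this. The "bye" sentinel is an arbitrary convention chosen for this sketch, not part of any API.

import fastapi

class Model:
    async def websocket(self, websocket: fastapi.WebSocket):
        try:
            while True:
                message = await websocket.receive_text()
                if message == "bye":  # Illustrative sentinel chosen for this sketch.
                    await websocket.close()  # Server-side close once the condition is met.
                    return
                await websocket.send_text(f"WS obtained: {message}")
        except fastapi.WebSocketDisconnect:
            pass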

Invocation

Using websocat (get it), you can call the model like this:

websocat -H="Authorization: Api-Key $BASETEN_API_KEY" \
   wss://model-{MODEL_ID}.api.baseten.co/v1/websocket
Hello # Your input.
WS obtained: Hello # Echoed from model.
# ctrl+c to close connection.

Deployment and Concurrency Considerations

Deployments

WebSockets are currently only supported in the production environment (not in other named environments). For now, promoting models directly to production is the most reliable way to develop with WebSockets. We’re actively working to bring WebSocket support to parity with HTTP.

Scheduling

The default WebSocket scaling algorithm schedules new WebSocket connections to the least-utilized replica. Once every replica holds maxConcurrency - 1 concurrent WebSocket connections, the total number of replicas is incremented, up to the maxReplica setting.

Scale-down occurs when the number of replicas is greater than minReplica, and there are replicas with 0 concurrent connections. At this point, we begin scaling down idle replicas one by one.

Lifetime guarantees

WebSockets are guaranteed to last a minimum of 1 hour. In reality, a single WebSocket connection should be able to continue for much longer, but this is the guarantee that we provide in order to ensure that we can make changes to our system at a reasonable rate (including restarting and moving internal services as needed).

Concurrency changes

When scaling concurrency down, existing WebSockets will be allowed to continue until they complete, even if it means that a replica indefinitely has a greater number of ongoing connections than the max concurrency setting.

For instance, suppose:

  • You have a concurrency setting of 10, and currently have 10 websocket connections active on a replica.
  • Then, you change the concurrency setting to 5.

In this case, Baseten will not force any of the ongoing connections to close as a result of the concurrency change. They will be allowed to continue and close naturally (unless the 1 hour minimum has passed, and an internal restart is required).

Maximum message size

As a hard limit, we enforce a 100MiB maximum message size for any individual message sent over a websocket. This means that both clients and models are limited to 100MiB for each outgoing message, though there is no overall limit on the cumulative data that can be sent over a websocket.