Overview
WebSockets provide a persistent, full-duplex communication channel between clients and server-side models or chains. Full duplex means that chunks of data can be sent client-to-server and server-to-client simultaneously and repeatedly. This guide covers how to implement WebSocket-based interactions for Truss models and Chains/Chainlets.

Unlike traditional request-response models, WebSockets allow continuous data exchange without reopening connections. This is useful for real-time applications, streaming responses, and maintaining lightweight interactions. Example applications include real-time audio transcription, AI phone calls, or agents with turn-based interactions. WebSockets are also useful when you want to manage some state on the server side and need requests that are part of the same "session" to always be routed to the replica that maintains that state.

WebSocket Usage in Truss Models
In Truss models, WebSockets replace the conventional request/response flow: a single `websocket` method handles all processing, and input/output communication goes through the WebSocket object (not arguments and return values). There are no separate `preprocess`, `predict`, and `postprocess` methods anymore, but you can still implement `load`.
- Initialize your Truss (for example, with `truss init`).
- Replace the `predict` method with a `websocket` method in `model/model.py` (see the model sketch below this list).
- Set `runtime.transport.kind=websocket` in `config.yaml` (see the config snippet below this list).
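A minimal sketch of what `model/model.py` can look like: the `websocket` method name, the already-accepted connection, and the optional `load` method come from this guide, while the echo-style handler logic and the `fastapi.WebSocket` annotation are illustrative assumptions.

```python
# model/model.py -- minimal sketch; the per-message logic is illustrative only.
import fastapi


class Model:
    def load(self):
        # Optional: load weights or other resources here, as in any Truss model.
        pass

    async def websocket(self, websocket: fastapi.WebSocket) -> None:
        # The connection is already accepted -- do not call websocket.accept().
        try:
            while True:
                text = await websocket.receive_text()
                # Replace this with your actual per-message processing.
                await websocket.send_text(text.upper())
        except fastapi.WebSocketDisconnect:
            pass  # The client disconnected; exiting the method closes the connection.
```

The `config.yaml` setting from the last step, written as nested YAML (a direct translation of the dotted path `runtime.transport.kind=websocket`):

```yaml
runtime:
  transport:
    kind: websocket
```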
Key Points
- Continuous message exchange occurs in a loop until client disconnection. You can also decide to close the connection server-side if a certain condition is reached.
  - This is done by calling `websocket.close()`.
- WebSockets enable bidirectional streaming, avoiding the need for multiple HTTP requests (or return values).
- You must not implement any of the traditional methods `predict`, `preprocess`, or `postprocess`.
- The WebSocket object passed to the `websocket` method has already accepted the connection, so you must not call `websocket.accept()` on it. You may close the connection at the end of your processing; if you don't close it explicitly, it will be closed after exiting your `websocket` method.
Invocation
Using `websocat` (get it), you can call the model like this:
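For example, to connect to an environment (the placeholder values, the `production` environment name, and the exact flag usage are illustrative; `Authorization: Api-Key ...` is Baseten's standard API-key header):

```bash
export MODEL_ID=...          # your model ID
export BASETEN_API_KEY=...   # your Baseten API key

websocat -H "Authorization: Api-Key $BASETEN_API_KEY" \
  "wss://model-$MODEL_ID.api.baseten.co/environments/production/websocket"
```

Text you type is sent to the model as a message, and messages from the model are printed to the terminal.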
The path you use depends on which environment or deployment of the model you'd like to call.
- Environment: `wss://model-{MODEL_ID}.api.baseten.co/environments/{ENVIRONMENT_NAME}/websocket`
- Deployment: `wss://model-{MODEL_ID}.api.baseten.co/deployment/{DEPLOYMENT_NAME}/websocket`
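If you prefer a programmatic client, here is a sketch using the third-party `websockets` package (an assumption; any WebSocket client works). Note that the header argument is named `additional_headers` in `websockets` 14+ and `extra_headers` in older releases:

```python
# Minimal Python client sketch using the `websockets` package (pip install websockets).
import asyncio
import os

import websockets


async def main() -> None:
    url = (
        f"wss://model-{os.environ['MODEL_ID']}.api.baseten.co"
        "/environments/production/websocket"
    )
    headers = {"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"}
    async with websockets.connect(url, additional_headers=headers) as ws:
        await ws.send("hello")  # send a text message to the model
        print(await ws.recv())  # print the model's reply


asyncio.run(main())
```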
WebSocket Usage in Chains/Chainlets
For Chains, WebSockets are wrapped in a reduced API object, `WebSocketProtocol`. All processing happens in the `run_remote` method as usual, but inputs as well as outputs (or "return values") are sent through the WebSocket object using async `send_*` and `receive_*` methods (there are variants for text, bytes, and JSON).
Implementation Example
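A minimal sketch of an entrypoint Chainlet: the `run_remote` signature, the `WebSocketProtocol` type, and the `send_*`/`receive_*` methods come from this guide, while the `truss_chains` import alias, the entrypoint decorator, and the echo logic are illustrative assumptions.

```python
# Minimal entrypoint Chainlet sketch; the per-message logic is illustrative only.
import truss_chains as chains


@chains.mark_entrypoint
class Head(chains.ChainletBase):
    async def run_remote(self, websocket: chains.WebSocketProtocol) -> None:
        # No other arguments are allowed and the return type must be None:
        # all inputs and outputs flow through the WebSocket object.
        while True:
            text = await websocket.receive_text()
            if text == "exit":
                # Optionally end the session on a sentinel message from the client.
                break
            await websocket.send_text(text.upper())
```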
Key Points
- WebSocket interactions in Chains must follow `WebSocketProtocol` (it is essentially the same as `fastapi.WebSocket`, but you cannot accept the connection, because inside the Chainlet the connection is already accepted).
- No other arguments are allowed in `run_remote()` when using WebSockets.
- The return type must be `None` (if you return data to the client, send it through the WebSocket itself).
- WebSockets can only be used in the entrypoint, not in dependencies.
- Unlike for Truss models, you do not need to explicitly set `runtime.transport.kind`.
Invocation
Using `websocat` (get it), you can call the chain like this:
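Assuming the chain endpoint follows the same host and path pattern as models (a chain-scoped host plus an environment or deployment path ending in `/websocket`; see the Reference below for the authoritative URLs), a call might look like:

```bash
export CHAIN_ID=...          # your chain ID
export BASETEN_API_KEY=...   # your Baseten API key

websocat -H "Authorization: Api-Key $BASETEN_API_KEY" \
  "wss://chain-$CHAIN_ID.api.baseten.co/environments/production/websocket"
```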
Similarly to models, WebSocket chains can also be invoked either via deployment or environment. See the Reference for the full details.
Deployment and Concurrency Considerations
Scheduling
The WebSocket scaling algorithm will schedule new WebSocket connections to the least-utilized replica until all replicas are at `maxConcurrency - 1` concurrent WebSocket connections, at which point the total number of replicas will be incremented, until the `maxReplica` setting is hit.
Scale-down occurs when the number of replicas is greater than `minReplica` and there are replicas with 0 concurrent connections. At this point, we begin scaling down idle replicas one-by-one.
Some other scheduling factors to consider when using WebSockets:
- Resource utilization: Standard HTTP requests are stateless and allow Baseten to optimize replica utilization and autoscaling. With WebSockets, long-lived connections are tied to specific replicas and count against your concurrency targets, even if underutilized. It's your responsibility to manage connection efficiency.
- Stateful complexity: WebSocket handlers often assume server-side state. This adds complexity around connection lifecycle management (e.g., handling unexpected disconnects, cleanup, reconnection logic).
Lifetime guarantees
WebSockets are guaranteed to last a minimum of 1 hour. In reality, a single WebSocket connection should be able to continue for much longer, but this is the guarantee that we provide in order to ensure that we can make changes to our system at a reasonable rate (including restarting and moving internal services as needed).

Concurrency changes
When scaling concurrency down, existing WebSockets will be allowed to continue until they complete, even if it means that a replica indefinitely has a greater number of ongoing connections than the max concurrency setting. For instance, suppose:
- You have a concurrency setting of 10, and currently have 10 WebSocket connections active on a replica.
- Then, you change the concurrency setting to 5.
In this case, the 10 existing connections are allowed to continue, even though the replica now exceeds the new concurrency setting of 5.
Promotion
Just like with HTTP, you can promote a WebSocket model or chain to an environment via the REST API or UI. When promoting a WebSocket model or chain, new connections will be routed to the new deployment, but existing connections will remain connected to the current deployment until they terminate. Depending on the length of the connection, this could result in old deployments taking longer to scale down than HTTP deployments.

Maximum message size
As a hard limit, we enforce a 100MiB maximum message size for any individual message sent over a WebSocket. This means that both clients and models are limited to 100MiB for each outgoing message, though there is no overall limit on the cumulative data that can be sent over a WebSocket.

Monitoring
Just like with HTTP deployments, we offer metrics on the performance of WebSocket deployments.

Inference volume
Inference volume is tracked as the number of connections per minute. These metrics are published after the connection is closed, so they include the status that the connection was closed with. See WebSocket connection close codes for a full list.

End-to-end connection duration
Measured at different percentiles (p50, p90, p95, p99): end-to-end connection duration is tracked as the duration of the connection. Just like connections/minute, this is tracked after connections are closed.

Connection input & output size
Measured at different percentiles (p50, p90, p95, p99):
- Connection input size: Bytes sent by the client to the server for the duration of the connection.
- Connection output size: Bytes sent by the server to the client for the duration of the connection.