WebSockets
Enable real-time, streaming, bidirectional communication using WebSockets for Truss models and Chainlets.
Overview
WebSockets provide a persistent, full-duplex communication channel between clients and server-side models or chains. Full duplex means that chunks of data can be sent client→server and server→client simultaneously and repeatedly.
This guide covers how to implement WebSocket-based interactions for Truss models and Chains/Chainlets.
Unlike traditional request-response models, WebSockets allow continuous data exchange without reopening connections. This is useful for real-time applications, streaming responses, and maintaining lightweight interactions. Example applications include real-time audio transcription, AI phone calls, and agents with turn-based interactions.
WebSocket Usage in Chains/Chainlets
For Chains, WebSockets are wrapped in a reduced API object, WebSocketProtocol. All processing happens in the run_remote method as usual, but inputs and outputs (or “return values”) are sent through the WebSocket object using the async send_{X} and receive_{X} methods (there are variants for text, bytes, and json).
Implementation Example
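The receive/send loop described above can be sketched without any Baseten dependencies. This is a minimal illustration, not the official example: `WebSocketLike`, `ClientDisconnected`, `FakeWebSocket`, and the plain `Head` class are hypothetical stand-ins so the snippet runs locally. In a real Chain, `run_remote` would live on the entrypoint Chainlet and receive the `WebSocketProtocol` object described in this section.

```python
import asyncio
from typing import List, Protocol


class WebSocketLike(Protocol):
    """Hypothetical stand-in for the part of WebSocketProtocol used here."""

    async def receive_text(self) -> str: ...

    async def send_text(self, data: str) -> None: ...


class ClientDisconnected(Exception):
    """Hypothetical stand-in for the error raised when the client goes away."""


class Head:
    """Plays the role of the entrypoint; a real one would be a Chainlet."""

    async def run_remote(self, websocket: WebSocketLike) -> None:
        # The only argument is the WebSocket and the return type is None:
        # every result goes back to the client through the socket itself.
        try:
            while True:
                text = await websocket.receive_text()
                await websocket.send_text(f"You said: {text}")
        except ClientDisconnected:
            pass  # Client closed the connection; nothing to return.


class FakeWebSocket:
    """In-memory socket used only to exercise the loop locally."""

    def __init__(self, incoming: List[str]) -> None:
        self._incoming = list(incoming)
        self.sent: List[str] = []

    async def receive_text(self) -> str:
        if not self._incoming:
            raise ClientDisconnected()
        return self._incoming.pop(0)

    async def send_text(self, data: str) -> None:
        self.sent.append(data)


def demo() -> List[str]:
    ws = FakeWebSocket(["hello", "world"])
    asyncio.run(Head().run_remote(ws))
    return ws.sent
```

Calling `demo()` feeds two messages through the loop and returns the replies that were sent. The same shape works with the bytes and json variants by swapping the method pair.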
Key Points
- ⚠️ Development environments are not yet supported. You must promote your model to the production environment, e.g. truss push --promote.
- The model is available at this endpoint: wss://model-<id>.api.baseten.co/v1/websocket.
- WebSocket interactions in Chains must follow WebSocketProtocol (it is essentially the same as fastapi.WebSocket, but you cannot accept the connection, because inside the Chainlet the connection has already been accepted).
- No other arguments are allowed in run_remote() when using WebSockets.
- The return type must be None (if you return data to the client, send it through the WebSocket itself).
- WebSockets can only be used in the entrypoint, not in dependencies.
- Unlike for Truss models, you do not need to explicitly set is_websocket_endpoint.
Invocation
Using websocat, a command-line WebSocket client, you can call the chain at the endpoint listed above.
WebSocket Usage in Truss Models
In Truss models, WebSockets replace the conventional request/response flow: a single websocket method handles all processing, and input/output communication goes through the WebSocket object (not arguments and return values). There are no separate preprocess, predict, and postprocess methods anymore, but you can still implement load.
Example Truss model.py
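A minimal sketch of such a model follows, assuming the usual Truss layout of a `Model` class with `load` plus the single `websocket` method described above. `ClientDisconnected` and `FakeWebSocket` are hypothetical stand-ins so the snippet runs locally; in a deployed model the object passed in behaves like `fastapi.WebSocket`, and a client disconnect would surface as `fastapi.WebSocketDisconnect`. The "bye" message that triggers a server-side close is an illustrative convention, not part of any API.

```python
import asyncio
from typing import List, Tuple


class ClientDisconnected(Exception):
    """Hypothetical stand-in for fastapi.WebSocketDisconnect."""


class Model:
    def __init__(self, **kwargs) -> None:
        self._suffix = ""

    def load(self) -> None:
        # One-time setup still belongs in load(); a real model would
        # load weights or other resources here.
        self._suffix = " pong"

    async def websocket(self, websocket) -> None:
        # The connection is already accepted: do not call accept().
        try:
            while True:
                text = await websocket.receive_text()
                if text == "bye":
                    # Server-side close once a condition is reached.
                    await websocket.close()
                    return
                await websocket.send_text(text + self._suffix)
        except ClientDisconnected:
            pass  # Client went away; the method simply returns.


class FakeWebSocket:
    """In-memory socket used only to exercise the model locally."""

    def __init__(self, incoming: List[str]) -> None:
        self._incoming = list(incoming)
        self.sent: List[str] = []
        self.closed = False

    async def receive_text(self) -> str:
        if not self._incoming:
            raise ClientDisconnected()
        return self._incoming.pop(0)

    async def send_text(self, data: str) -> None:
        self.sent.append(data)

    async def close(self) -> None:
        self.closed = True


def run_session(messages: List[str]) -> Tuple[List[str], bool]:
    model = Model()
    model.load()
    ws = FakeWebSocket(messages)
    asyncio.run(model.websocket(ws))
    return ws.sent, ws.closed
```

`run_session` returns the replies sent and whether the model closed the connection itself, which makes the two exit paths (client disconnect versus server-side close) easy to compare.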
Key Points
- ⚠️ You must set runtime.is_websocket_endpoint=true in config.yaml when deploying a Truss model with a WebSocket.
- ⚠️ Development environments are not yet supported. You must promote your model to the production environment, e.g. truss push --promote.
- The model is available at this endpoint: wss://model-<id>.api.baseten.co/v1/websocket.
- Continuous message exchange occurs in a loop until the client disconnects. You can also close the connection server-side when a certain condition is reached.
- WebSockets enable bidirectional streaming, avoiding the need for multiple HTTP requests (or return values).
- You must not implement any of the traditional methods predict, preprocess, or postprocess.
- The WebSocket object passed to the websocket method has already accepted the connection, so you must not call websocket.accept() on it. You may, however, close the connection at the end of your processing. If you don’t close it explicitly, it will be closed after your websocket method exits.
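The runtime.is_websocket_endpoint setting from the first key point is a dotted path, which suggests a nested key in config.yaml. A minimal fragment, assuming the rest of the file stays as generated for your model:

```yaml
# config.yaml (fragment): marks this Truss model as a WebSocket endpoint
runtime:
  is_websocket_endpoint: true
```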
Invocation
Using websocat, a command-line WebSocket client, you can call the model at the endpoint listed above.
Deployment and Concurrency Considerations
Deployments
WebSockets are currently only supported in the production environment (not in other named environments). For now, promoting models directly to production is the most reliable way to develop with WebSockets. We’re actively working to get WebSockets to have parity with HTTP.
Scheduling
The default WebSocket scaling algorithm will schedule new WebSocket connections to the least-utilized replica until all replicas are at maxConcurrency - 1 concurrent WebSocket connections, at which point the total number of replicas will be incremented, until the maxReplica setting is hit.
Scale-down occurs when the number of replicas is greater than minReplica and there are replicas with 0 concurrent connections. At this point, we begin scaling down idle replicas one by one.
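The scheduling rules above can be modeled in a few lines. This is a toy simulation for building intuition, not Baseten's implementation: `schedule_connection` and `scale_down_one` are hypothetical names, replicas are assumed to start instantly, the triggering connection is assumed to land on the least-loaded replica after any scale-up, and the behavior once maxReplica is reached and every replica is full is not described in this guide (the sketch simply keeps using the least-loaded replica).

```python
from typing import List


def schedule_connection(
    replicas: List[int], max_concurrency: int, max_replicas: int
) -> int:
    """Place one new connection and return the index of the chosen replica.

    `replicas` holds the current connection count of each replica and is
    updated in place.
    """
    threshold = max_concurrency - 1
    # Scale up once every replica has reached maxConcurrency - 1,
    # as long as the replica cap has not been hit.
    if all(load >= threshold for load in replicas) and len(replicas) < max_replicas:
        replicas.append(0)
    # Route to the least-utilized replica (ties go to the lowest index).
    target = min(range(len(replicas)), key=lambda i: replicas[i])
    replicas[target] += 1
    return target


def scale_down_one(replicas: List[int], min_replicas: int) -> bool:
    """Remove a single idle replica if the count exceeds min_replicas."""
    if len(replicas) > min_replicas and 0 in replicas:
        replicas.remove(0)
        return True
    return False
```

For example, starting from `replicas = [0]` with `max_concurrency=3` and `max_replicas=2`, the first two connections fill the first replica to 2, and the third triggers a second replica and lands there.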
Lifetime guarantees
WebSockets are guaranteed to last a minimum of 1 hour. In reality, a single WebSocket connection should be able to continue for much longer, but this is the guarantee that we provide in order to ensure that we can make changes to our system at a reasonable rate (including restarting and moving internal services as needed).
Concurrency changes
When scaling concurrency down, existing WebSockets will be allowed to continue until they complete, even if it means that a replica indefinitely has a greater number of ongoing connections than the max concurrency setting.
For instance, suppose:
- You have a concurrency setting of 10, and currently have 10 websocket connections active on a replica.
- Then, you change the concurrency setting to 5.
In this case, Baseten will not force any of the ongoing connections to close as a result of the concurrency change. They will be allowed to continue and close naturally (unless the 1-hour minimum has passed and an internal restart is required).
Maximum message size
As a hard limit, we enforce a 100 MiB maximum message size for any individual message sent over a WebSocket. This means that both clients and models are limited to 100 MiB for each outgoing message, though there is no overall limit on the cumulative data that can be sent over a WebSocket.
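Because the limit applies per message rather than per connection, a payload larger than 100 MiB can be sent as several messages. The helper below is a sketch of one way to do that; `split_into_messages` and `MAX_MESSAGE_BYTES` are hypothetical names, and how the receiver knows where a payload ends is an application-level choice this guide does not prescribe.

```python
from typing import Iterator

# Per-message limit stated in this guide: 100 MiB.
MAX_MESSAGE_BYTES = 100 * 1024 * 1024


def split_into_messages(
    payload: bytes, limit: int = MAX_MESSAGE_BYTES
) -> Iterator[bytes]:
    """Yield consecutive slices of `payload`, each at most `limit` bytes."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    for start in range(0, len(payload), limit):
        yield payload[start:start + limit]
```

On the sending side, each slice would go out as its own message, for example with `await websocket.send_bytes(part)` inside a loop over `split_into_messages(data)`. The receiver concatenates the parts in arrival order.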