WebSockets provide a persistent, full-duplex communication channel between clients and server-side models or chains. Full duplex means chunks of data can flow client→server and server→client simultaneously and repeatedly, without reopening the connection. Use cases include real-time audio transcription, AI phone calls, and agents with turn-based interactions. WebSockets are also a fit when you need server-side state: requests in the same session always route to the replica holding that state.Documentation Index
Fetch the complete documentation index at: https://docs.baseten.co/llms.txt
Use this file to discover all available pages before exploring further.
WebSockets in Truss models
A Truss WebSocket model implements a singlewebsocket method in place of the usual preprocess, predict, and postprocess methods. All input and output flows through the WebSocket object itself, not through arguments or return values. load still works as it does for HTTP models.
- Initialize your Truss:
truss init reference for full options.
- Replace the
predictmethod inmodel/model.pywith awebsocketmethod. For example:
- Set
runtime.transport.kind=websocketinconfig.yaml:
- Deploy the model:
truss push --watch.
For deployment options, see the truss push reference.
Constraints and behavior
- Message exchange runs in a loop until the client disconnects. To close the connection from the server, call
websocket.close(). - WebSockets support bidirectional streaming, so you don’t need multiple HTTP round-trips.
- Don’t implement
predict,preprocess, orpostprocess. Baseten doesn’t call them. - Baseten accepts the connection for you, so don’t call
websocket.accept(). You can close the connection yourself when you’re done; otherwise Baseten closes it after yourwebsocketmethod returns.
Call the model
Use websocat to call the model:The path depends on the environment or deployment you’re calling:
- Environment:
wss://model-{MODEL_ID}.api.baseten.co/environments/{ENVIRONMENT_NAME}/websocket - Deployment:
wss://model-{MODEL_ID}.api.baseten.co/deployment/{DEPLOYMENT_NAME}/websocket - Regional environment:
wss://model-{MODEL_ID}-{ENV_NAME}.api.baseten.co/websocket. See Regional environments.
WebSockets in Chains
Chains wrap WebSockets in a reducedWebSocketProtocol object. Processing happens in run_remote as usual, but inputs and outputs both flow through the WebSocket itself using async send_* and receive_* methods (text, bytes, and json variants). A convenience receive method handles both str and bytes.
Example chainlet
Constraints and behavior
- Your
run_remotesignature must useWebSocketProtocol. It mirrorsfastapi.WebSocket, except you can’t callaccept(). Baseten has already accepted the connection by the time your chainlet runs. run_remoteaccepts no other arguments when using WebSockets.- The return type must be
None. Send any data back to the client through the WebSocket instead. - WebSockets are only supported on the entrypoint chainlet, not on dependencies.
- Unlike Truss models, Chains don’t require you to set
runtime.transport.kind.
Call the chain
Use websocat to call the chain:Like models, chains accept WebSocket connections on either a deployment or environment path. For regional environments, use
wss://chain-{CHAIN_ID}-{ENV_NAME}.api.baseten.co/websocket. See Regional environments.See the WebSocket endpoint reference for full details.WebSockets with custom servers
Deploy a WebSocket server from a custom Docker image using thedocker_server configuration. This fits when you already have a WebSocket server packaged as a container, or when you need a runtime Baseten’s managed images don’t provide.
Configuration
Set the following inconfig.yaml:
config.yaml
Required fields
predict_endpoint: The WebSocket endpoint path on your server, for example/v1/websocketor/ws.runtime.transport.kind: Must be"websocket".start_command: Command that starts your WebSocket server.readiness_endpoint: HTTP path for readiness probes.liveness_endpoint: HTTP path for liveness probes.
Call the model
Use websocat to connect to your custom server:predict_endpoint path on your server.
For more on custom server deployment, see Custom servers.
Deployment and concurrency considerations
Scheduling
Baseten schedules new WebSocket connections onto the least-utilized replica until every replica holdsmaxConcurrency - 1 concurrent connections. At that point, Baseten adds replicas up to the maxReplica limit.
Baseten scales down when the replica count exceeds minReplica and at least one replica has zero connections. Idle replicas are removed one at a time.
Two factors matter more for WebSockets than for HTTP:
- Resource utilization: HTTP requests are stateless, so Baseten can rebalance them freely. WebSocket connections stay pinned to a replica for their lifetime and count against that replica’s concurrency target even when idle. Manage connection efficiency on the client side.
- Stateful complexity: WebSocket handlers often hold server-side state, which adds lifecycle work (disconnects, cleanup, reconnection logic).
Lifetime guarantees
Baseten guarantees every WebSocket connection lasts at least 1 hour. In practice, connections run much longer. The 1-hour floor exists so Baseten can restart and rebalance internal services without breaking long-lived sessions.Concurrency changes
LoweringmaxConcurrency doesn’t close existing connections. Open WebSockets keep running until they close naturally, even if a replica ends up above the new target.
For example, if a replica holds 10 active WebSockets and you change maxConcurrency from 10 to 5, Baseten leaves all 10 open. They drain naturally as clients disconnect, or when the 1-hour lifetime guarantee triggers an internal restart.
Promotion
You can promote a WebSocket model or chain to an environment through the REST API or UI, the same way you promote HTTP deployments. On promotion, Baseten routes new connections to the new deployment, but existing connections stay on the previous deployment until they terminate. This means older deployments can take longer to scale down than HTTP deployments: their connections outlive the promotion.Maximum message size
Baseten enforces a 100 MiB limit on individual messages sent over a WebSocket. Both clients and models are capped at 100 MiB per outgoing message. There’s no cap on the total data sent over a connection’s lifetime.Monitoring
WebSocket deployments expose the same performance metrics as HTTP deployments. The rest of this section covers the differences that matter: status codes reported on connection close, how connection duration is measured, and what counts toward input and output size.Inference volume
The Metrics page tracks inference volume as the number of connections per minute. Baseten publishes each data point after the connection closes, so every point carries the status the connection ended with. Two families of status codes appear for WebSocket deployments:- HTTP status codes for connections that failed before the WebSocket upgrade completed.
- WebSocket close codes for connections that completed the upgrade and later closed.
HTTP status codes
| Code | Label | What it means |
|---|---|---|
408 | Request timeout | The WebSocket upgrade request timed out before a replica accepted it. |
504 | Gateway timeout | No replica became available in time. Typically indicates a cold start that exceeded the configured timeout, or a saturated deployment. |
WebSocket close codes
RFC 6455 defines the full set of WebSocket close codes. The Metrics page surfaces this subset:| Code | Label | What it means |
|---|---|---|
1000 | Normal closure | Either side closed the connection cleanly. This is normal, expected traffic. |
1001 | Going away | One side is going away, for example a replica restarting or a browser navigating away. |
1002 | Protocol error | One side sent a frame that violates the WebSocket protocol. |
1003 | Unsupported data | One side received a frame type it cannot accept, for example binary data on a text-only endpoint. |
1005 | No status received | The connection closed without a status code. Reserved and not sent on the wire. |
1006 | Abnormal closure | The connection dropped without a close frame. Usually caused by a network failure or a replica crash. |
1007 | Invalid frame payload data | A message payload was inconsistent with its declared type, for example non-UTF-8 bytes in a text frame. |
1008 | Policy violation | One side closed the connection for a policy reason it did not want to publish. |
1009 | Message too big | A message exceeded the 100 MiB per-message limit. See Maximum message size. |
1010 | Mandatory extension missing | The client expected a WebSocket extension that the server did not negotiate. |
1011 | Internal error | The server side hit an unexpected error that forced the connection to close. Check your model logs. |
1012 | Service restart | The replica is restarting. |
1013 | Try again later | The replica is temporarily overloaded. |
1014 | Bad gateway | An upstream gateway returned an invalid response. |
1015 | TLS handshake failure | Reserved and not sent on the wire. |
The Metrics page filters out codes that don’t help with debugging model behavior, such as rate-limit responses and connections that never reached a terminal state. Grafana and other lower-level tools might show these codes anyway.
End-to-end connection duration
Duration is measured from when the connection opens to when it closes, and published after the connection ends. Reported at p50, p90, p95, and p99.Connection input and output size
Cumulative bytes transferred over the connection’s lifetime, reported at p50, p90, p95, and p99:- Connection input size: Bytes sent by the client to the server.
- Connection output size: Bytes sent by the server to the client.