WebSockets provide a persistent, full-duplex communication channel between clients and server-side models or chains. Full duplex means chunks of data can flow client→server and server→client simultaneously and repeatedly, without reopening the connection. Use cases include real-time audio transcription, AI phone calls, and agents with turn-based interactions. WebSockets are also a fit when you need server-side state: requests in the same session always route to the replica holding that state.

WebSockets in Truss models

A Truss WebSocket model implements a single websocket method in place of the usual preprocess, predict, and postprocess methods. All input and output flows through the WebSocket object itself, not through arguments or return values. load still works as it does for HTTP models.
  1. Initialize your Truss:
truss init websocket-model
See the truss init reference for full options.
  2. Replace the predict method in model/model.py with a websocket method. For example:
import fastapi

class Model:
    async def websocket(self, websocket: fastapi.WebSocket):
        try:
            while True:
                message = await websocket.receive_text()
                await websocket.send_text(f"WS obtained: {message}")
        except fastapi.WebSocketDisconnect:
            pass
  3. Set runtime.transport.kind=websocket in config.yaml:
...
runtime:
  transport:
    kind: websocket
  4. Deploy the model:
truss push
This creates a published deployment. For live-reload during development, use truss push --watch. For deployment options, see the truss push reference.

Constraints and behavior

  • Message exchange runs in a loop until the client disconnects. To close the connection from the server, call websocket.close().
  • WebSockets support bidirectional streaming, so you don’t need multiple HTTP round-trips.
  • Don’t implement predict, preprocess, or postprocess. Baseten doesn’t call them.
  • Baseten accepts the connection for you, so don’t call websocket.accept(). You can close the connection yourself when you’re done; otherwise Baseten closes it after your websocket method returns.

Call the model

Use websocat to call the model:
websocat -H="Authorization: Api-Key $BASETEN_API_KEY" \
   wss://model-{MODEL_ID}.api.baseten.co/environments/production/websocket
Hello # Your input.
WS obtained: Hello # Echoed from model.
# ctrl+c to close connection.
The path depends on the environment or deployment you’re calling:
  • Environment: wss://model-{MODEL_ID}.api.baseten.co/environments/{ENVIRONMENT_NAME}/websocket
  • Deployment: wss://model-{MODEL_ID}.api.baseten.co/deployment/{DEPLOYMENT_NAME}/websocket
  • Regional environment: wss://model-{MODEL_ID}-{ENV_NAME}.api.baseten.co/websocket. See Regional environments.
See the WebSocket endpoint reference for full details.
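If you build these URLs in client code, the three path shapes above can be captured in a small helper. This is a sketch only: the function names model_websocket_url and auth_headers are hypothetical, and abc123 is a placeholder model ID. The header format matches the websocat call shown earlier.

```python
from typing import Dict, Optional

API_HOST = "api.baseten.co"


def model_websocket_url(
    model_id: str,
    environment: str = "production",
    deployment: Optional[str] = None,
    regional: bool = False,
) -> str:
    """Build a model WebSocket URL (hypothetical helper, not part of any SDK)."""
    if deployment is not None:
        # Deployment path.
        return f"wss://model-{model_id}.{API_HOST}/deployment/{deployment}/websocket"
    if regional:
        # Regional environment: the environment name moves into the hostname.
        return f"wss://model-{model_id}-{environment}.{API_HOST}/websocket"
    # Environment path.
    return f"wss://model-{model_id}.{API_HOST}/environments/{environment}/websocket"


def auth_headers(api_key: str) -> Dict[str, str]:
    """Return the Authorization header used in the websocat example."""
    return {"Authorization": f"Api-Key {api_key}"}


print(model_websocket_url("abc123"))
# wss://model-abc123.api.baseten.co/environments/production/websocket
```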

WebSockets in Chains

Chains wrap WebSockets in a reduced WebSocketProtocol object. Processing happens in run_remote as usual, but inputs and outputs both flow through the WebSocket itself using async send_* and receive_* methods (text, bytes, and json variants). A convenience receive method handles both str and bytes.

Example chainlet

import fastapi
import truss_chains as chains

class Dependency(chains.ChainletBase):
    async def run_remote(self, name: str) -> str:
        return f"Hello from dependency, {name}."

@chains.mark_entrypoint
class WSEntrypoint(chains.ChainletBase):
    def __init__(self, dependency=chains.depends(Dependency)):
        self._dependency = dependency

    async def run_remote(self, websocket: chains.WebSocketProtocol) -> None:
        try:
            while True:
                message = await websocket.receive_text()
                if message == "dep":
                    response = await self._dependency.run_remote("WSEntrypoint")
                else:
                    response = f"You said: {message}"
                await websocket.send_text(response)
        except fastapi.WebSocketDisconnect:
            print("Disconnected.")

Constraints and behavior

  • Your run_remote signature must use WebSocketProtocol. It mirrors fastapi.WebSocket, except you can’t call accept(). Baseten has already accepted the connection by the time your chainlet runs.
  • run_remote accepts no other arguments when using WebSockets.
  • The return type must be None. Send any data back to the client through the WebSocket instead.
  • WebSockets are only supported on the entrypoint chainlet, not on dependencies.
  • Unlike Truss models, Chains don’t require you to set runtime.transport.kind.

Call the chain

Use websocat to call the chain:
websocat -H="Authorization: Api-Key $BASETEN_API_KEY" \
   wss://chain-{CHAIN_ID}.api.baseten.co/environments/production/websocket
Like models, chains accept WebSocket connections on either a deployment or environment path. For regional environments, use wss://chain-{CHAIN_ID}-{ENV_NAME}.api.baseten.co/websocket. See Regional environments.
See the WebSocket endpoint reference for full details.

WebSockets with custom servers

Deploy a WebSocket server from a custom Docker image using the docker_server configuration. This fits when you already have a WebSocket server packaged as a container, or when you need a runtime Baseten’s managed images don’t provide.

Configuration

Set the following in config.yaml:
base_image:
  image: bryanzhang2/custom_ws:v0.0.4
docker_server:
  start_command: /app/server
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/websocket
  server_port: 8081
model_name: custom_ws
runtime:
  transport:
    kind: "websocket"

Required fields

  • predict_endpoint: The WebSocket endpoint path on your server, for example /v1/websocket or /ws.
  • runtime.transport.kind: Must be "websocket".
  • start_command: Command that starts your WebSocket server.
  • readiness_endpoint: HTTP path for readiness probes.
  • liveness_endpoint: HTTP path for liveness probes.

Call the model

Use websocat to connect to your custom server:
websocat -H="Authorization: Api-Key $BASETEN_API_KEY" \
   wss://model-{MODEL_ID}.api.baseten.co/environments/production/websocket
Baseten routes the connection to the predict_endpoint path on your server.
For more on custom server deployment, see Custom servers.

Deployment and concurrency considerations

Scheduling

Baseten schedules new WebSocket connections onto the least-utilized replica until every replica holds maxConcurrency - 1 concurrent connections. At that point, Baseten adds replicas up to the maxReplica limit. Baseten scales down when the replica count exceeds minReplica and at least one replica has zero connections. Idle replicas are removed one at a time.

Two factors matter more for WebSockets than for HTTP:
  • Resource utilization: HTTP requests are stateless, so Baseten can rebalance them freely. WebSocket connections stay pinned to a replica for their lifetime and count against that replica’s concurrency target even when idle. Manage connection efficiency on the client side.
  • Stateful complexity: WebSocket handlers often hold server-side state, which adds lifecycle work (disconnects, cleanup, reconnection logic).
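To make the scheduling rules concrete, here is a toy model of them. It is not Baseten's scheduler: the class ToyDeployment and its methods are hypothetical, scale-up is assumed to happen when the next connection arrives after every replica has reached maxConcurrency - 1, and cold starts, queueing, and rejection at full capacity are not modeled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyDeployment:
    max_concurrency: int
    min_replica: int
    max_replica: int
    # Number of open connections on each replica.
    replicas: List[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.replicas:
            self.replicas = [0] * self.min_replica

    def connect(self) -> int:
        """Place one new connection; return the index of the chosen replica."""
        threshold = self.max_concurrency - 1
        every_replica_busy = all(count >= threshold for count in self.replicas)
        if every_replica_busy and len(self.replicas) < self.max_replica:
            self.replicas.append(0)  # Scale up by one replica.
        # Pick the least-utilized replica (lowest index wins ties).
        index = min(range(len(self.replicas)), key=self.replicas.__getitem__)
        self.replicas[index] += 1
        return index

    def disconnect(self, index: int) -> None:
        self.replicas[index] -= 1

    def scale_down_once(self) -> bool:
        """Remove at most one idle replica, never going below min_replica."""
        if len(self.replicas) > self.min_replica and 0 in self.replicas:
            self.replicas.remove(0)
            return True
        return False


deployment = ToyDeployment(max_concurrency=3, min_replica=1, max_replica=2)
placements = [deployment.connect() for _ in range(4)]
print(placements, deployment.replicas)  # [0, 0, 1, 1] [2, 2]
```

In this run the first replica fills to maxConcurrency - 1 = 2 connections before the second replica is added, and the next two connections land on the new, less-utilized replica.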

Lifetime guarantees

Baseten guarantees every WebSocket connection lasts at least 1 hour. In practice, connections run much longer. The 1-hour floor exists so Baseten can restart and rebalance internal services without breaking long-lived sessions.

Concurrency changes

Lowering maxConcurrency doesn’t close existing connections. Open WebSockets keep running until they close naturally, even if a replica ends up above the new target. For example, if a replica holds 10 active WebSockets and you change maxConcurrency from 10 to 5, Baseten leaves all 10 open. They drain as clients disconnect, or when an internal restart closes them, which can happen only after the guaranteed 1-hour minimum lifetime.

Promotion

You can promote a WebSocket model or chain to an environment through the REST API or UI, the same way you promote HTTP deployments. On promotion, Baseten routes new connections to the new deployment, but existing connections stay on the previous deployment until they terminate. This means older deployments can take longer to scale down than HTTP deployments: their connections outlive the promotion.

Maximum message size

Baseten enforces a 100 MiB limit on individual messages sent over a WebSocket. Both clients and models are capped at 100 MiB per outgoing message. There’s no cap on the total data sent over a connection’s lifetime.
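One way to stay under the per-message cap is to split large payloads into several messages. The helper below is a sketch: iter_chunks is a hypothetical name, the receiving side needs its own application-level logic to reassemble the pieces, and treating a message of exactly 100 MiB as allowed is an assumption. Pass a smaller max_bytes if you want headroom.

```python
from typing import Iterator

MAX_MESSAGE_BYTES = 100 * 1024 * 1024  # 100 MiB per-message limit.


def iter_chunks(payload: bytes, max_bytes: int = MAX_MESSAGE_BYTES) -> Iterator[bytes]:
    """Yield consecutive slices of payload, each at most max_bytes long."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    view = memoryview(payload)  # Avoids copying the whole payload per slice.
    for start in range(0, len(view), max_bytes):
        yield bytes(view[start:start + max_bytes])


# Inside a websocket handler, each chunk would be sent as its own message:
#     for chunk in iter_chunks(audio_bytes):
#         await websocket.send_bytes(chunk)
print(list(iter_chunks(b"abcdefg", max_bytes=3)))  # [b'abc', b'def', b'g']
```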

Monitoring

WebSocket deployments expose the same performance metrics as HTTP deployments. The rest of this section covers the differences that matter: status codes reported on connection close, how connection duration is measured, and what counts toward input and output size.

Inference volume

The Metrics page tracks inference volume as the number of connections per minute. Baseten publishes each data point after the connection closes, so every point carries the status the connection ended with. Two families of status codes appear for WebSocket deployments:
  • HTTP status codes for connections that failed before the WebSocket upgrade completed.
  • WebSocket close codes for connections that completed the upgrade and later closed.

HTTP status codes

Code | Label | What it means
408 | Request timeout | The WebSocket upgrade request timed out before a replica accepted it.
504 | Gateway timeout | No replica became available in time. Typically indicates a cold start that exceeded the configured timeout, or a saturated deployment.

WebSocket close codes

RFC 6455 defines the full set of WebSocket close codes. The Metrics page surfaces this subset:
Code | Label | What it means
1000 | Normal closure | Either side closed the connection cleanly. This is normal, expected traffic.
1001 | Going away | One side is going away, for example a replica restarting or a browser navigating away.
1002 | Protocol error | One side sent a frame that violates the WebSocket protocol.
1003 | Unsupported data | One side received a frame type it cannot accept, for example binary data on a text-only endpoint.
1005 | No status received | The connection closed without a status code. Reserved and not sent on the wire.
1006 | Abnormal closure | The connection dropped without a close frame. Usually caused by a network failure or a replica crash.
1007 | Invalid frame payload data | A message payload was inconsistent with its declared type, for example non-UTF-8 bytes in a text frame.
1008 | Policy violation | One side closed the connection for a policy reason it did not want to publish.
1009 | Message too big | A message exceeded the 100 MiB per-message limit. See Maximum message size.
1010 | Mandatory extension missing | The client expected a WebSocket extension that the server did not negotiate.
1011 | Internal error | The server side hit an unexpected error that forced the connection to close. Check your model logs.
1012 | Service restart | The replica is restarting.
1013 | Try again later | The replica is temporarily overloaded.
1014 | Bad gateway | An upstream gateway returned an invalid response.
1015 | TLS handshake failure | Reserved and not sent on the wire.
For the full specification, see CloseEvent codes on MDN.
The Metrics page filters out codes that don’t help with debugging model behavior, such as rate-limit responses and connections that never reached a terminal state. Grafana and other lower-level tools might show these codes anyway.
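When you process these statuses in your own tooling, a lookup that separates the two families can help. The sketch below copies the codes and labels from the tables above; the function name describe_status and the wording of its output are hypothetical.

```python
# HTTP statuses: the connection failed before the WebSocket upgrade completed.
HTTP_STATUS_LABELS = {
    408: "Request timeout",
    504: "Gateway timeout",
}

# WebSocket close codes: the upgrade completed and the connection later closed.
CLOSE_CODE_LABELS = {
    1000: "Normal closure",
    1001: "Going away",
    1002: "Protocol error",
    1003: "Unsupported data",
    1005: "No status received",
    1006: "Abnormal closure",
    1007: "Invalid frame payload data",
    1008: "Policy violation",
    1009: "Message too big",
    1010: "Mandatory extension missing",
    1011: "Internal error",
    1012: "Service restart",
    1013: "Try again later",
    1014: "Bad gateway",
    1015: "TLS handshake failure",
}


def describe_status(code: int) -> str:
    """Return a one-line description of a status shown on the Metrics page."""
    if code in HTTP_STATUS_LABELS:
        return f"HTTP {code} ({HTTP_STATUS_LABELS[code]}): failed before upgrade"
    if code in CLOSE_CODE_LABELS:
        return f"WebSocket {code} ({CLOSE_CODE_LABELS[code]}): closed after upgrade"
    return f"{code}: not listed in the Metrics page tables"


print(describe_status(1009))  # WebSocket 1009 (Message too big): closed after upgrade
```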

End-to-end connection duration

Duration is measured from when the connection opens to when it closes, and published after the connection ends. Baseten reports it at p50, p90, p95, and p99.

Connection input and output size

Cumulative bytes transferred over the connection’s lifetime, reported at p50, p90, p95, and p99:
  • Connection input size: Bytes sent by the client to the server.
  • Connection output size: Bytes sent by the server to the client.
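As a reminder of how to read these percentiles, the sketch below computes them for a small set of made-up per-connection values. It uses the nearest-rank method, which is an assumption: the page doesn't state which percentile method the Metrics page applies. The function names and sample numbers are hypothetical.

```python
import math
from typing import Dict, Sequence


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Return the four percentiles reported for durations and sizes."""
    return {f"p{p}": percentile(values, p) for p in (50, 90, 95, 99)}


# Hypothetical connection durations in seconds, one value per closed connection.
durations = [10, 20, 30, 40, 50]
print(summarize(durations))  # {'p50': 30, 'p90': 50, 'p95': 50, 'p99': 50}
```

With few connections, the upper percentiles collapse onto the longest connection, so a single long-lived session can dominate p90 and above.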