POST /predict with arbitrary JSON, and your predict method can return a single JSON response or a generator that streams output as it’s produced. You can also expose OpenAI- and Anthropic-style /v1 endpoints by implementing the matching methods, and access the raw request object when you need to customize deserialization or cancel long-running predictions.
Streaming
Streaming returns results as they’re generated instead of waiting for the full response, which cuts wait time for generative models.- Faster response time: Get initial results in under 1 second instead of waiting 10 or more seconds.
- Improved user experience: Partial outputs are immediately usable.
predict that yields chunks as they’re produced. The following sections walk through deploying Falcon 7B with streaming enabled.
Initialize Truss
Implement the model without streaming
This first version loads the Falcon 7B model without streaming:model/model.py
Add streaming support
To enable streaming:- Use
TextIteratorStreamerto stream tokens as they’re generated. - Run
generate()in a separate thread to prevent blocking. - Return a generator that streams results.
model/model.py
Configure config.yaml
config.yaml
Deploy and invoke
Deploy the model:/v1 endpoints
Custom Truss models normally servePOST /predict with arbitrary JSON. To also support additional HTTP routes, define the matching methods on your Model class. Use these methods when you want custom Python logic but still want clients to call your model through the server’s built-in HTTP endpoints.
If you deploy a custom Docker container, Baseten can forward requests to any route exposed by the underlying server. See Custom Docker containers.
Which method to implement
| Method | Endpoint | Use it for |
|---|---|---|
chat_completions | /v1/chat/completions | Chat-style payloads with a messages array. |
completions | /v1/completions | Prompt-style payloads with a prompt field. |
embeddings | /v1/embeddings | Embedding requests from text or token inputs. |
messages | /v1/messages | Server-specific message payloads exposed by your deployment. |
responses | /v1/responses | Server-specific response payloads exposed by your deployment. |
API families
| Endpoint | Family |
|---|---|
/v1/chat/completions | OpenAI-style chat completions |
/v1/completions | OpenAI-style text completions |
/v1/embeddings | OpenAI-style embeddings |
/v1/responses | OpenAI-style responses |
/v1/messages | Anthropic-style messages |
chat_completions
Implementchat_completions when your model should accept chat requests.
model/model.py
model_input typically includes fields like:
messagesmodelstream- sampling parameters such as
temperatureandmax_tokens
predict method that handles the same payload shape, chat_completions can simply delegate to it.
completions
Implementcompletions when your model should accept prompt-style completion requests.
model/model.py
completions for workloads such as autocomplete, prompt continuation, or fine-tuned models that are designed to extend text instead of following chat-style instructions.
embeddings, messages, and responses
Implementembeddings, messages, or responses when your deployment should expose those HTTP endpoints from custom model code.
model/model.py
/v1/* route, so your implementation can return whatever JSON shape that endpoint expects.
messages maps to the Anthropic-style /v1/messages route. embeddings and responses map to OpenAI-style /v1/embeddings and /v1/responses routes.
Request and response expectations
- These methods receive the parsed JSON payload as
model_input. - If you include a second argument annotated as
fastapi.Request, you can inspect disconnects or request metadata just like inpredict. See Request handling. - Return JSON that matches the endpoint you expose. Baseten does not automatically convert an arbitrary
predictresponse into a different response object for custom model code.
Endpoint paths
When these methods are defined, your deployment serves the matching HTTP routes in addition to/predict.
{env} with production. For development deployments, use development.
Request handling
Truss extracts and validates payloads for you. Access the raw request object when you need to:- Customize payload deserialization, for example binary protocol buffers.
- Handle disconnections and cancel long-running predictions.
Use request objects in Truss
You can define request objects inpreprocess, predict, and postprocess:
Rules for using requests
- The request must be type-annotated as
fastapi.Request. - If you use only the request, Truss skips payload extraction for better performance.
- If you use both the request and standard inputs:
- The request must be the second argument.
- Preprocessing transforms the inputs, but the request object stays unchanged.
postprocesscan’t take only the request; it must receive the model’s output.- If
predictuses only the request, you can’t usepreprocess.
Cancel requests in specific frameworks
TRT-LLM (polling-based cancellation)
For TensorRT-LLM, useresponse_iterator.cancel() to terminate streaming requests:
See full example in TensorRT-LLM Docs.
vLLM (abort API)
For vLLM, useengine.abort() to stop processing:
See full example in vLLM Docs.
Unsupported request features
- Streaming file uploads: Use URLs instead of embedding large data in the request.
- Client-side headers: Most headers are stripped; include necessary metadata in the payload.
Related pages
- The Model class: Write the
predict,chat_completions, and request-handling methods these endpoints call. - Custom Docker servers: Forward requests to any route your own container exposes.