Custom Truss models normally serve POST /predict with arbitrary JSON. If you want your deployment to also support additional HTTP routes, define the matching methods on your Model class. This page covers the built-in HTTP routes available from standard Truss model code. (If you deploy a custom Docker container instead, Baseten can forward requests to any route exposed by the underlying server; see Custom Docker containers.) Use these methods when you want custom Python logic but still want clients to call your model through the server’s built-in HTTP endpoints.

Which method to implement

Method             Endpoint                Use it for
chat_completions   /v1/chat/completions    Chat-style payloads with a messages array.
completions        /v1/completions         Prompt-style payloads with a prompt field.
embeddings         /v1/embeddings          Embedding requests from text or token inputs.
messages           /v1/messages            Server-specific message payloads exposed by your deployment.
responses          /v1/responses           Server-specific response payloads exposed by your deployment.
You can implement any subset of these methods, depending on the interface you want to expose.

API families

Endpoint               Family
/v1/chat/completions   OpenAI-style chat completions
/v1/completions        OpenAI-style text completions
/v1/embeddings         OpenAI-style embeddings
/v1/responses          OpenAI-style responses
/v1/messages           Anthropic-style messages
This page uses HTTP endpoints as the umbrella term because Truss can expose endpoints from more than one API family.

chat_completions

Implement chat_completions when your model should accept chat requests.
model/model.py
from typing import Any, Dict

class Model:
    def __init__(self, **kwargs):
        pass

    def load(self):
        pass

    async def predict(self, model_input: Dict[str, Any]):
        return {"output": model_input}

    async def chat_completions(self, model_input: Dict[str, Any], request):
        # Reuse your main inference path so /predict and /v1/chat/completions stay aligned.
        return await self.predict(model_input)
The request body follows the chat schema, so model_input typically includes fields like:
  • messages
  • model
  • stream
  • sampling parameters such as temperature and max_tokens
If you already have a predict method that handles the same payload shape, chat_completions can simply delegate to it.
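As a hedged illustration, the fields above can be pulled out of model_input before running inference. The parse_chat_input helper and the flattening of messages into a single prompt string are illustrative choices, not part of Truss:

```python
from typing import Any, Dict, List, Optional, Tuple

def parse_chat_input(model_input: Dict[str, Any]) -> Tuple[str, float, Optional[int], bool]:
    """Extract common chat-schema fields from the parsed JSON payload."""
    messages: List[Dict[str, str]] = model_input["messages"]
    temperature = model_input.get("temperature", 1.0)
    max_tokens = model_input.get("max_tokens")  # None means no explicit cap
    stream = model_input.get("stream", False)
    # Flatten the conversation into a single prompt string (one simple option;
    # most real models apply their own chat template instead).
    prompt = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    return prompt, temperature, max_tokens, stream
```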

completions

Implement completions when your model should accept prompt-style completion requests.
model/model.py
from typing import Any, Dict

class Model:
    def __init__(self, **kwargs):
        pass

    def load(self):
        pass

    async def completions(self, model_input: Dict[str, Any], request):
        prompt = model_input["prompt"]
        return {
            "id": "cmpl-example",
            "object": "text_completion",
            "choices": [
                {
                    "index": 0,
                    "text": f"You sent: {prompt}",
                    "finish_reason": "stop",
                }
            ],
        }
Use completions for workloads such as autocomplete, prompt continuation, or fine-tuned models that are designed to extend text instead of following chat-style instructions.
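If the payload sets stream to true, one option is to return a generator, which the server can stream to the client chunk by chunk. This is a hedged sketch; the whitespace-based chunking is purely illustrative:

```python
from typing import Any, Dict, Iterator

def stream_completion(model_input: Dict[str, Any]) -> Iterator[str]:
    """Yield the completion one whitespace-delimited chunk at a time."""
    prompt = model_input["prompt"]
    text = f"You sent: {prompt}"
    for word in text.split(" "):
        yield word + " "
```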

embeddings, messages, and responses

Implement embeddings, messages, or responses when your deployment should expose those HTTP endpoints from custom model code.
model/model.py
from typing import Any, Dict

class Model:
    def __init__(self, **kwargs):
        pass

    def load(self):
        pass

    def embeddings(self, model_input: Dict[str, Any], request):
        return {"output": "embeddings"}

    def messages(self, model_input: Dict[str, Any], request):
        return {"output": "messages"}

    def responses(self, model_input: Dict[str, Any], request):
        return {"output": "responses"}
These methods are forwarded directly to the matching /v1/* route, so your implementation can return whatever JSON shape that endpoint expects. messages maps to the Anthropic-style /v1/messages route. embeddings and responses map to OpenAI-style /v1/embeddings and /v1/responses routes.
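For example, an embeddings implementation that follows the OpenAI response shape might look like the sketch below. The toy_embed function is a deterministic placeholder standing in for a real embedding model:

```python
import hashlib
from typing import Any, Dict, List

def toy_embed(text: str, dim: int = 4) -> List[float]:
    """Placeholder embedding derived from a hash (not a real model)."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

def embeddings_response(model_input: Dict[str, Any]) -> Dict[str, Any]:
    """Return an OpenAI-style /v1/embeddings response body."""
    inputs = model_input["input"]
    if isinstance(inputs, str):  # the schema allows a single string or a list
        inputs = [inputs]
    return {
        "object": "list",
        "data": [
            {"object": "embedding", "index": i, "embedding": toy_embed(t)}
            for i, t in enumerate(inputs)
        ],
        "model": model_input.get("model", "unknown"),
    }
```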

Request and response expectations

  • These methods receive the parsed JSON payload as model_input.
  • If you include a second argument annotated as fastapi.Request, you can inspect disconnects or request metadata just like in predict. See Request handling.
  • Return JSON that matches the endpoint you expose. Baseten does not automatically convert an arbitrary predict response into a different response object for custom model code.
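The optional request argument can be used like this (a hedged sketch; returning an empty object on disconnect is an illustrative choice, not required behavior):

```python
from typing import Any, Dict

class Model:
    async def chat_completions(self, model_input: Dict[str, Any], request) -> Dict[str, Any]:
        # `request` is the fastapi.Request for this call; bail out early if the
        # client has already gone away instead of running inference for nothing.
        if await request.is_disconnected():
            return {}
        return {"output": model_input}
```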

Endpoint paths

When these methods are defined, your deployment can serve the matching HTTP routes in addition to /predict.
/environments/{env}/sync/v1/chat/completions
/environments/{env}/sync/v1/completions
/environments/{env}/sync/v1/embeddings
/environments/{env}/sync/v1/messages
/environments/{env}/sync/v1/responses
For production, replace {env} with production. For development deployments, use development.
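Putting it together, a client can build these routes as shown below. The base URL is a hypothetical placeholder; substitute your deployment's actual model URL:

```python
BASE_URL = "https://model-abc123.api.baseten.co"  # hypothetical; use your model's URL

def endpoint(env: str, path: str) -> str:
    """Build the full environment-scoped route for a built-in endpoint."""
    return f"{BASE_URL}/environments/{env}/sync{path}"
```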