Custom Truss models normally serve POST /predict with arbitrary JSON. If you want your deployment to also support OpenAI-style requests, define chat_completions or completions on your Model class. Use these methods when you want custom Python logic but still want clients to call your model through /v1/chat/completions or /v1/completions.

Which method to implement

Method              Endpoint                  Use it for
chat_completions    /v1/chat/completions      Chat-style payloads with a messages array.
completions         /v1/completions           Prompt-style payloads with a prompt field.
You can implement either method, or both, depending on the interface you want to expose.

chat_completions

Implement chat_completions when your model should accept OpenAI-compatible chat requests.
model/model.py
from typing import Any, Dict

class Model:
    def __init__(self, **kwargs):
        pass

    def load(self):
        pass

    async def predict(self, model_input: Dict[str, Any]):
        return {"output": model_input}

    async def chat_completions(self, model_input: Dict[str, Any], request):
        # Reuse your main inference path so /predict and /v1/chat/completions stay aligned.
        return await self.predict(model_input)
The request body follows the OpenAI chat schema, so model_input typically includes fields like:
  • messages
  • model
  • stream
  • sampling parameters such as temperature and max_tokens
If you already have a predict method that handles the same payload shape, chat_completions can simply delegate to it.
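If you are writing chat_completions from scratch rather than delegating, the return value should be shaped like an OpenAI chat.completion object. A minimal sketch (the helper name echo_chat_response is illustrative, not part of Truss) that echoes the last message back in that shape:

```python
import time
from typing import Any, Dict


def echo_chat_response(model_input: Dict[str, Any]) -> Dict[str, Any]:
    # Pull the last message from the OpenAI-style payload and wrap a reply
    # in a chat.completion-shaped response dict.
    messages = model_input.get("messages", [])
    last = messages[-1]["content"] if messages else ""
    return {
        "id": "chatcmpl-example",
        "object": "chat.completion",
        "created": int(time.time()),
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": f"You said: {last}"},
                "finish_reason": "stop",
            }
        ],
    }
```

In a real model, the assistant message content would come from your inference code instead of an echo.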

completions

Implement completions when your model should accept prompt-style completion requests.
model/model.py
from typing import Any, Dict

class Model:
    def __init__(self, **kwargs):
        pass

    def load(self):
        pass

    async def completions(self, model_input: Dict[str, Any], request):
        prompt = model_input["prompt"]
        return {
            "id": "cmpl-example",
            "object": "text_completion",
            "choices": [
                {
                    "index": 0,
                    "text": f"You sent: {prompt}",
                    "finish_reason": "stop",
                }
            ],
        }
Use completions for workloads such as autocomplete, prompt continuation, or fine-tuned models that are designed to extend text instead of following chat-style instructions.
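Completion-style payloads carry the same OpenAI sampling fields as chat payloads, so it is common to normalize them before running inference. A sketch of that step, assuming OpenAI-style field names and illustrative defaults (the helper name and default values are not prescribed by Truss):

```python
from typing import Any, Dict


def parse_sampling_params(model_input: Dict[str, Any]) -> Dict[str, Any]:
    # Extract common OpenAI sampling fields, falling back to defaults
    # when the client omits them.
    return {
        "temperature": float(model_input.get("temperature", 1.0)),
        "max_tokens": int(model_input.get("max_tokens", 16)),
        "top_p": float(model_input.get("top_p", 1.0)),
        "stream": bool(model_input.get("stream", False)),
    }
```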

Request and response expectations

  • These methods receive the parsed JSON payload as model_input.
  • If you include a second argument annotated as fastapi.Request, you can inspect disconnects or request metadata just like in predict. See Request handling.
  • Return JSON that matches the endpoint you expose. Baseten does not automatically convert an arbitrary predict response into OpenAI response objects for custom model code.
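Because Baseten does not reshape your return value for you, it can be useful to sanity-check responses before returning them. A hypothetical shape check for a chat.completion payload (the function is illustrative, not part of Truss):

```python
from typing import Any, Dict


def looks_like_chat_completion(resp: Dict[str, Any]) -> bool:
    # Minimal structural check for an OpenAI-style chat.completion object:
    # correct object type, a non-empty choices list, and a message with content.
    if resp.get("object") != "chat.completion":
        return False
    choices = resp.get("choices")
    if not isinstance(choices, list) or not choices:
        return False
    first = choices[0]
    return isinstance(first.get("message"), dict) and "content" in first["message"]
```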

Endpoint paths

When these methods are defined, your deployment can serve the matching OpenAI-style routes in addition to /predict.
  • POST /environments/{env}/sync/v1/chat/completions
  • POST /environments/{env}/sync/v1/completions
For production, replace {env} with production. For development deployments, use development.
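The two route patterns above differ only in their suffix, so clients can build them with simple string formatting. A small helper as a sketch (the function name and kind parameter are illustrative):

```python
def openai_route(env: str, kind: str = "chat") -> str:
    # Build the environment-scoped OpenAI-style path:
    # kind "chat" -> /v1/chat/completions, anything else -> /v1/completions.
    suffix = "chat/completions" if kind == "chat" else "completions"
    return f"/environments/{env}/sync/v1/{suffix}"
```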

When to use this vs. engine-based deployments

If you want Baseten to handle OpenAI-compatible serving, tokenization, and engine-level optimizations for popular LLMs, start with Your first model, Engine-Builder-LLM, or BIS-LLM. Use custom model code with chat_completions or completions when you need to:
  • add custom preprocessing or postprocessing around an OpenAI-style API
  • support a model architecture that is not covered by Baseten’s built-in engines
  • keep an existing client contract while running your own Python inference logic