Custom Truss models normally serve POST /predict with arbitrary JSON. If you want your deployment to also support OpenAI-style requests, define chat_completions or completions on your Model class. Use these methods when you want custom Python logic but still want clients to call your model through /v1/chat/completions or /v1/completions.

Which method to implement

Method              Endpoint                  Use it for
chat_completions    /v1/chat/completions      Chat-style payloads with a messages array.
completions         /v1/completions           Prompt-style payloads with a prompt field.
You can implement either method, or both, depending on the interface you want to expose.

chat_completions

Implement chat_completions when your model should accept OpenAI-compatible chat requests.
model/model.py
from typing import Any, Dict

class Model:
    def __init__(self, **kwargs):
        pass

    def load(self):
        pass

    async def predict(self, model_input: Dict[str, Any]):
        return {"output": model_input}

    async def chat_completions(self, model_input: Dict[str, Any], request):
        # Reuse your main inference path so /predict and /v1/chat/completions stay aligned.
        return await self.predict(model_input)
The request body follows the OpenAI chat schema, so model_input typically includes fields like:
  • messages
  • model
  • stream
  • sampling parameters such as temperature and max_tokens
If you already have a predict method that handles the same payload shape, chat_completions can simply delegate to it.
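If you are writing chat_completions from scratch rather than delegating, the return value should be shaped like an OpenAI chat.completion object. A minimal sketch (the helper name echo_chat_response is illustrative, not part of Truss) that echoes the last message back in that shape:

```python
import time
from typing import Any, Dict


def echo_chat_response(model_input: Dict[str, Any]) -> Dict[str, Any]:
    # Pull the last message from the OpenAI-style payload and wrap a reply
    # in a chat.completion-shaped response dict.
    messages = model_input.get("messages", [])
    last = messages[-1]["content"] if messages else ""
    return {
        "id": "chatcmpl-example",
        "object": "chat.completion",
        "created": int(time.time()),
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": f"You said: {last}"},
                "finish_reason": "stop",
            }
        ],
    }
```

In a real model, the assistant message content would come from your inference code instead of an echo.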

completions

Implement completions when your model should accept prompt-style completion requests.
model/model.py
from typing import Any, Dict

class Model:
    def __init__(self, **kwargs):
        pass

    def load(self):
        pass

    async def completions(self, model_input: Dict[str, Any], request):
        prompt = model_input["prompt"]
        return {
            "id": "cmpl-example",
            "object": "text_completion",
            "choices": [
                {
                    "index": 0,
                    "text": f"You sent: {prompt}",
                    "finish_reason": "stop",
                }
            ],
        }
Use completions for workloads such as autocomplete, prompt continuation, or fine-tuned models that are designed to extend text instead of following chat-style instructions.
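Completion-style payloads carry the same OpenAI sampling fields as chat payloads, so it is common to normalize them before running inference. A sketch of that step, assuming OpenAI-style field names and illustrative defaults (the helper name and default values are not prescribed by Truss):

```python
from typing import Any, Dict


def parse_sampling_params(model_input: Dict[str, Any]) -> Dict[str, Any]:
    # Extract common OpenAI sampling fields, falling back to defaults
    # when the client omits them.
    return {
        "temperature": float(model_input.get("temperature", 1.0)),
        "max_tokens": int(model_input.get("max_tokens", 16)),
        "top_p": float(model_input.get("top_p", 1.0)),
        "stream": bool(model_input.get("stream", False)),
    }
```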

Request and response expectations

  • These methods receive the parsed JSON payload as model_input.
  • If you include a second argument annotated as fastapi.Request, you can inspect disconnects or request metadata just like in predict. See Request handling.
  • Return JSON that matches the endpoint you expose. Baseten does not automatically convert an arbitrary predict response into OpenAI response objects for custom model code.
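Because Baseten does not reshape your return value for you, it can be useful to sanity-check responses before returning them. A hypothetical shape check for a chat.completion payload (the function is illustrative, not part of Truss):

```python
from typing import Any, Dict


def looks_like_chat_completion(resp: Dict[str, Any]) -> bool:
    # Minimal structural check for an OpenAI-style chat.completion object:
    # correct object type, a non-empty choices list, and a message with content.
    if resp.get("object") != "chat.completion":
        return False
    choices = resp.get("choices")
    if not isinstance(choices, list) or not choices:
        return False
    first = choices[0]
    return isinstance(first.get("message"), dict) and "content" in first["message"]
```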

Endpoint paths

When these methods are defined, your deployment can serve the matching OpenAI-style routes in addition to /predict.
  • POST /environments/{env}/sync/v1/chat/completions
  • POST /environments/{env}/sync/v1/completions
For production, replace {env} with production. For development deployments, use development.
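The two route patterns above differ only in their suffix, so clients can build them with simple string formatting. A small helper as a sketch (the function name and kind parameter are illustrative):

```python
def openai_route(env: str, kind: str = "chat") -> str:
    # Build the environment-scoped OpenAI-style path:
    # kind "chat" -> /v1/chat/completions, anything else -> /v1/completions.
    suffix = "chat/completions" if kind == "chat" else "completions"
    return f"/environments/{env}/sync/v1/{suffix}"
```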

When to use this vs. engine-based deployments

If you want Baseten to handle OpenAI-compatible serving, tokenization, and engine-level optimizations for popular LLMs, start with Your first model, Engine-Builder-LLM, or BIS-LLM. Use custom model code with chat_completions or completions when you need to:
  • add custom preprocessing or postprocessing around an OpenAI-style API
  • support a model architecture that is not covered by Baseten’s built-in engines
  • keep an existing client contract while running your own Python inference logic