By default, your deployment serves `POST /predict` with arbitrary JSON. If you want your deployment to also support OpenAI-style requests, define `chat_completions` or `completions` on your `Model` class.
Use these methods when you want custom Python logic but still want clients to call your model through `/v1/chat/completions` or `/v1/completions`.
Which method to implement
| Method | Endpoint | Use it for |
|---|---|---|
| `chat_completions` | `/v1/chat/completions` | Chat-style payloads with a `messages` array. |
| `completions` | `/v1/completions` | Prompt-style payloads with a `prompt` field. |
chat_completions
Implement `chat_completions` when your model should accept OpenAI-compatible chat requests.
model/model.py
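A minimal sketch of such a class follows. The `Model`/`load` structure mirrors typical Truss model code, but the stubbed inference and the exact response fields here are illustrative assumptions, not the official API:

```python
# model/model.py -- illustrative sketch, not the exact Baseten API.

class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # A real model would load weights here; this stub just echoes.
        self._model = lambda messages: f"echo: {messages[-1]['content']}"

    def chat_completions(self, model_input: dict) -> dict:
        # model_input is the parsed JSON body of the chat request.
        messages = model_input["messages"]
        text = self._model(messages)
        # Return an OpenAI-style chat.completion object.
        return {
            "id": "chatcmpl-sketch",
            "object": "chat.completion",
            "choices": [
                {
                    "index": 0,
                    "message": {"role": "assistant", "content": text},
                    "finish_reason": "stop",
                }
            ],
        }
```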
`model_input` typically includes fields like:

- `messages`
- `model`
- `stream`
- sampling parameters such as `temperature` and `max_tokens`
If you already have a `predict` method that handles the same payload shape, `chat_completions` can simply delegate to it.
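A hypothetical delegating implementation might look like this (sketch only; the payload and response shapes are assumptions):

```python
# Sketch: chat_completions delegating to an existing predict method.

class Model:
    def predict(self, model_input: dict) -> dict:
        # Existing inference logic that already understands chat-style payloads.
        last = model_input["messages"][-1]["content"]
        return {
            "object": "chat.completion",
            "choices": [
                {
                    "index": 0,
                    "message": {"role": "assistant", "content": f"echo: {last}"},
                    "finish_reason": "stop",
                }
            ],
        }

    def chat_completions(self, model_input: dict) -> dict:
        # Same payload shape, so just reuse predict.
        return self.predict(model_input)
```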
completions
Implement `completions` when your model should accept prompt-style completion requests.
model/model.py
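A hedged sketch of a prompt-style handler (the stub generator and the exact response fields are illustrative assumptions):

```python
# model/model.py -- illustrative sketch for prompt-style completions.

class Model:
    def load(self):
        # A real model would continue the prompt; this stub appends a marker.
        self._generate = lambda prompt, max_tokens: prompt + " ..."

    def completions(self, model_input: dict) -> dict:
        prompt = model_input["prompt"]
        max_tokens = model_input.get("max_tokens", 16)
        text = self._generate(prompt, max_tokens)
        # Return an OpenAI-style text_completion object.
        return {
            "id": "cmpl-sketch",
            "object": "text_completion",
            "choices": [
                {"index": 0, "text": text, "finish_reason": "length"}
            ],
        }
```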
Use `completions` for workloads such as autocomplete, prompt continuation, or fine-tuned models that are designed to extend text instead of following chat-style instructions.
Request and response expectations
- These methods receive the parsed JSON payload as `model_input`.
- If you include a second argument annotated as `fastapi.Request`, you can inspect disconnects or request metadata just like in `predict`. See Request handling.
- Return JSON that matches the endpoint you expose. Baseten does not automatically convert an arbitrary `predict` response into OpenAI response objects for custom model code.
Endpoint paths
When these methods are defined, your deployment can serve the matching OpenAI-style routes in addition to `/predict`.
- `POST /environments/{env}/sync/v1/chat/completions`
- `POST /environments/{env}/sync/v1/completions`
Replace `{env}` with `production`. For development deployments, use `development`.
When to use this vs. engine-based deployments
If you want Baseten to handle OpenAI-compatible serving, tokenization, and engine-level optimizations for popular LLMs, start with Your first model, Engine-Builder-LLM, or BIS-LLM. Use custom model code with `chat_completions` or `completions` when you need to:
- add custom preprocessing or postprocessing around an OpenAI-style API
- support a model architecture that is not covered by Baseten’s built-in engines
- keep an existing client contract while running your own Python inference logic