ChatCompletions
Use this endpoint with the OpenAI Python client and any deployment of a compatible model on Baseten. If you're serving a vLLM model in OpenAI-compatible mode, this endpoint supports that model out of the box.
If your model does not have an OpenAI-compatible mode, you can use the previous version of the bridge to make it compatible with OpenAI's client, though with a more limited set of supported features.
Calling the model
Parameters
Parameters supported by the OpenAI ChatCompletions request can be found in the OpenAI documentation. Below are details about Baseten-specific arguments that must be passed into the bridge.
- `model`: Typically the Hugging Face repo name (e.g. meta-llama/Meta-Llama-3.1-70B-Instruct). In some cases, it may be another default specified by your inference engine.
- `extra_body`: Python dictionary that supplies extra arguments to the chat completion request.
- `baseten` (inside `extra_body`): Baseten-specific parameters that should be passed to the bridge, provided as a dictionary.
- `model_id`: The string identifier for the target model.
- `deployment_id`: The string identifier for the target deployment. When `deployment_id` is not provided, the production deployment is used.
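Putting these parameters together, a call through the bridge might look like the sketch below. The base URL, API key, and the `model_id` / `deployment_id` values are placeholder assumptions, not real identifiers; substitute the values from your own Baseten workspace.

```python
# Minimal sketch of a ChatCompletions call through the Baseten bridge.
# The base_url and the model_id / deployment_id values are placeholders --
# use the values from your own workspace and deployment.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_BASETEN_API_KEY",                   # Baseten API key, not an OpenAI key
    base_url="https://bridge.baseten.co/v1/direct",   # assumed bridge URL
)

response = client.chat.completions.create(
    # Typically the Hugging Face repo name served by your inference engine
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "What is a Mixture of Experts model?"}],
    # Baseten-specific routing arguments, passed via extra_body
    extra_body={
        "baseten": {
            "model_id": "abcd1234",       # placeholder: target model
            "deployment_id": "wxyz6789",  # optional: defaults to production
        }
    },
)

print(response.choices[0].message.content)
```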
Output
Streaming and non-streaming responses are supported. The vLLM OpenAI Server is a good example of how to serve your model results.
For streaming outputs, the data format must comply with the Server-Sent Events (SSE) format. A helpful example for JSON payloads can be found here.
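As an illustration, a streaming response can be consumed on the client side with the OpenAI SDK as in the sketch below; it reuses the same placeholder base URL and model ID assumed in the example above.

```python
# Sketch of consuming a streaming response. The SDK surfaces the SSE
# events as ChatCompletionChunk objects with incremental content deltas.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_BASETEN_API_KEY",
    base_url="https://bridge.baseten.co/v1/direct",   # assumed bridge URL
)

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
    extra_body={"baseten": {"model_id": "abcd1234"}},  # placeholder model ID
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```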
Best Practices
- Pin your `openai` package version in your requirements.txt file (see the example below). This helps avoid breaking changes introduced through package upgrades.
- If you must make breaking changes to your Truss server (e.g. to introduce a new feature), first publish a new model deployment, then update your API call on the client side.
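For example, a pinned entry in requirements.txt might look like this (the version number is illustrative; pin whichever release you have tested against):

```txt
# requirements.txt -- pin the OpenAI client to a tested release
openai==1.35.0
```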