Use this endpoint with the OpenAI Python client and any compatible model deployed on Baseten. If you’re serving a vLLM model in OpenAI-compatible mode, this endpoint supports that model out of the box.

If your model does not have an OpenAI-compatible mode, you can use the previous version of the bridge to make it compatible with OpenAI’s client, though with a more limited set of supported features.

Calling the model

https://bridge.baseten.co/v1/direct

Parameters

Parameters supported by the OpenAI ChatCompletions request can be found in the OpenAI documentation. Below are details about Baseten-specific arguments that must be passed into the bridge.

model
string
required

Typically the Hugging Face repo name (e.g. meta-llama/Meta-Llama-3.1-70B-Instruct). In some cases, it may be another default specified by your inference engine.

extra_body
dict
required

A Python dictionary that supplies extra arguments to the chat completion request.

extra_body.baseten
dict
required

Baseten-specific parameters to pass to the bridge, supplied as a dictionary.

extra_body.baseten.model_id
string
required

The string identifier for the target model.

extra_body.baseten.deployment_id
string

The string identifier for the target deployment. When deployment_id is not provided, the production deployment will be used.
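Putting these parameters together, below is a minimal sketch of a call through the bridge. It assumes your Baseten API key is available in the BASETEN_API_KEY environment variable and uses abcd1234 as a placeholder model_id; substitute your own values.

```python
import os

from openai import OpenAI

# Point the OpenAI client at the Baseten bridge endpoint.
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],  # assumes this env var is set
    base_url="https://bridge.baseten.co/v1/direct",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "What is Baseten?"}],
    extra_body={
        "baseten": {
            "model_id": "abcd1234",  # placeholder; use your model's ID
            # "deployment_id": "wxyz5678",  # optional; defaults to production
        }
    },
)

print(response.choices[0].message.content)
```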

Output

Streaming and non-streaming responses are supported. The vLLM OpenAI Server is a good example of how to serve your model’s results.

For streaming outputs, the data must comply with the Server-Sent Events (SSE) format. A helpful example for JSON payloads can be found here.
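As a sketch of consuming a streamed response with the client configured above, the OpenAI client parses the SSE stream into chat.completion.chunk objects for you (model_id is again a placeholder):

```python
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
    extra_body={"baseten": {"model_id": "abcd1234"}},  # placeholder model_id
)

# Print tokens as they arrive; some chunks may carry no content delta.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```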

Best Practices

  • Pin your openai package version in your requirements.txt file (see the example after this list). This helps avoid breaking changes introduced through package upgrades.
  • If you must make breaking changes to your Truss server (e.g., to introduce a new feature), first publish a new model deployment and then update your API call on the client side.
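For example, a pinned entry in requirements.txt might look like the following (the exact version shown is illustrative, not a recommendation):

```
openai==1.35.0
```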