Use this endpoint with the OpenAI Python client to call any compatible model deployed on Baseten. If you’re serving a model with vLLM, this endpoint supports it out of the box.

This endpoint aims for feature parity with OpenAI Chat Completions. If you find an unsupported feature, don’t hesitate to reach out!

Calling the model

https://bridge.baseten.co/v1/direct
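
To point an existing OpenAI client at this endpoint, override the client’s base URL. A minimal sketch, assuming your Baseten API key is passed in place of an OpenAI key (the BASETEN_API_KEY environment variable name is illustrative):

```python
import os

from openai import OpenAI

# Point the standard OpenAI client at the Baseten bridge endpoint.
client = OpenAI(
    base_url="https://bridge.baseten.co/v1/direct",
    api_key=os.environ["BASETEN_API_KEY"],  # assumption: Baseten API key used in place of an OpenAI key
)
```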

Parameters

The parameters supported by an OpenAI Chat Completions request can be found in the OpenAI documentation. Pay special attention to the model parameter:

model
string
required

A string formatted as baseten/{model_id}[/{deployment_id}]. The deployment_id is optional; when it is not provided, the production deployment is used.
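
For example, a sketch of a non-streaming request using the client configured above. The model IDs shown (abcd1234, efgh5678) are placeholders, not real IDs:

```python
# Call the production deployment of a model.
response = client.chat.completions.create(
    model="baseten/abcd1234",  # placeholder model_id
    messages=[{"role": "user", "content": "What is Baseten?"}],
)
print(response.choices[0].message.content)

# Or target a specific deployment by appending its ID.
response = client.chat.completions.create(
    model="baseten/abcd1234/efgh5678",  # placeholder model_id/deployment_id
    messages=[{"role": "user", "content": "What is Baseten?"}],
)
```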

Output

Streaming and non-streaming responses are supported. The vLLM OpenAI-compatible server is a good example of how to serve model outputs.

For streaming outputs, the data format must comply with the Server-Sent Events (SSE) format. A helpful example for JSON payloads can be found here.
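
To illustrate the expected wire format, here is a minimal sketch of wrapping generated tokens as SSE events (framework details omitted; the chunk fields follow the OpenAI chat completion chunk shape, and the token source is assumed):

```python
import json
from typing import Iterator


def stream_sse(tokens: Iterator[str]) -> Iterator[str]:
    """Wrap generated tokens as Server-Sent Events."""
    for token in tokens:
        chunk = {
            "object": "chat.completion.chunk",
            "choices": [{"index": 0, "delta": {"content": token}, "finish_reason": None}],
        }
        # Each event is a single "data:" line followed by a blank line.
        yield f"data: {json.dumps(chunk)}\n\n"
    # Signal the end of the stream.
    yield "data: [DONE]\n\n"
```

On the client side, passing stream=True to chat.completions.create lets you iterate over these chunks as they arrive.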

Best Practices

  • Pin your openai package version in your requirements.txt file (see the sketch after this list). This helps avoid breaking changes introduced by package upgrades.
  • If you must make breaking changes to your Truss server (e.g., to introduce a new feature), first publish a new model deployment and then update your API call on the client side.
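
For example, a pinned entry in requirements.txt (the version shown is illustrative; pin the release you have tested against):

```
openai==1.35.0  # illustrative version; pin whatever you have tested
```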