This quickstart walks you through calling an LLM on Baseten using Model APIs. Sign up, create an API key, and make a chat completion request in just a few minutes with no model deployment required. Model APIs provide OpenAI-compatible endpoints for high-performance open-source models. If your code already works with the OpenAI SDK, it works with Baseten. Change the base URL and API key to start running inference.

Prerequisites

Before running the examples, sign up for Baseten, create an API key, and export it as the BASETEN_API_KEY environment variable, which the code below reads.
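The code in this quickstart reads the key from the BASETEN_API_KEY environment variable. In a POSIX shell, set it like this (placeholder value shown; substitute your real key):

```shell
# Replace YOUR_API_KEY with the key from your Baseten account.
export BASETEN_API_KEY="YOUR_API_KEY"
```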

Run inference

Call a model using the OpenAI SDK. This example uses GLM-4.7, but you can substitute any model from the supported models list.
Install the OpenAI SDK if you don’t have it:
uv pip install openai
Create a chat completion:
chat.py
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)

response = client.chat.completions.create(
    model="zai-org/GLM-4.7",
    messages=[
        {"role": "user", "content": "What is inference in machine learning?"}
    ],
)

print(response.choices[0].message.content)
Success looks like this:
Inference in machine learning refers to the process of using a trained model
to make predictions or generate outputs from new input data...
That’s it. You’re running inference on Baseten.

Stream the response

For real-time applications, pass stream=True to receive tokens as they’re generated:
stream.py
stream = client.chat.completions.create(
    model="zai-org/GLM-4.7",
    messages=[
        {"role": "user", "content": "Write a haiku about machine learning."}
    ],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue  # some chunks (e.g. a final usage chunk) carry no choices
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

Explore Model API features

Model APIs support the full OpenAI Chat Completions API. Constrain outputs to a JSON schema, let the model call functions you define, or enable extended thinking for complex tasks. See the Model APIs documentation for the full parameter reference and supported models.

Structured outputs

Generate JSON that conforms to a schema you define.
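The request pattern is the same as the chat completion above, with an added response_format argument following the OpenAI json_schema convention. A minimal sketch of such a payload; the schema name and fields ("term", "definition") are hypothetical, not from the Baseten docs:

```python
# Hypothetical structured-output schema for a glossary entry.
# "strict": True asks the model to conform exactly to the schema.
definition_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "glossary_entry",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "term": {"type": "string"},
                "definition": {"type": "string"},
            },
            "required": ["term", "definition"],
            "additionalProperties": False,
        },
    },
}
```

Pass it as response_format=definition_schema in client.chat.completions.create(...); the returned message content is then a JSON string matching the schema, which you can parse with json.loads.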

Tool calling

Let the model invoke functions and use the results in its response.
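In the OpenAI-compatible API, tools are declared as JSON descriptions of functions the model may call. A sketch of one hypothetical tool definition (the function name and parameters are illustrative, not from the Baseten docs):

```python
# One hypothetical tool the model may choose to invoke.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]
```

Pass tools=tools to client.chat.completions.create(...). If the model decides to call the function, the response message carries tool_calls with the function name and JSON-encoded arguments; execute the function yourself and send the result back in a follow-up message with role "tool".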

Reasoning

Enable extended thinking for multi-step problem solving.
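Parameter names for extended thinking vary by model; this sketch assumes an OpenAI-style reasoning_effort argument, which may differ on Baseten — check the Model APIs parameter reference for what your chosen model accepts:

```python
# Hypothetical request arguments; "reasoning_effort" follows the OpenAI
# Chat Completions convention and may not apply to every model.
request_args = {
    "model": "zai-org/GLM-4.7",
    "reasoning_effort": "high",
    "messages": [
        {
            "role": "user",
            "content": "Plan a migration from REST to gRPC in three steps.",
        }
    ],
}
```

You would then call client.chat.completions.create(**request_args) as in the earlier examples.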

Next steps

Platform overview

Deploy models, run multi-step pipelines, train and fine-tune — see everything Baseten offers.

Deploy your first model

Go beyond Model APIs with a config-only Truss deployment on dedicated GPUs.