This quickstart walks you through calling an LLM on Baseten using Model APIs. Sign up, create an API key, and make a chat completion request in just a few minutes with no model deployment required. Model APIs provide OpenAI-compatible endpoints for high-performance open-source models. If your code already works with the OpenAI SDK, it works with Baseten. Change the base URL and API key to start running inference.

Prerequisites

Before you start, sign up for a Baseten account and create an API key. Export the key so the examples below can read it:
export BASETEN_API_KEY=<your-api-key>

Run inference

Call a model using the OpenAI SDK. This example uses GLM-4.7, but you can substitute any model from the supported models list.
Install the OpenAI SDK if you don’t have it:
uv pip install openai
Create a chat completion:
chat.py
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)

response = client.chat.completions.create(
    model="zai-org/GLM-4.7",
    messages=[
        {"role": "user", "content": "What is inference in machine learning?"}
    ],
)

print(response.choices[0].message.content)
Success looks like this:
Inference in machine learning refers to the process of using a trained model
to make predictions or generate outputs from new input data...
That’s it. You’re running inference on Baseten.

Stream the response

For real-time applications, pass stream=True to receive tokens as they’re generated:
stream.py
stream = client.chat.completions.create(
    model="zai-org/GLM-4.7",
    messages=[
        {"role": "user", "content": "Write a haiku about machine learning."}
    ],
    stream=True,
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="")
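Each streamed chunk carries a delta rather than a full message, and some chunks (such as the ones that carry only the role or the finish reason) have no content, which is why the loop guards with if content. A local sketch with stand-in chunk objects (no API call) shows how the deltas assemble:

```python
from types import SimpleNamespace

# Stand-in chunks mimicking the shape of streamed chat-completion deltas;
# the first and last deltas have content=None, as role/finish chunks do.
chunks = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=c))])
    for c in [None, "Silent ", "gradients ", "descend.", None]
]

parts = []
for chunk in chunks:
    content = chunk.choices[0].delta.content
    if content:  # skip deltas that carry no text
        parts.append(content)

print("".join(parts))  # → Silent gradients descend.
```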

Explore Model API features

Model APIs support the full OpenAI Chat Completions API. Constrain outputs to a JSON schema, let the model call functions you define, or enable extended thinking for complex tasks. See the Model APIs documentation for the full parameter reference and supported models.

Deploy your own model

Model APIs offer the fastest start, but when you need dedicated infrastructure or want to run a model Baseten doesn’t host, deploy your own with Truss. A config.yaml is all it takes. Point Truss at a Hugging Face model, choose a GPU, and run truss push:
config.yaml
model_name: Qwen-2.5-3B
resources:
  accelerator: L4
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: HF
      repo: "Qwen/Qwen2.5-3B-Instruct"
Baseten builds a TensorRT-optimized container and provides an OpenAI-compatible endpoint.

Deploy your first model

Walk through a full config-only deployment from scratch.

Choose an inference engine

Every deployment on Baseten uses an inference engine tuned for the model’s architecture. The engine handles quantization, tensor parallelism, KV cache management, and batching. Select the engine in your config.yaml, or let Baseten choose one automatically based on the model.

Build multi-step workflows

Some applications need more than a single model call. A RAG pipeline retrieves documents, embeds them, and generates a response. An image generation workflow runs a diffusion model, upscales the result, and applies safety filtering. Chains orchestrates these multi-step pipelines, with each step running on its own hardware and scaling independently.
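To make the shape of such a pipeline concrete, here is a plain-Python sketch of the RAG steps described above. This is not the Chains API (see the Chains docs for that), and the retrieval, embedding, and generation functions are toy stand-ins; in Chains, each step would be its own chainlet on its own hardware:

```python
# Plain-Python sketch of a retrieve -> embed -> generate pipeline.
# Each function is a toy local stand-in for a pipeline step.

DOCS = [
    "Inference is running a trained model on new inputs.",
    "Training adjusts model weights from labeled data.",
]

def retrieve(query: str) -> list[str]:
    """Toy retrieval: keep documents sharing a word with the query."""
    words = set(query.lower().split())
    return [d for d in DOCS if words & set(d.lower().split())]

def embed(texts: list[str]) -> list[list[float]]:
    """Toy embedding: vowel-frequency vector per text (stand-in)."""
    return [[text.count(ch) / len(text) for ch in "aeiou"] for text in texts]

def generate(query: str, context: list[str]) -> str:
    """Toy generation: splice retrieved context into a canned answer."""
    return f"Q: {query} | context: {' '.join(context)}"

def rag_pipeline(query: str) -> str:
    docs = retrieve(query)
    _ = embed(docs)  # embeddings would feed a vector search in a real pipeline
    return generate(query, docs)

print(rag_pipeline("What is inference?"))
```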

Get started with Chains

Build your first multi-step pipeline.

Train and fine-tune models

Baseten provides training infrastructure for fine-tuning and pre-training. Bring your training scripts (Axolotl, TRL, or custom code) and run jobs on H100 or H200 GPUs. Push a training job and deploy the result in two commands:
truss train push config.yaml
truss train deploy_checkpoints --training-job-id <job-id>

Get started with training

Run your first fine-tuning job and deploy the checkpoint.

Scale and monitor in production

Every deployment on Baseten runs on autoscaling infrastructure that adjusts replicas based on traffic. Models scale to zero when idle and scale up within seconds when requests arrive. Built-in observability gives you real-time metrics, logs, and request traces for every deployment.

Find your path

If you’re integrating a model into your application, start with Model APIs and explore the features that support production use cases.

Next steps