Prerequisites
- A Baseten account with an API key
- Python 3.9+ (check with python3 --version)
- A package manager: uv (recommended) or pip (macOS, Linux, or Windows)
Run inference
Call a model using the OpenAI SDK. This example uses GLM-4.7, but you can substitute any model from the supported models list. Examples are available in Python, JavaScript, and cURL.
Install the OpenAI SDK if you don't have it, then create a chat completion:
chat.py
Stream the response
For real-time applications, set stream: true to receive tokens as they're generated:
stream.py
Explore Model API features
Model APIs support the full OpenAI Chat Completions API. Constrain outputs to a JSON schema, let the model call functions you define, or enable extended thinking for complex tasks. See the Model APIs documentation for the full parameter reference and supported models.
Structured outputs
Generate JSON that conforms to a schema you define.
Tool calling
Let the model invoke functions and use the results in its response.
Reasoning
Enable extended thinking for multi-step problem solving.
Deploy your own model
Model APIs offer the fastest start, but when you need dedicated infrastructure or want to run a model Baseten doesn't host, deploy your own with Truss. A config.yaml is all it takes. Point Truss at a Hugging Face model, choose a GPU, and run truss push:
config.yaml
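As a sketch, a config-only deployment can look like the following. The trt_llm fields follow Truss's engine-builder conventions, and the model name, repo, and accelerator values are illustrative; check the Truss configuration reference for the exact fields:

```yaml
model_name: llama-3-8b-instruct
resources:
  accelerator: H100
trt_llm:
  build:
    checkpoint_repository:
      source: HF
      repo: meta-llama/Llama-3.1-8B-Instruct
```

Running truss push in the directory containing this file deploys the model to your Baseten account.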
Deploy your first model
Walk through a full config-only deployment from scratch.
Choose an inference engine
Every deployment on Baseten uses an inference engine tuned for the model's architecture. The engine handles quantization, tensor parallelism, KV cache management, and batching. Select the engine in your config.yaml, or Baseten selects it automatically based on the model.
Engine-Builder-LLM
Dense text generation models compiled with TensorRT-LLM. Supports lookahead decoding and structured outputs.
BIS-LLM
Large mixture-of-experts models like DeepSeek R1 and Qwen3 MoE with KV-aware routing and distributed inference.
BEI
Embedding, reranking, and classification models with up to 1,400 embeddings per second throughput.
Build multi-step workflows
Some applications need more than a single model call. A RAG pipeline retrieves documents, embeds them, and generates a response. An image generation workflow runs a diffusion model, upscales the result, and applies safety filtering. Chains orchestrates these multi-step pipelines, with each step running on its own hardware and scaling independently.
Get started with Chains
Build your first multi-step pipeline.
Train and fine-tune models
Baseten provides training infrastructure for fine-tuning and pre-training. Bring your training scripts (Axolotl, TRL, or custom code) and run jobs on H100 or H200 GPUs. Push a training job and deploy the result in two commands:
Get started with training
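As a sketch of the two-command flow, the Truss CLI exposes training subcommands along these lines; the command names and the config filename are assumptions, so check the training docs for exact syntax:

```shell
# Push a training job defined in a local config (filename illustrative)
truss train push my_training_config.py

# Deploy the checkpoint produced by the finished job
truss train deploy_checkpoints
```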
Run your first fine-tuning job and deploy the checkpoint.
Scale and monitor in production
Every deployment on Baseten runs on autoscaling infrastructure that adjusts replicas based on traffic. Models scale to zero when idle and scale up within seconds when requests arrive. Built-in observability gives you real-time metrics, logs, and request traces for every deployment.
Autoscaling
Configure replicas, concurrency targets, and scale-to-zero.
Observability
Monitor performance with metrics, logs, and traces.
Find your path
- Build AI applications
- Deploy and optimize models
- Train and fine-tune
If you’re integrating a model into your application, start with Model APIs and explore the features that support production use cases.