TrainingJob using Baseten Training. In this demo, we’ll create a finetuned revision of OpenAI’s gpt-oss-20b!
Prerequisites
Before you begin, ensure you have the following:
- Baseten Account: You’ll need an active Baseten account. If you don’t have one, please sign up on the Baseten web app.
- API Key: Obtain an API key for your Baseten account. This key is required to authenticate with the Baseten API and SDK.
- Truss SDK and CLI: The truss package provides a Python-native way to define and run your training jobs. The CLI provides a convenient way to deploy and manage them. Install or update it with pip, for example: pip install --upgrade truss
- Dependencies: In this demo, we’ll use Hugging Face to access and upload models. We recommend creating a Hugging Face access token and adding it to your Baseten Secrets. It can also be helpful to visualize your training run; this example uses Weights & Biases (wandb), which is optional.
Step 1: Define your training configuration
The primary way to define your training jobs is through a Python configuration file, typically named config.py. This file uses the truss package to specify all aspects of your TrainingProject and TrainingJob.
A simple example of a config.py file is shown below:
config.py
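As a minimal sketch of what such a file can look like, the example below assumes the `truss_train` definitions module that ships with the `truss` package. The base image, project name, accelerator choice, and environment variable names are placeholders chosen for illustration, not values from this guide; the secret names match the ones mentioned under the key considerations.

```python
# config.py (illustrative sketch; adjust every placeholder to your setup)
from truss.base import truss_config
from truss_train import definitions

# Placeholder image name (assumption): use any image that contains your training stack.
BASE_IMAGE = "your-registry/your-training-image:latest"

# Commands run inside the container once it starts, plus the environment they see.
training_runtime = definitions.Runtime(
    start_commands=["/bin/sh -c 'chmod +x ./run.sh && ./run.sh'"],
    environment_variables={
        # Resolved from your Baseten workspace secrets at run time.
        "HF_TOKEN": definitions.SecretReference(name="hf_access_token"),
        "WANDB_API_KEY": definitions.SecretReference(name="wandb_api_key"),
    },
)

# Hardware for the job; one H100 is an illustrative choice, not a recommendation.
training_compute = definitions.Compute(
    accelerator=truss_config.AcceleratorSpec(
        accelerator=truss_config.Accelerator.H100,
        count=1,
    ),
)

training_job = definitions.TrainingJob(
    image=definitions.Image(base_image=BASE_IMAGE),
    compute=training_compute,
    runtime=training_runtime,
)

# Hypothetical project name; jobs are grouped under the project.
training_project = definitions.TrainingProject(
    name="gpt-oss-20b-finetune-demo",
    job=training_job,
)
```

Check the SDK reference for the authoritative field names and defaults before relying on this layout.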
Key considerations for your Baseten training configuration file
- Local Artifacts: If your training requires local scripts (like a train.py or a run.sh), helper files, or configuration files (e.g., accelerate config), place them in the same directory as your config.py or in subdirectories. When you push the training job, truss will package these artifacts and upload them. They will be copied into the container at the root of the base image’s working directory.
- Ignore Folders and Files: You can exclude specific files from being pushed by creating a .truss_ignore file in the root directory of your project. In this file, you can add entries in a style similar to .gitignore. Refer to the CLI reference for more details.
- Secrets: Ensure any secrets referenced via SecretReference (e.g., hf_access_token, wandb_api_key) are defined in your Baseten workspace settings.
- Private Images: You can deploy your jobs with private images by specifying a DockerAuth in your Image configuration. See our DockerAuth SDK for more details.
TrainingJob type, check out our SDK-reference.
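To make the ignore-file behaviour more concrete, here is a small standard-library sketch of `.gitignore`-style filtering. It only approximates the idea: the helper names `parse_ignore_file` and `is_ignored` are hypothetical, and the exact matching rules that truss applies to `.truss_ignore` may differ, so treat the CLI reference as authoritative.

```python
import fnmatch
from pathlib import PurePosixPath


def parse_ignore_file(text: str) -> list[str]:
    """Return the non-empty, non-comment patterns from ignore-file text."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def is_ignored(path: str, patterns: list[str]) -> bool:
    """Approximate .gitignore-style matching for a relative POSIX path."""
    parts = PurePosixPath(path).parts
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: ignore anything below a directory with that name.
            if pattern.rstrip("/") in parts[:-1]:
                return True
        elif fnmatch.fnmatch(path, pattern) or fnmatch.fnmatch(parts[-1], pattern):
            # File pattern: match against the full path or just the file name.
            return True
    return False


# Example ignore-file contents (illustrative only).
ignore_text = """
# local experiment output
outputs/
*.log
.env
"""

patterns = parse_ignore_file(ignore_text)
files = ["config.py", "run.sh", "train.py", "outputs/ckpt.bin", "debug.log", ".env"]
packaged = [f for f in files if not is_ignored(f, patterns)]
print(packaged)
```

In this sketch only config.py, run.sh, and train.py would be packaged; the checkpoint directory, log file, and local .env file are left out of the upload.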
What can I run in the start_commands?
In short, anything! Baseten Training is a framework-agnostic training platform. Any training framework and training methodology is supported. Typically, a run.sh script is used. An example might look like this:
run.sh
train.py
below.
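Because the platform is framework-agnostic, the script that a start command launches can be any program. The sketch below is a hypothetical, standard-library-only entry point rather than the guide's own train.py: it assumes a secret is exposed to the container as an environment variable named `HF_TOKEN` (an illustrative name), and a toy gradient-descent loop stands in for a real training framework.

```python
import argparse
from collections.abc import Mapping


def require_env(env: Mapping[str, str], name: str) -> str:
    """Fail fast with a clear message when a required secret is missing."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


def train(epochs: int, learning_rate: float) -> list[float]:
    """Toy stand-in for a real framework: minimize (w - 3)^2 by gradient descent."""
    w = 0.0
    losses = []
    for _ in range(epochs):
        grad = 2 * (w - 3.0)
        w -= learning_rate * grad
        losses.append((w - 3.0) ** 2)
    return losses


def main(argv: list[str], env: Mapping[str, str]) -> float:
    parser = argparse.ArgumentParser(description="Illustrative training entry point")
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--learning-rate", type=float, default=0.1)
    args = parser.parse_args(argv)

    # HF_TOKEN is an assumed variable name; use whatever your config defines.
    require_env(env, "HF_TOKEN")

    losses = train(args.epochs, args.learning_rate)
    print(f"final loss after {args.epochs} epochs: {losses[-1]:.4f}")
    return losses[-1]


# In a real script this would be main(sys.argv[1:], os.environ) under an
# `if __name__ == "__main__":` guard; a fake environment keeps the sketch self-contained.
final_loss = main(["--epochs", "3"], env={"HF_TOKEN": "hf_placeholder"})
```

Checking for required secrets before any heavy work starts means a misconfigured job fails in seconds instead of after model downloads have begun.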
Training Different Models
This recipe and more can be found at Baseten’s ML Cookbook. Clone the repo to get the starter code for this demo, along with other training and finetuning examples!
Additional features
We’ve kept the above config simple to help you get off the ground, but there’s a lot more you can do with Baseten Training:
- Checkpointing - automatically save and deploy your model checkpoints.
- Training Cache - speed up training by caching data and models between jobs.
- Multinode - train on multiple GPU nodes to make the most out of your compute.
Step 2: Submit your training job
Once your config.py and any local artifacts are ready, submit the training job using the truss CLI, for example with truss train push config.py. This command:
- Parses your config.py.
- Packages any local files in the directory (and subdirectories) alongside config.py.
- Creates or updates the TrainingProject specified in your config.
- Submits the defined TrainingJob under that project.
https://app.baseten.co/training/
Next steps
- Core Concepts: Deepen your understanding of Baseten Training and explore key features like CheckpointingConfig, Training Cache, and Multinode.
- Management: Learn how to check status, view logs and metrics, and stop jobs.