This guide walks you through defining and submitting a TrainingJob using Baseten Training. In this demo, we’ll create a finetuned revision
of OpenAI’s gpt-oss-20b!
Prerequisites
Before you begin, ensure you have the following:
- Baseten Account: You’ll need an active Baseten account. If you don’t have one, please sign up on the Baseten web app.
- API Key: Obtain an API key for your Baseten account. This key is required to authenticate with the Baseten API and SDK.
- Truss SDK and CLI: The `truss` package provides a Python-native way to define and run your training jobs, and the CLI provides a convenient way to deploy and manage them. Install or update it with `pip install --upgrade truss`.
- Dependencies: In this demo, we’ll use Hugging Face to access and upload models. It’s recommended that you create a Hugging Face access token and add it to your Baseten Secrets. Additionally, it can be helpful to visualize your training run; in this example, we use Weights & Biases (wandb). This is optional.
Step 1: Define your training configuration
Optional: Initialize configuration with truss train init
To download everything described in this step, run `truss train init`.
Training configuration details
The primary way to define your training jobs is through a Python configuration file, typically named `config.py`. This file uses the `truss` package to specify all
aspects of your TrainingProject and TrainingJob.
A simple example of a config.py file is shown below:
config.py
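The sketch below is illustrative rather than definitive: it assumes the `truss_train` definitions module exposes the classes referenced throughout this guide (`TrainingProject`, `TrainingJob`, `Image`, `Runtime`, `SecretReference`), and the `Compute`/`AcceleratorSpec` fields, base image, and GPU selection are placeholders to verify against the SDK reference.

```python
# config.py -- illustrative sketch; verify class and field names against the SDK reference.
from truss.base import truss_config  # assumed import path for accelerator specs
from truss_train import definitions  # assumed import path for training definitions

# Runtime: what to run inside the container, plus secrets exposed as environment variables.
training_runtime = definitions.Runtime(
    start_commands=["/bin/sh -c 'chmod +x ./run.sh && ./run.sh'"],
    environment_variables={
        "HF_TOKEN": definitions.SecretReference(name="hf_access_token"),
        "WANDB_API_KEY": definitions.SecretReference(name="wandb_api_key"),
    },
)

# Compute: the hardware the job runs on (GPU type and count are illustrative).
training_compute = definitions.Compute(
    accelerator=truss_config.AcceleratorSpec(
        accelerator=truss_config.Accelerator.H100,
        count=4,
    ),
)

# The training job itself: base image + compute + runtime.
training_job = definitions.TrainingJob(
    image=definitions.Image(base_image="pytorch/pytorch:2.3.1-cuda12.1-cudnn8-devel"),
    compute=training_compute,
    runtime=training_runtime,
)

# The project that groups related jobs; this is what gets created or updated on push.
training_project = definitions.TrainingProject(
    name="gpt-oss-20b-finetune",
    job=training_job,
)
```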
Key considerations for your Baseten training configuration file
- Local Artifacts: If your training requires local scripts (like a `train.py` or a `run.sh`), helper files, or configuration files (e.g., an accelerate config), place them in the same directory as your `config.py` or in subdirectories. When you push the training job, `truss` will package these artifacts and upload them. They will be copied into the container at the root of the base image’s working directory.
- Ignore Folders and Files: You can exclude specific files from being pushed by creating a `.truss_ignore` file in the root directory of your project. In this file, you can add entries in a style similar to `.gitignore`. Refer to the CLI reference for more details.
- Secrets: Ensure any secrets referenced via `SecretReference` (e.g., `hf_access_token`, `wandb_api_key`) are defined in your Baseten workspace settings.
- Private Images: You can deploy your jobs with private images by specifying a `DockerAuth` in your `Image` configuration. See our DockerAuth SDK reference for more details.
For the full set of options available on the TrainingJob type, check out our SDK reference.
What can I run in the start_commands?
In short, anything! Baseten Training is a framework-agnostic training platform: any training framework and methodology
is supported. Typically, a `run.sh` script is used as the entrypoint. An example might look like this:
run.sh
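The sketch below simply installs the Python dependencies used by the training script and then launches it; the package list and launch command are illustrative, not prescriptive.

```bash
#!/bin/bash
# run.sh -- illustrative launch script; swap in your own framework and dependencies.
set -euxo pipefail

# Install the libraries used by train.py (pin versions for reproducible runs).
pip install --upgrade transformers datasets trl peft wandb

# Launch training. For multi-GPU jobs you might use `accelerate launch` or
# `torchrun --nproc_per_node=<gpu_count>` instead of plain `python`.
python train.py
```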
The training logic itself lives in `train.py`; a sketch of what it might contain is shown below.
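This is a hedged sketch rather than the exact script from the recipe: it assumes a LoRA-based supervised finetune of `openai/gpt-oss-20b` using TRL’s `SFTTrainer`, and the dataset name, hyperparameters, and output path are placeholders to replace with your own.

```python
# train.py -- minimal LoRA SFT sketch; dataset, hyperparameters, and paths are placeholders.
import os

from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

MODEL_ID = "openai/gpt-oss-20b"


def main():
    # Placeholder dataset: swap in the chat-formatted dataset you want to finetune on.
    dataset = load_dataset("HuggingFaceH4/Multilingual-Thinking", split="train")

    training_args = SFTConfig(
        output_dir="/tmp/gpt-oss-20b-finetune",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        num_train_epochs=1,
        logging_steps=10,
        # Log to Weights & Biases only if the API key secret was provided.
        report_to="wandb" if os.environ.get("WANDB_API_KEY") else "none",
    )

    # Note: depending on how the base model is quantized, you may need additional
    # model-loading arguments; this sketch lets SFTTrainer load the model by name.
    trainer = SFTTrainer(
        model=MODEL_ID,
        args=training_args,
        train_dataset=dataset,
        peft_config=LoraConfig(r=8, lora_alpha=16, target_modules="all-linear"),
    )
    trainer.train()

    # Optionally push the finetuned adapter to the Hugging Face Hub.
    if os.environ.get("HF_TOKEN"):
        trainer.push_to_hub()


if __name__ == "__main__":
    main()
```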
Training Different Models
This recipe and more can be found at Baseten’s ML Cookbook. Clone the repo to get the starter code for this demo, along with other training and finetuning examples!
Additional features
We’ve kept the above config simple to help you get off the ground - but there’s a lot more you can do with Baseten Training:
- Checkpointing - automatically save and deploy your model checkpoints.
- Training Cache - speed up training by caching data and models between jobs.
- Multinode Training - train on multiple GPU nodes to make the most out of your compute.
Step 2: Submit Your Training Job
Once your `config.py` and any local artifacts are ready, you submit the training
job using the `truss` CLI:
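Assuming you are in the directory that contains `config.py`, the push typically looks like this (check `truss train --help` for the exact subcommands in your installed version):

```bash
truss train push config.py
```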
Running this command:
- Parses your `config.py`.
- Packages any local files in the directory (and subdirectories) alongside `config.py`.
- Creates or updates the TrainingProject specified in your config.
- Submits the defined TrainingJob under that project.
Once submitted, you can monitor your job from the Baseten UI at https://app.baseten.co/training/.
Next steps
- Basics: Learn about the fundamental building blocks of Baseten Training
- Cache: Speed up your training iterations with persistent caching
- Checkpointing: Manage model checkpoints seamlessly
- Multinode Training: Scale your training across multiple nodes
- Management: Learn how to check status, view logs and metrics, and stop jobs.