Getting Started
Your first steps to creating and running training jobs on Baseten.
This guide will walk you through the initial setup and the process of submitting your first TrainingJob using Baseten Training.
Prerequisites
Before you begin, ensure you have the following:
- Baseten Account: You’ll need an active Baseten account. If you don’t have one, please sign up on the Baseten web app.
- API Key: Obtain an API key for your Baseten account. This key is required to authenticate with the Baseten API and SDK.
- Truss SDK and CLI: The truss package provides a Python-native way to define and run your training jobs, and the CLI provides a convenient way to deploy and manage them. Install or update it with pip install --upgrade truss.
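The prerequisites above can be checked with a short script before you go further. This is a minimal sketch using only the standard library. The environment variable name BASETEN_API_KEY is a hypothetical convention chosen for this example, not something the truss package is stated to read.

```python
import os
from importlib import metadata

# Hypothetical variable name used only in this sketch; keep your key wherever
# your own tooling expects it.
API_KEY_ENV_VAR = "BASETEN_API_KEY"


def check_prerequisites(env=None):
    """Report whether the truss package is installed and an API key is set."""
    env = os.environ if env is None else env
    try:
        truss_version = metadata.version("truss")
    except metadata.PackageNotFoundError:
        truss_version = None  # truss is not installed in this environment
    return {
        "truss_version": truss_version,
        "api_key_present": bool(env.get(API_KEY_ENV_VAR)),
    }


print(check_prerequisites())
```

A truss_version of None means the package still needs to be installed. A false api_key_present only means the key is not in that particular variable.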
Step 1: Define your Training Configuration
The primary way to define your training jobs is through a Python configuration file, typically named config.py. This file uses the truss package to specify all aspects of your TrainingProject and TrainingJob.
A simple example of a config.py file is shown below:
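A minimal sketch of such a config.py might look like the following. TrainingProject, TrainingJob, SecretReference, and start_commands are named in this guide. The module paths (truss_train.definitions, truss.base.truss_config), the Runtime, Compute, Image, and AcceleratorSpec classes, and their field names are assumptions that should be verified against the SDK reference. The base image, accelerator choice, and project name are placeholders.

```python
# config.py: a sketch, not a definitive configuration.
# Assumed module paths; verify them against the SDK reference.
from truss_train import definitions
from truss.base import truss_config

# Placeholder image; use one that contains your training dependencies.
BASE_IMAGE = "pytorch/pytorch:latest"

# Runtime: the commands to run and the environment they run in.
training_runtime = definitions.Runtime(
    start_commands=["/bin/sh -c './run.sh'"],
    environment_variables={
        # Secrets must already exist in your Baseten workspace settings.
        "HF_TOKEN": definitions.SecretReference(name="hf_access_token"),
        "WANDB_API_KEY": definitions.SecretReference(name="wandb_api_key"),
    },
)

# Compute: the accelerator type and count are illustrative choices.
training_compute = definitions.Compute(
    accelerator=truss_config.AcceleratorSpec(
        accelerator=truss_config.Accelerator.H100,
        count=1,
    ),
)

# The job ties together image, compute, and runtime.
training_job = definitions.TrainingJob(
    image=definitions.Image(base_image=BASE_IMAGE),
    compute=training_compute,
    runtime=training_runtime,
)

# The project the job is submitted under (hypothetical name).
training_project = definitions.TrainingProject(
    name="my-first-training-project",
    job=training_job,
)
```

The start command assumes a run.sh file sits next to config.py, so that it is packaged with the job.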
Key considerations for your Baseten Training configuration file
- Local Artifacts: If your training requires local scripts (like a train.py or a run.sh), helper files, or configuration files (e.g., accelerate config), place them in the same directory as your config.py or in subdirectories. When you push the training job, truss will package these artifacts and upload them. They will be copied into the container at the root of the base image’s working directory.
- Secrets: Ensure any secrets referenced via SecretReference (e.g., hf_access_token, wandb_api_key) are defined in your Baseten workspace settings.
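To preview which local artifacts sit alongside a config file, the directory can be walked before pushing. This is a rough sketch under the assumption that every file in the config directory and its subdirectories is a candidate. It does not reproduce any ignore rules or size limits that truss itself may apply, and list_local_artifacts is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def list_local_artifacts(config_path):
    """List files in config.py's directory and subdirectories, relative to it."""
    root = Path(config_path).resolve().parent
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file()
    )


# Demonstration on a throwaway directory laid out like a small training project.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "config.py").write_text("# training config\n")
    (project / "run.sh").write_text("#!/bin/sh\n")
    (project / "configs").mkdir()
    (project / "configs" / "accelerate.yaml").write_text("{}\n")
    print(list_local_artifacts(project / "config.py"))
    # ['config.py', 'configs/accelerate.yaml', 'run.sh']
```

Running it against your real config.py is a quick way to spot large datasets or stray files you did not intend to upload.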
For a complete guide on the TrainingJob type, check out our SDK reference.
What can I run in the start_commands?
In short, anything! Baseten Training is a framework-agnostic training platform. Any training framework and training methodology is supported. Typically, a run.sh script is used. An example might look like this:
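Because any command works, the entry point does not have to be a shell script. As a Python counterpart to a run.sh, the sketch below runs a list of steps in order and stops at the first failure, much as a shell script with set -e would. The step list (installing from a requirements.txt, then running train.py) and the helper name run_steps are hypothetical. Substitute whatever your training actually needs.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each command in order; return the first non-zero exit code, else 0."""
    for step in steps:
        print("+ " + " ".join(step), flush=True)  # echo the command, like `set -x`
        result = subprocess.run(step)
        if result.returncode != 0:
            return result.returncode  # stop here so later steps do not run
    return 0


# Hypothetical steps for a typical job: install dependencies, then train.
STEPS = [
    [sys.executable, "-m", "pip", "install", "-r", "requirements.txt"],
    [sys.executable, "train.py"],
]
```

Saved as, say, launch.py with a final line of sys.exit(run_steps(STEPS)), it could be invoked from start_commands as python launch.py, so that a failed step makes the command exit with a non-zero code.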
Additional features
We’ve kept the above config simple to help you get off the ground, but there’s a lot more you can do with Baseten Training:
- Checkpointing - automatically save and deploy your model checkpoints.
- Training Cache - speed up training by caching data and models between jobs.
- Multinode - train on multiple GPU nodes to make the most out of your compute.
Step 2: Submit Your Training Job
Once your config.py and any local artifacts are ready, you submit the training job using the truss CLI:
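For scripted submission (for example from CI), the CLI call can be assembled in Python. This sketch assumes the subcommand is truss train push followed by the config path; confirm the exact form with truss train --help. The helper name build_push_command is hypothetical.

```python
import shlex
from pathlib import Path


def build_push_command(config_path="config.py"):
    """Build the CLI invocation that submits the job defined in config_path."""
    path = Path(config_path)
    if path.suffix != ".py":
        raise ValueError(f"expected a Python config file, got {path.name!r}")
    # Assumed subcommand; verify against the CLI's own help output.
    return ["truss", "train", "push", str(path)]


command = build_push_command("config.py")
print(shlex.join(command))  # truss train push config.py
```

In an environment where truss is installed and authenticated, the list can be passed to subprocess.run(command, check=True), or the printed line can be run directly in a terminal.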
This command does the following:
- Parses your config.py.
- Packages any local files in the directory (and subdirectories) alongside config.py.
- Creates or updates the TrainingProject specified in your config.
- Submits the defined TrainingJob under that project.
Upon successful submission, the CLI prints helpful information about your job, including its Job ID.
Keep the Job ID handy, as you’ll use it for managing and monitoring your job.
Next Steps
- Core Concepts: Deepen your understanding of Baseten Training and explore key features like CheckpointingConfig, Training Cache, and Multinode.
- Management: Learn how to check status, view logs and metrics, and stop jobs.