Understanding the conceptual framework of Baseten Training for effective model development.
TrainingProjects
A TrainingProject is a lightweight organization tool to help you group different TrainingJobs together. While there are a few technical details to consider, your team can use TrainingProjects to facilitate collaboration and organization.
TrainingJob
Once you have a TrainingProject, the actual work of training a model happens within a TrainingJob. Each TrainingJob represents a single, complete execution of your training script with a specific configuration.
A TrainingJob is the fundamental unit of execution. It bundles together:
- An image.
- The compute resources needed to run the job.
- The runtime configurations, like startup commands and environment variables.

You can create new TrainingJobs while knowing that previous ones have been persisted on Baseten.
A job moves through several states, from being created (TRAINING_JOB_CREATED), to resources being set up (TRAINING_JOB_DEPLOYING), to actively running your script (TRAINING_JOB_RUNNING), and finally to a terminal state like TRAINING_JOB_COMPLETED. More details on the job lifecycle can be found on the Lifecycle page.
Runtime
The cache is mounted at /root/.cache/user_artifacts, which can be accessed via the $BT_RW_CACHE_DIR environment variable.
Prefer /root/.cache/user_artifacts for cached data. However, if you need to access data mounted to /root/.cache/huggingface for compatibility reasons, you can set enable_legacy_hf_cache=True in your CacheConfig. Note that this legacy option is not recommended for new projects.
Checkpointing
Baseten Training provides seamless storage for checkpoints and a jumping-off point for inference and evaluation.
To enable checkpointing, add a CheckpointingConfig to the Runtime and set enabled to True. The $BT_CHECKPOINT_DIR environment variable is then set in your job's environment. Ensure your code writes checkpoints to $BT_CHECKPOINT_DIR.
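On the training-script side, the two directories described above can be picked up from the environment. The following is a minimal sketch, not Baseten's own code: the helper names resolve_dir and save_checkpoint, the step-<n>/state.json layout, and the local fallback paths are all assumptions made for illustration. Only the variable names $BT_CHECKPOINT_DIR and $BT_RW_CACHE_DIR come from this page.

```python
import json
import os
import tempfile
from pathlib import Path


def resolve_dir(env_var: str, fallback: Path) -> Path:
    """Return the directory named by env_var, or fallback when it is unset or empty."""
    path = Path(os.environ.get(env_var) or fallback)
    path.mkdir(parents=True, exist_ok=True)
    return path


def save_checkpoint(checkpoint_dir: Path, step: int, state: dict) -> Path:
    """Write one checkpoint as JSON under checkpoint_dir/step-<n>/ (hypothetical layout)."""
    step_dir = checkpoint_dir / f"step-{step:06d}"
    step_dir.mkdir(parents=True, exist_ok=True)
    target = step_dir / "state.json"
    tmp = target.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(target)  # rename into place so a partial file is never left behind
    return target


if __name__ == "__main__":
    # Fallbacks keep the script runnable on a laptop, where the variables are not set.
    local_root = Path(tempfile.gettempdir()) / "bt_local"
    checkpoint_dir = resolve_dir("BT_CHECKPOINT_DIR", local_root / "checkpoints")
    cache_dir = resolve_dir("BT_RW_CACHE_DIR", local_root / "cache")

    for step in (100, 200):
        path = save_checkpoint(checkpoint_dir, step, {"step": step, "loss": 1.0 / step})
        print(f"wrote {path}")
    print(f"cache directory: {cache_dir}")
```

A real training loop would write framework-specific files (model weights, optimizer state) instead of JSON. The part that matters here is that every checkpoint path is derived from $BT_CHECKPOINT_DIR rather than hard-coded.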
To train across multiple nodes, configure the Compute resource in your TrainingJob by setting node_count to the number of nodes you'd like to use (e.g. 2).
SecretReference
Use SecretReference for secure handling of secrets. Refer to each stored secret through a SecretReference. The actual secret value is never exposed in your code.
A TrainingJob with checkpointing enabled produces one or more model artifacts. Run truss train deploy_checkpoint to deploy a model from your most recent training job. You can read more about this at Deploying Trained Models.
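As a recap, the way these concepts nest can be pictured with plain dataclasses. This is an illustrative model only, not the Baseten SDK: the class names Project, Job, ComputeSpec, and RuntimeSpec, the field names other than node_count, and the example image and command strings are all hypothetical stand-ins chosen for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeSpec:
    """Stand-in for compute resources; node_count is the only field named on this page."""
    node_count: int = 1


@dataclass
class RuntimeSpec:
    """Stand-in for runtime configuration (field names are assumptions)."""
    start_commands: list[str] = field(default_factory=list)
    environment_variables: dict[str, str] = field(default_factory=dict)
    checkpointing_enabled: bool = False


@dataclass
class Job:
    """One complete execution of a training script: image + compute + runtime."""
    image: str
    compute: ComputeSpec
    runtime: RuntimeSpec

    def produces_artifacts(self) -> bool:
        # Per the text, a job with checkpointing enabled produces model artifacts.
        return self.runtime.checkpointing_enabled


@dataclass
class Project:
    """Lightweight grouping of related jobs."""
    name: str
    jobs: list[Job] = field(default_factory=list)


# Two experiments in one project: a single-node baseline and a two-node run
# with checkpointing turned on.
baseline = Job("example/image:latest", ComputeSpec(), RuntimeSpec(["python train.py"]))
scaled = Job(
    "example/image:latest",
    ComputeSpec(node_count=2),
    RuntimeSpec(["python train.py"], checkpointing_enabled=True),
)
project = Project("example-project", [baseline, scaled])
print([job.produces_artifacts() for job in project.jobs])
```

Each variation is a separate job inside the same project, and only the job with checkpointing enabled would have artifacts available to deploy.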