Training is currently in beta. To unlock, please request access.

Once you have submitted training jobs, Baseten provides tools to manage your TrainingProjects and individual TrainingJobs. You can use the CLI or the API to manage your jobs.

TrainingProject management

  • Listing Projects: To view all your training projects:

    truss train view
    

    This command lists all TrainingProjects you have access to, typically showing their names and IDs, along with all active jobs.

  • Viewing Jobs within a Project: To see all jobs associated with a specific project, use its project-id (obtained when creating the project or from truss train view):

    truss train view --project-id <your_project_id>
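
If you want to script these listings, the two commands above can be assembled programmatically. The following is a minimal sketch, not part of Truss: the helper names `view_command` and `run` are hypothetical, and it assumes the `truss` CLI is installed and on your PATH. It uses only the `--project-id` flag shown on this page.

```python
import shlex
import subprocess
from typing import List, Optional


def view_command(project_id: Optional[str] = None) -> List[str]:
    """Build the argv for `truss train view` (hypothetical helper).

    With no project_id, this lists all projects and active jobs;
    with one, it lists the jobs in that project.
    """
    argv = ["truss", "train", "view"]
    if project_id is not None:
        argv += ["--project-id", project_id]
    return argv


def run(argv: List[str]) -> int:
    """Echo the command, run it with output going to the terminal,
    and return its exit code. Assumes `truss` is on PATH."""
    print("$ " + shlex.join(argv))
    return subprocess.run(argv).returncode
```

For example, `run(view_command())` lists every project, and `run(view_command("<your_project_id>"))` lists the jobs in one project. Building the argument list separately from running it keeps the command easy to inspect or log before anything is executed.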
    

TrainingJob management

After submitting a job with truss train push config.py, you receive a project_id and job_id.

  • Listing Jobs: As shown above, you can list all jobs within a project using:

    truss train view --project-id <your_project_id>
    

    This will typically show job IDs, statuses, creation times, etc.

  • Checking Status and Retrieving Logs: To view the logs for a specific job, you can either tail them in real time or fetch the logs produced so far.

    • To view logs for the most recently submitted job in the current context (e.g., if you just pushed a job from your current terminal directory):
      truss train logs --tail
      
    • To view logs for a specific job using its job-id:
      truss train logs --job-id <your_job_id> [--tail]
      
      Add --tail to follow the logs live.
  • Understanding Job Statuses: The truss train view and truss train logs commands will help you track which status a job is in. For more on the job lifecycle, see the Lifecycle page.

  • Stopping a TrainingJob: If you need to stop a running job, use the stop command with the job’s ID, or pass --all to stop every active job:

    truss train stop --job-id <your_job_id>
    truss train stop --all  # Stops all active jobs; prompts for confirmation first.
    

    This will transition the job to the TRAINING_JOB_STOPPED state.

  • Understanding Job Outputs & Checkpoints:

    • The primary outputs of a successful TrainingJob are model checkpoints (if checkpointing is enabled and configured).
    • These checkpoints are stored by Baseten. Refer to the Checkpointing section in Core Concepts for how CheckpointingConfig works.
    • When you are ready to deploy a model, you specify which checkpoints to use. The model_name you assign during deployment (via DeployCheckpointsConfig) identifies the trained model version built from that job’s checkpoints.
    • You can see the available checkpoints for a job via the Training API.
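
The log and stop commands described in this section can be wrapped the same way for use in scripts. This is a minimal sketch under stated assumptions: `logs_command` and `stop_command` are hypothetical helpers, not part of Truss, and they emit only the flags shown on this page (`--job-id`, `--tail`, `--all`). Treating `--job-id` and `--all` as mutually exclusive is a choice made by this sketch, not documented CLI behavior.

```python
from typing import List, Optional


def logs_command(job_id: Optional[str] = None, tail: bool = False) -> List[str]:
    """Build the argv for `truss train logs` (hypothetical helper).

    Without a job_id, the CLI is described as targeting the most
    recently submitted job in the current context.
    """
    argv = ["truss", "train", "logs"]
    if job_id is not None:
        argv += ["--job-id", job_id]
    if tail:
        argv.append("--tail")  # follow the logs live
    return argv


def stop_command(job_id: Optional[str] = None, stop_all: bool = False) -> List[str]:
    """Build the argv for `truss train stop` (hypothetical helper).

    Exactly one of job_id or stop_all must be given; this restriction
    is an assumption of the sketch, not a documented CLI rule.
    """
    if stop_all == (job_id is not None):
        raise ValueError("pass exactly one of job_id or stop_all=True")
    argv = ["truss", "train", "stop"]
    if stop_all:
        argv.append("--all")  # the CLI prompts for confirmation
    else:
        argv += ["--job-id", job_id]
    return argv
```

The resulting lists can be passed to `subprocess.run`. Note that `truss train stop --all` prompts for confirmation, so it needs an interactive terminal unless you handle that prompt yourself.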