Once you have submitted training jobs, you can manage your TrainingProjects and individual TrainingJobs with either the CLI or the API.

TrainingProject Management

  • Listing Projects: To view all your training projects:

    truss train view
    

    This command lists all TrainingProjects you have access to, typically with their names and IDs, along with all active jobs.

  • Viewing Jobs within a Project: To see all jobs associated with a specific project, use its project-id (obtained when creating the project or from truss train view):

    truss train view --project-id <your_project_id>
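
    If you want to call these listing commands from a script, a minimal Python sketch might look like the following. It only assembles the two command lines shown above and, optionally, runs them through subprocess. The helper names view_command and run_view are hypothetical and not part of Truss, and the sketch assumes the truss CLI is installed and on your PATH.

```python
import subprocess
from typing import List, Optional


def view_command(project_id: Optional[str] = None) -> List[str]:
    """Build the argv for `truss train view`, optionally scoped to one project.

    Hypothetical helper: it only mirrors the CLI invocations shown in this section.
    """
    argv = ["truss", "train", "view"]
    if project_id is not None:
        argv += ["--project-id", project_id]
    return argv


def run_view(project_id: Optional[str] = None) -> str:
    """Run the command and return its stdout.

    Assumes the truss CLI is installed and authenticated; raises
    subprocess.CalledProcessError if the command exits with a non-zero status.
    """
    result = subprocess.run(
        view_command(project_id), capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    # Print the command lines instead of running them.
    print(" ".join(view_command()))
    print(" ".join(view_command("<your_project_id>")))
```

    Building the argument list separately from running it keeps the sketch easy to test without a Baseten account.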
    

TrainingJob Management

After submitting a job with truss train push config.py, you receive a project_id and job_id.

  • Listing Jobs: As shown above, you can list all jobs within a project using:

    truss train view --project-id <your_project_id>
    

    The output typically includes job IDs, statuses, and creation times.

  • Checking Status and Retrieving Logs: To view the logs for a specific job, you can tail them in real-time or fetch existing logs.

    • To view logs for the most recently submitted job in the current context (e.g., if you just pushed a job from your current terminal directory):
      truss train logs --tail
      
    • To view logs for a specific job using its job-id:
      truss train logs --job-id <your_job_id> [--tail]
      
      Add --tail to follow the logs live.
  • Understanding Job Statuses: The truss train view and truss train logs commands will help you track which status a job is in. For more on the job lifecycle, see the Lifecycle page.

  • Stopping a TrainingJob: If you need to stop a running job, use the stop command with the job ID, or pass --all to stop every active job:

    truss train stop --job-id <your_job_id>
    truss train stop --all # Stops all active jobs; prompts for confirmation first.
    

    A stopped job transitions to the TRAINING_JOB_STOPPED state.

  • Understanding Job Outputs & Checkpoints:

    • The primary outputs of a successful TrainingJob are model checkpoints (if checkpointing is enabled and configured).
    • These checkpoints are stored by Baseten. Refer to the Checkpointing section in Core Concepts for how CheckpointingConfig works.
    • When you are ready to deploy a model, you will specify which checkpoints to use. The model_name you assign during deployment (via DeployCheckpointsConfig) becomes the identifier for this trained model version derived from your specific job’s checkpoints.
    • You can see the available checkpoints for a job via the Training API.
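
The log and stop commands described in this section can also be assembled programmatically. The sketch below is one possible way to do that in Python. It uses only the flags shown above (--job-id, --tail, --all). The helper names logs_command and stop_command are hypothetical, and treating --job-id and --all as mutually exclusive is an assumption of this sketch rather than documented CLI behavior.

```python
import shlex
from typing import List, Optional


def logs_command(job_id: Optional[str] = None, tail: bool = False) -> List[str]:
    """Build the argv for `truss train logs`.

    With no job_id, the CLI is described as using the most recently
    submitted job in the current context.
    """
    argv = ["truss", "train", "logs"]
    if job_id is not None:
        argv += ["--job-id", job_id]
    if tail:
        argv.append("--tail")  # follow the logs live
    return argv


def stop_command(job_id: Optional[str] = None, stop_all: bool = False) -> List[str]:
    """Build the argv for `truss train stop`.

    Assumption of this sketch: exactly one of job_id or stop_all is given.
    """
    if (job_id is not None) == stop_all:
        raise ValueError("pass either job_id or stop_all=True, but not both")
    argv = ["truss", "train", "stop"]
    if stop_all:
        argv.append("--all")  # the CLI prompts for confirmation
    else:
        argv += ["--job-id", job_id]
    return argv


if __name__ == "__main__":
    # Print shell-quoted command lines instead of executing them.
    print(shlex.join(logs_command("<your_job_id>", tail=True)))
    print(shlex.join(stop_command(job_id="<your_job_id>")))
```

To execute a command, pass the returned list to subprocess.run. Note that truss train stop --all asks for confirmation, so it is better suited to interactive use than to unattended scripts.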