One command gives you a private Slurm cluster on Baseten training infrastructure. truss train workstation --node-count N provisions N full nodes, bootstraps Slurm across them, and prints the SSH command to connect. Every node runs slurmd, the rank-0 node also runs slurmctld as the controller, and each node’s GPUs register as gres automatically. If you already launch distributed work with srun or sbatch, the cluster behaves the way you expect. This page covers what Baseten provisions and the commands to verify and use it.
Slurm only runs on multi-node workstations. A single-node workstation (launched with --gpu-count, or --node-count 1) skips Slurm and holds the container open for SSH, even though --orchestrator defaults to slurm.
For single-node workstations, see SSH access. For non-interactive multi-node training jobs, see Multinode training.

How Baseten builds the cluster

When --node-count is greater than 1, every node runs a Slurm bootstrap at startup:
  1. Each node installs Slurm and munge, then detects its GPUs.
  2. Nodes coordinate through the shared project cache. The node with BT_NODE_RANK=0 generates the munge key, and every node registers its IP, hostname, GPU count, and CPU count under $BT_PROJECT_CACHE_DIR/slurm_workstation/.
  3. Once all BT_GROUP_SIZE nodes register, the controller generates /etc/slurm/slurm.conf and distributes it: cluster name workstation, a single default partition named gpu with no time limit, and each node’s GPUs registered as gres.
  4. The controller starts slurmctld and slurmd; workers start slurmd. The controller is also a compute node, so all N nodes accept work.
Every node ends up with the same slurm.conf and munge key, so Slurm commands work from any node. For the environment variables Baseten injects (BT_NODE_RANK, BT_GROUP_SIZE, BT_PROJECT_CACHE_DIR, and more), see the SDK reference.
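To see the result of the bootstrap on the node you are connected to, something like the following sketch may help. It relies only on the variables and paths named above. The helper name node_role is hypothetical, the layout of the files under slurm_workstation/ is not documented here, and the fallback cache path is an assumption for when the variable is unset.

```shell
#!/bin/bash
# Sketch: summarize what the Slurm bootstrap produced on this node.

# Map a node rank to its Slurm role: rank 0 also hosts slurmctld.
node_role() {
  case "${1:-}" in
    "") echo "unknown" ;;
    0)  echo "controller+compute" ;;
    *)  echo "compute" ;;
  esac
}

echo "rank ${BT_NODE_RANK:-?} of ${BT_GROUP_SIZE:-?}: $(node_role "${BT_NODE_RANK:-}")"

# Registration files the nodes wrote while coordinating
# (their exact layout is an assumption; this only lists them).
reg_dir="${BT_PROJECT_CACHE_DIR:-/root/.cache/user_artifacts}/slurm_workstation"
if [ -d "$reg_dir" ]; then
  ls -l "$reg_dir"
else
  echo "no registration directory at $reg_dir"
fi

# Show the key lines of the generated config, if it exists here.
conf=/etc/slurm/slurm.conf
if [ -r "$conf" ]; then
  grep -E '^(ClusterName|PartitionName|NodeName|GresTypes)' "$conf" || true
else
  echo "no readable $conf on this machine"
fi
```

Because every node ends up with the same slurm.conf, the config lines should be identical whichever node you run this on; only the rank line differs.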

Launch a workstation

First, set up SSH access if you haven’t:
uvx truss ssh setup
Then launch a multi-node workstation:
uvx truss train workstation --node-count 2 --accelerator H100
  • --node-count provisions full nodes, using all GPUs on each. It’s mutually exclusive with --gpu-count, which configures single-node workstations.
  • --accelerator selects the GPU type (H100 by default).
  • --image swaps the base image (default nvidia/cuda:12.8.1-devel-ubuntu24.04). The Slurm bootstrap installs its own packages, so any Debian-based image with your framework preinstalled works.
  • --orchestrator defaults to slurm, the only supported orchestrator.
See the CLI reference for all options. Once the cluster is up, connect using the SSH command printed in the output. The cluster lives until you stop the workstation, and nodes bill per minute while up. Stop it with truss train stop when you finish; that tears down Slurm and releases the nodes.

Verify the cluster

After connecting, confirm the cluster sees every node and GPU:
sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu*         up   infinite      2   idle node-[0-1]
Run a command across all nodes through the scheduler:
srun --nodes=2 hostname
node-0
node-1
Check that each node registered its GPUs:
scontrol show nodes | grep -E "NodeName|Gres"
NodeName=node-0 ...
   Gres=gpu:8
NodeName=node-1 ...
   Gres=gpu:8
To confirm which node your SSH session landed on, run echo $BT_NODE_RANK; rank 0 is the controller.
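The checks above can be combined into one script. The sketch below compares the number of nodes Slurm reports with BT_GROUP_SIZE; the helper names count_unique and nodes_match are hypothetical, and the script assumes a single partition, as described earlier.

```shell
#!/bin/bash
# Sketch: compare the node count Slurm reports with BT_GROUP_SIZE.

# Count distinct non-empty lines on stdin (one node name per line).
count_unique() {
  sort -u | grep -c .
}

# Succeed when the observed node count equals the expected one.
nodes_match() {
  [ "$1" -eq "$2" ]
}

if command -v sinfo >/dev/null 2>&1; then
  # -h drops the header, -N lists one node per line, %N prints the node name.
  seen=$(sinfo -h -N -o "%N" | count_unique)
  expected="${BT_GROUP_SIZE:-0}"
  if nodes_match "$seen" "$expected"; then
    echo "OK: $seen of $expected nodes registered"
  else
    echo "MISMATCH: Slurm sees $seen nodes, expected $expected" >&2
  fi
  # Show the GPU registration for each node.
  scontrol show nodes | grep -E "NodeName|Gres"
else
  echo "sinfo not found; run this inside the workstation"
fi
```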

Run distributed work

The project cache directory is shared across all nodes. Put your code, data, and outputs there so every rank sees the same files:
cd $BT_PROJECT_CACHE_DIR
git clone https://github.com/basetenlabs/ml-cookbook.git
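To check that the cache really is shared before relying on it, one option is to write a marker file and read it back from every node. This is a minimal sketch: write_marker and the marker filename are hypothetical, and the temporary-directory fallback exists only so the script can run outside a workstation.

```shell
#!/bin/bash
# Sketch: confirm every node sees the same file in the project cache.

# Write a marker file into the given directory and print its path.
write_marker() {
  local dir="$1"
  local marker="$dir/.shared_cache_check.$$"
  echo "written by $(hostname)" > "$marker"
  echo "$marker"
}

cache_dir="${BT_PROJECT_CACHE_DIR:-$(mktemp -d)}"
marker=$(write_marker "$cache_dir")

if command -v srun >/dev/null 2>&1; then
  # One task per node; each should print the same line if the cache is shared.
  srun --nodes="${BT_GROUP_SIZE:-1}" --ntasks-per-node=1 cat "$marker"
else
  cat "$marker"
fi

rm -f "$marker"
```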
Launch interactively with srun:
srun --nodes=2 --ntasks-per-node=1 --gres=gpu:$BT_NUM_GPUS python train.py
Or create pretrain.sbatch on the shared cache:
#!/bin/bash
#SBATCH --job-name=pretrain
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --chdir=/root/.cache/user_artifacts
#SBATCH --output=%x-%j.log

srun python train.py
#SBATCH lines don’t expand environment variables, so the --chdir directive uses the literal path that $BT_PROJECT_CACHE_DIR resolves to, /root/.cache/user_artifacts. The per-node GPU count is passed on the command line instead, where $BT_NUM_GPUS expands to the GPUs per node for your accelerator. Submit the job and track it:
sbatch --gres=gpu:$BT_NUM_GPUS pretrain.sbatch
Submitted batch job 1
squeue
JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
    1       gpu pretrain root  R  0:07     2 node-[0-1]
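When scripting submissions, it can be convenient to capture the job ID and work out the log path from the %x-%j.log pattern. The sketch below assumes a single cluster, so sbatch --parsable prints only the job ID. It passes --chdir on the command line, where the shell expands $BT_PROJECT_CACHE_DIR before sbatch sees it. The helper log_name is hypothetical, and the fallback of 8 GPUs is an assumption taken from the sample output on this page.

```shell
#!/bin/bash
# Sketch: submit pretrain.sbatch, show its queue entry, and print its log path.

# Expand the "%x-%j.log" output pattern: %x is the job name, %j the job ID.
log_name() {
  echo "$1-$2.log"
}

cache_dir="${BT_PROJECT_CACHE_DIR:-/root/.cache/user_artifacts}"

if command -v sbatch >/dev/null 2>&1; then
  # --parsable prints just the job ID. Command-line options take precedence
  # over #SBATCH directives, so --chdir here can use the expanded variable.
  job_id=$(sbatch --parsable --chdir="$cache_dir" \
                  --gres="gpu:${BT_NUM_GPUS:-8}" pretrain.sbatch)
  squeue -j "$job_id"
  echo "log: $cache_dir/$(log_name pretrain "$job_id")"
else
  echo "sbatch not found; run this inside the workstation"
fi
```

Once the job starts, tail -f on the printed path follows the output from any node, since the log lands in the shared cache.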
Slurm sets the usual SLURM_* environment variables (SLURM_NODEID, SLURM_NTASKS, SLURM_JOB_NODELIST), so you can pass the topology to distributed launchers like torchrun the standard way. For job arrays, dependencies, and everything beyond launching, see the Slurm documentation.
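As one illustration of passing the Slurm topology to torchrun, the sketch below is a per-node wrapper script. The filename launch.sh, the helper torchrun_args, and port 29500 are assumptions, as is the expectation that node hostnames resolve between nodes and that train.py sits in the working directory. It is not a Baseten-provided launcher.

```shell
#!/bin/bash
# launch.sh (hypothetical name): run once per node through srun.

# Build torchrun arguments from the node count, this node's index,
# the GPUs per node, and the head node's hostname.
torchrun_args() {
  local nnodes="$1" node_rank="$2" gpus="$3" head="$4"
  echo "--nnodes=$nnodes --node_rank=$node_rank --nproc_per_node=$gpus" \
       "--master_addr=$head --master_port=29500"
}

if [ -n "${SLURM_JOB_ID:-}" ]; then
  # The first hostname in the allocation acts as the rendezvous host.
  head=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
  # Left unquoted on purpose so the argument string splits into words.
  torchrun $(torchrun_args "$SLURM_NNODES" "$SLURM_NODEID" "${BT_NUM_GPUS:-8}" "$head") train.py
else
  echo "not inside a Slurm job; example arguments:"
  torchrun_args 2 0 8 node-0
fi
```

A possible invocation is srun --nodes=2 --ntasks-per-node=1 --gres=gpu:$BT_NUM_GPUS bash launch.sh, which starts one torchrun per node and lets torchrun spawn one worker per GPU.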

Checkpoints and the shared cache

Workstations support the same storage as training jobs:
  • The shared cache mounts on every node and persists across workstation restarts within a project. See Cache.
  • Pass --enable-checkpointing (with optional --checkpoint-path and --checkpoint-volume-size) to mount checkpoint storage, and --checkpoint-from-job to load the latest checkpoint from a previous job. See Checkpoints.

Notes and limits

  • Everything runs as root, and there is one partition. The bootstrap regenerates slurm.conf on every start, so manual edits don’t survive a restart.
  • Multi-node workstations always allocate full nodes; there is no fractional multi-node sizing.

Next steps

Once your training script behaves across nodes, the same project can run it as a non-interactive multi-node training job, with the cache and checkpoints carrying over.

SSH access

Single-node workstations and direct SSH connections.

VS Code & Cursor

Attach your IDE to a workstation with remote tunnels.

Multinode training

Non-interactive distributed training jobs.

CLI reference

All truss train workstation options.