truss train workstation --node-count N provisions N full nodes, bootstraps Slurm across them, and prints the SSH command to connect. Every node runs slurmd, the rank-0 node also runs slurmctld as the controller, and each node’s GPUs register as gres automatically.
If you already launch distributed work with srun or sbatch, the cluster behaves the way you expect. This page covers what Baseten provisions and the commands to verify and use it.
Slurm only runs on multi-node workstations. A single-node workstation (launched with --gpu-count, or --node-count 1) skips Slurm and holds the container open for SSH, even though --orchestrator defaults to slurm.
How Baseten builds the cluster
When --node-count is greater than 1, every node runs a Slurm bootstrap at startup:
- Each node installs Slurm and munge, then detects its GPUs.
- Nodes coordinate through the shared project cache. The node with BT_NODE_RANK=0 generates the munge key, and every node registers its IP, hostname, GPU count, and CPU count under $BT_PROJECT_CACHE_DIR/slurm_workstation/.
- Once all BT_GROUP_SIZE nodes register, the controller generates /etc/slurm/slurm.conf and distributes it: cluster name workstation, a single default partition named gpu with no time limit, and each node’s GPUs registered as gres.
- The controller starts slurmctld and slurmd; workers start slurmd. The controller is also a compute node, so all N nodes accept work.
All nodes end up with the same slurm.conf and munge key, so Slurm commands work from any node. For the environment variables Baseten injects (BT_NODE_RANK, BT_GROUP_SIZE, BT_PROJECT_CACHE_DIR, and more), see the SDK reference.
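To see what the bootstrap produced, you can inspect the registration directory and the generated config from any node. This is a minimal sketch that assumes only the paths named above. The files inside slurm_workstation/ are not documented here, so the first step is just a directory listing, and the /root/.cache/user_artifacts fallback applies only when BT_PROJECT_CACHE_DIR is unset.

```shell
# Paths taken from the bootstrap description above.
REG_DIR="${BT_PROJECT_CACHE_DIR:-/root/.cache/user_artifacts}/slurm_workstation"
CONF=/etc/slurm/slurm.conf

# Per-node registration data (file names and layout are not documented here).
if [ -d "$REG_DIR" ]; then
  ls -l "$REG_DIR"
else
  echo "no registration directory at $REG_DIR"
fi

# Cluster name, partition, and per-node lines from the generated config.
if [ -r "$CONF" ]; then
  grep -E '^(ClusterName|PartitionName|NodeName)' "$CONF"
else
  echo "no $CONF found; single-node workstations skip Slurm"
fi
```

Because the bootstrap regenerates the config on every start, treat this as a read-only check rather than a place to make changes.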
Launch a workstation
First, set up SSH access if you haven’t. Then launch the workstation with truss train workstation. The main flags:
- --node-count provisions full nodes, using all GPUs on each. It’s mutually exclusive with --gpu-count, which configures single-node workstations.
- --accelerator selects the GPU type (H100 by default).
- --image swaps the base image (default nvidia/cuda:12.8.1-devel-ubuntu24.04). The Slurm bootstrap installs its own packages, so any Debian-based image with your framework preinstalled works.
- --orchestrator defaults to slurm, the only supported orchestrator.
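The flags above combine into a single launch command. The sketch below only assembles and prints that command so you can review it before provisioning. The node count of 2 is an arbitrary example, and passing the accelerator as the literal value H100 is an assumption based on the default named above; check the CLI reference for the accepted values.

```shell
# Example values; adjust to your project.
NODE_COUNT=2                                  # arbitrary example, must be > 1 for Slurm
ACCELERATOR=H100                              # assumed spelling of the default GPU type
IMAGE=nvidia/cuda:12.8.1-devel-ubuntu24.04    # default base image named above

LAUNCH_CMD="truss train workstation --node-count $NODE_COUNT --accelerator $ACCELERATOR --image $IMAGE"

# Print the command for review; this does not provision anything.
echo "$LAUNCH_CMD"

# Uncomment to launch once the values look right:
# eval "$LAUNCH_CMD"
```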
Run truss train stop when you finish; that tears down Slurm and releases the nodes.
Verify the cluster
After connecting, confirm the cluster sees every node and GPU.
To check which node you are on, run echo $BT_NODE_RANK; rank 0 is the controller.
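One way to confirm that every node and GPU is visible is with standard Slurm commands. This is a sketch rather than the exact commands Baseten documents. It assumes BT_GROUP_SIZE holds the node count, and the fallback of 2 is only there so the snippet runs when the variable is unset.

```shell
EXPECTED_NODES="${BT_GROUP_SIZE:-2}"   # fallback of 2 is an assumption for dry runs

if command -v sinfo >/dev/null 2>&1; then
  # One line per node, with partition and state.
  sinfo -N -l
  # Node names and the GPUs each one registered as gres.
  scontrol show nodes | grep -E 'NodeName=|Gres='
  # Run one task per node; expect one hostname per node.
  srun --nodes="$EXPECTED_NODES" --ntasks-per-node=1 hostname
else
  echo "sinfo not found; single-node workstations skip Slurm"
fi
```

If the srun step prints fewer hostnames than expected, check the node states in the sinfo output first.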
Run distributed work
The project cache directory is shared across all nodes. Put your code, data, and outputs there so every rank sees the same files. Launch commands across the nodes with srun.
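As an illustration, the following sketch runs one task per node from the shared cache directory. It assumes BT_GROUP_SIZE and BT_NUM_GPUS are set as described on this page; the fallback values (2 nodes, 8 GPUs) are assumptions so the snippet can be dry-run elsewhere, and the per-task command is just a placeholder for your own script.

```shell
CACHE_DIR="${BT_PROJECT_CACHE_DIR:-/root/.cache/user_artifacts}"
NODES="${BT_GROUP_SIZE:-2}"    # assumed fallback
GPUS="${BT_NUM_GPUS:-8}"       # assumed fallback

# One task per node, all GPUs on each node, working directory on the shared cache.
# Left unquoted below on purpose so it splits into separate options
# (this assumes the cache path contains no spaces).
SRUN_OPTS="--nodes=$NODES --ntasks-per-node=1 --gres=gpu:$GPUS --chdir=$CACHE_DIR"

if command -v srun >/dev/null 2>&1; then
  # Placeholder task: print where each rank runs and which GPUs it sees.
  srun $SRUN_OPTS sh -c 'echo "$(hostname): $(pwd)"; nvidia-smi -L'
else
  echo "dry run: srun $SRUN_OPTS <your command>"
fi
```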
For batch jobs, write a script such as pretrain.sbatch on the shared cache.
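A batch script for this setup might look like the sketch below, which writes pretrain.sbatch into the shared cache (or the current directory when BT_PROJECT_CACHE_DIR is unset). The job name, output file pattern, port 29500, and train.py are hypothetical placeholders, and the torchrun rendezvous settings are one common pattern rather than something this page prescribes.

```shell
SCRIPT="${BT_PROJECT_CACHE_DIR:-.}/pretrain.sbatch"

# Quoted 'EOF' keeps every $VAR literal so it expands when the job runs, not now.
cat > "$SCRIPT" <<'EOF'
#!/bin/bash
#SBATCH --job-name=pretrain
#SBATCH --partition=gpu
#SBATCH --ntasks-per-node=1
#SBATCH --chdir=/root/.cache/user_artifacts
#SBATCH --output=pretrain-%j.out

# The first hostname in the allocation acts as the rendezvous host.
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# One torchrun launcher per node; torchrun starts one worker per GPU.
srun --gres="gpu:$BT_NUM_GPUS" torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node="$BT_NUM_GPUS" \
  --rdzv_id="$SLURM_JOB_ID" \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$head_node:29500" \
  train.py
EOF

echo "wrote $SCRIPT"
```

The script leaves out node and GPU counts on purpose. Those can be supplied when submitting, for example with sbatch's --nodes and --gres options.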
#SBATCH lines don’t expand environment variables, so the --chdir directive uses the literal cache path, /root/.cache/user_artifacts, which is what $BT_PROJECT_CACHE_DIR resolves to. The per-node GPU count is passed on the command line instead, where $BT_NUM_GPUS expands to the GPUs per node for your accelerator. Submit the job and track it with the usual Slurm commands.
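Submission and tracking could look like the following sketch. It assumes a script named pretrain.sbatch already exists in the shared cache, and the fallback values for node and GPU counts are assumptions used only when the Baseten variables are unset.

```shell
CACHE_DIR="${BT_PROJECT_CACHE_DIR:-/root/.cache/user_artifacts}"
NODES="${BT_GROUP_SIZE:-2}"    # assumed fallback
GPUS="${BT_NUM_GPUS:-8}"       # assumed fallback

# Counts go on the command line because the shell expands variables here,
# while #SBATCH lines inside the script are read literally.
SBATCH_OPTS="--nodes=$NODES --gres=gpu:$GPUS"

if command -v sbatch >/dev/null 2>&1; then
  # --parsable prints just the job ID, which makes it easy to capture.
  JOB_ID=$(sbatch --parsable $SBATCH_OPTS "$CACHE_DIR/pretrain.sbatch")
  echo "submitted job $JOB_ID"
  # Queue state for this job.
  squeue -j "$JOB_ID"
  # Job state and the path of its output file.
  scontrol show job "$JOB_ID" | grep -E 'JobState|StdOut'
else
  echo "dry run: sbatch $SBATCH_OPTS $CACHE_DIR/pretrain.sbatch"
fi
```

To stop a job early, scancel followed by the job ID cancels it.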
Each task gets the standard SLURM_* environment variables (SLURM_NODEID, SLURM_NTASKS, SLURM_JOB_NODELIST), so distributed launchers like torchrun pick up the topology the standard way. For job arrays, dependencies, and everything beyond launching, see the Slurm documentation.
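To see these variables from inside a job step, a small sketch like this prints them from one task per node. The single quotes matter: they defer expansion until the command runs inside each task. The fallback node count of 2 is an assumption for dry runs.

```shell
NODES="${BT_GROUP_SIZE:-2}"    # assumed fallback

# Single-quoted so the SLURM_* variables expand inside each task, not in this shell.
TASK_CMD='echo "host=$(hostname) node=$SLURM_NODEID ntasks=$SLURM_NTASKS nodelist=$SLURM_JOB_NODELIST"'

if command -v srun >/dev/null 2>&1; then
  srun --nodes="$NODES" --ntasks-per-node=1 sh -c "$TASK_CMD"
else
  echo "dry run: srun --nodes=$NODES --ntasks-per-node=1 sh -c '<task command>'"
fi
```

Each node should report a different SLURM_NODEID and the same SLURM_NTASKS and SLURM_JOB_NODELIST.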
Checkpoints and the shared cache
Workstations support the same storage as training jobs:
- The shared cache mounts on every node and persists across workstation restarts within a project. See Cache.
- Pass --enable-checkpointing (with optional --checkpoint-path and --checkpoint-volume-size) to mount checkpoint storage, and --checkpoint-from-job to load the latest checkpoint from a previous job. See Checkpoints.
Notes and limits
- Everything runs as root, and there is one partition. The bootstrap regenerates slurm.conf on every start, so manual edits don’t survive a restart.
- Multi-node workstations always allocate full nodes; there is no fractional multi-node sizing.
Next steps
Once your training script behaves across nodes, the same project can run it as a non-interactive multi-node training job, with the cache and checkpoints carrying over.
- SSH access: Single-node workstations and direct SSH connections.
- VS Code & Cursor: Attach your IDE to a workstation with remote tunnels.
- Multinode training: Non-interactive distributed training jobs.
- CLI reference: All truss train workstation options.