TrainingClient

TrainingClient talks directly to a dp_worker instance. Long-running operations use a submit-and-retrieve protocol: the submit fires immediately on the calling thread (so validation errors surface at call time), and .result() long-polls the server until the operation finishes. You can submit multiple operations before awaiting any of them.

Construct one with ServiceClient.create_lora_training_client.

The following example runs one training step and saves a checkpoint. Each long-running call is submit-then-.result():

from baseten.loops import Datum, ModelInput, TensorData, AdamParams

# tokens and targets come from tokenizing a masked prompt/answer pair;
# see the quickstart for the full tokenization step.
datum = Datum(
    model_input=ModelInput.from_ints(tokens),
    loss_fn_inputs={"target_tokens": TensorData(data=targets, dtype="int64", shape=[len(targets)])},
)

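# training_client is the TrainingClient returned by
# ServiceClient.create_lora_training_client.
# Accumulate gradients for the batch, then apply them with Adam.
fb = training_client.forward_backward(data=[datum]).result(timeout=600.0)
training_client.optim_step(AdamParams(learning_rate=4e-5)).result(timeout=600.0)
# Persist a checkpoint named "step-1".
save_resp = training_client.save_state(name="step-1").result(timeout=600.0)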
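
The submit-and-retrieve pattern can be illustrated without a live trainer. The following is a minimal sketch, not the SDK's implementation: ToyServer and ToyFuture are hypothetical stand-ins that mimic validation at submit time and polling inside .result(). In this toy, an operation advances each time it is polled, which a real server would not do.

```python
import time


class ToyFuture:
    """Hypothetical stand-in for an SDK future; result() polls until done."""

    def __init__(self, server, op_id):
        self._server = server
        self._op_id = op_id

    def result(self, timeout=None, poll_interval=0.001):
        deadline = None if timeout is None else time.monotonic() + timeout
        while True:
            done, value = self._server.poll(self._op_id)
            if done:
                return value
            if deadline is not None and time.monotonic() >= deadline:
                raise TimeoutError(f"operation {self._op_id} did not finish in time")
            time.sleep(poll_interval)


class ToyServer:
    """Hypothetical server: each operation finishes after a fixed number of polls."""

    def __init__(self, polls_until_done=2):
        self._polls_until_done = polls_until_done
        self._ops = {}

    def submit(self, name, data):
        # Validation happens here, on the calling thread, before any waiting.
        if not data:
            raise ValueError(f"{name}: data must not be empty")
        op_id = len(self._ops)
        self._ops[op_id] = {"name": name, "polls_left": self._polls_until_done}
        return ToyFuture(self, op_id)

    def poll(self, op_id):
        op = self._ops[op_id]
        if op["polls_left"] > 0:
            op["polls_left"] -= 1
            return False, None
        return True, f"{op['name']} done"


server = ToyServer()
# Submit two operations back to back, then collect the results in order.
first = server.submit("forward_backward", data=[1, 2, 3])
second = server.submit("optim_step", data=[4e-5])
results = [first.result(timeout=5.0), second.result(timeout=5.0)]
```

Because both submits return before either result is awaited, a bad argument to the second call raises immediately rather than after the first operation finishes.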

Every long-running server operation on ServiceClient, TrainingClient, and SamplingClient (for example, forward_backward, sample, create_lora_training_client) has an awaitable *_async counterpart for callers running their own event loop. The async variants accept the same arguments as their synchronous counterparts. Simpler blocking calls like health, ensure_ready, get_tokenizer, and close (whose async form is aclose) have no *_async twin.
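
As a rough illustration of how *_async methods compose on your own event loop, here is a minimal sketch using asyncio.gather. ToySamplingClient is a hypothetical stand-in, and the single prompt argument to sample_async is an assumption for illustration; the real argument list is whatever the synchronous sample accepts.

```python
import asyncio


class ToySamplingClient:
    """Hypothetical stand-in; the real client talks to a sampler over HTTP."""

    async def sample_async(self, prompt):
        # Simulate network latency without blocking the event loop.
        await asyncio.sleep(0.01)
        return f"completion for {prompt!r}"


async def sample_many(client, prompts):
    # Start all requests concurrently; gather returns results in input order.
    return await asyncio.gather(*(client.sample_async(p) for p in prompts))
```

From synchronous code, asyncio.run(sample_many(client, prompts)) drives the coroutine to completion.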

ForwardBackwardFuture

Run a forward and backward pass over data (a list of Datum objects) using the specified loss function. Returns a ForwardBackwardFuture; call .result() to block until the pass completes and retrieve the loss.

loss_fn defaults to "cross_entropy". The trainer accepts cross_entropy, importance_sampling, ppo, dppo, cispo, dro, and dpo. See Loss functions for the data shape and usage example for each.

ForwardBackwardFuture

Run a forward pass without gradient computation. Same inputs and output shape as forward_backward, but the gradient buffer is left untouched, so it is safe to interleave with gradient accumulation steps.

APIFuture[OptimStepResponse]

Apply the accumulated gradients using the Adam optimizer configured by AdamParams. Call this after one or more forward_backward calls.
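
The interaction between the gradient buffer, forward-only passes, and the optimizer step can be sketched with a toy one-parameter model. This is an illustration under stated assumptions, not the trainer's code: ToyTrainer is hypothetical, the Adam defaults shown are the commonly used ones rather than values confirmed for AdamParams, and clearing the buffer after the step is assumed.

```python
import math


class ToyTrainer:
    """Hypothetical one-parameter model: prediction = w * x, squared-error loss."""

    def __init__(self):
        self.w = 0.0
        self.grad = 0.0   # accumulated gradient buffer
        self.m = 0.0      # Adam first moment
        self.v = 0.0      # Adam second moment
        self.steps = 0

    def forward(self, batch):
        # Loss only; the gradient buffer is not touched.
        return sum((self.w * x - y) ** 2 for x, y in batch) / len(batch)

    def forward_backward(self, batch):
        loss = self.forward(batch)
        # d/dw of the mean squared error, added to whatever is already buffered.
        self.grad += sum(2 * (self.w * x - y) * x for x, y in batch) / len(batch)
        return loss

    def optim_step(self, learning_rate, beta1=0.9, beta2=0.999, eps=1e-8):
        self.steps += 1
        self.m = beta1 * self.m + (1 - beta1) * self.grad
        self.v = beta2 * self.v + (1 - beta2) * self.grad ** 2
        m_hat = self.m / (1 - beta1 ** self.steps)   # bias correction
        v_hat = self.v / (1 - beta2 ** self.steps)
        self.w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
        self.grad = 0.0  # assumption: the buffer is cleared after each step


trainer = ToyTrainer()
trainer.forward_backward([(1.0, 2.0)])      # micro-batch 1 adds to the buffer
eval_loss = trainer.forward([(3.0, 6.0)])   # evaluation leaves the buffer alone
trainer.forward_backward([(2.0, 4.0)])      # micro-batch 2 adds to the buffer
accumulated = trainer.grad
trainer.optim_step(learning_rate=0.1)       # one update from both micro-batches
```

The evaluation pass in the middle does not disturb the accumulated gradient, which is the property that makes interleaving safe.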

APIFuture[SaveWeightsResponse]

Persist a local training checkpoint under name. When a weight sync URI is configured server-side, save_state also publishes the LoRA adapter so a polling sampler can hot-swap to the new weights.

APIFuture[SaveWeightsResponse]

Publish the LoRA adapter to the run’s paired sampler under name without returning a snapshot-pinned SamplingClient. Use this when you don’t need the version gate that save_weights_and_get_sampling_client provides.

SamplingClient

Publish the LoRA adapter to the run’s paired sampler under name and return a SamplingClient that serves at least the newly published version. The first call on a run provisions the paired sampler and links it to the run; later calls reuse the same sampler. The call blocks only until the publish completes. The returned client sends X-Min-Policy-Version on every sample() call; replicas below that version return 503 with Retry-After, and the SDK retries until the new weights are live. Your first sample() calls wait through the sampler’s cold start and weight load.
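
The version gate described above amounts to a retry loop keyed on a minimum-version header. The sketch below simulates it with a hypothetical ToyReplica that finishes loading new weights after a few requests; the retry policy (attempt cap, honoring Retry-After as seconds) is an assumption, not the SDK's actual behavior.

```python
import time


class ToyReplica:
    """Hypothetical sampler replica that loads new weights after a few requests."""

    def __init__(self, current_version, new_version, requests_until_loaded):
        self.version = current_version
        self._new_version = new_version
        self._remaining = requests_until_loaded

    def handle(self, headers):
        if self._remaining > 0:
            self._remaining -= 1
        else:
            self.version = self._new_version
        min_version = int(headers["X-Min-Policy-Version"])
        if self.version < min_version:
            # Still serving older weights: ask the caller to retry later.
            return 503, {"Retry-After": "1"}, None
        return 200, {}, f"sample from policy v{self.version}"


def sample_with_version_gate(replica, min_version, max_attempts=10, sleep=time.sleep):
    headers = {"X-Min-Policy-Version": str(min_version)}
    for _ in range(max_attempts):
        status, resp_headers, body = replica.handle(headers)
        if status == 200:
            return body
        if status != 503:
            raise RuntimeError(f"unexpected status {status}")
        # Wait for the interval the replica requested before trying again.
        sleep(float(resp_headers.get("Retry-After", "1")))
    raise TimeoutError(f"no replica reached version {min_version}")
```

Passing sleep as a parameter keeps the loop testable: a test can record the requested waits instead of actually sleeping.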

APIFuture[LoadWeightsResponse]

Load weights from a bt://loops:<run_id>/weights/<checkpoint> URI into this trainer. Use it to resume training from a checkpoint.
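
If you assemble these URIs yourself, a small helper avoids typos. The functions below are hypothetical conveniences, not part of the SDK; they follow the bt://loops:<run_id>/weights/<checkpoint> shape given above (and the sampler_weights variant documented for sampler checkpoints) and assume run IDs contain no "/" character.

```python
import re

_CHECKPOINT_URI = re.compile(
    r"^bt://loops:(?P<run_id>[^/]+)/(?P<kind>weights|sampler_weights)/(?P<checkpoint>.+)$"
)


def build_checkpoint_uri(run_id, checkpoint, kind="weights"):
    """Format a checkpoint URI; kind is 'weights' or 'sampler_weights'."""
    if kind not in ("weights", "sampler_weights"):
        raise ValueError(f"unknown checkpoint kind: {kind!r}")
    return f"bt://loops:{run_id}/{kind}/{checkpoint}"


def parse_checkpoint_uri(uri):
    """Split a checkpoint URI into run_id, kind, and checkpoint."""
    match = _CHECKPOINT_URI.match(uri)
    if match is None:
        raise ValueError(f"not a loops checkpoint URI: {uri!r}")
    return match.groupdict()
```

For example, build_checkpoint_uri("run123", "step-1") yields a trainer checkpoint URI for a made-up run ID, and parse_checkpoint_uri reverses it.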

APIFuture[LoadWeightsResponse]

Same as load_state but also restores Adam moments. Use when you want bit-exact resumption.

list[Checkpoint]

List checkpoints for the run bound to this client. Requires that this client was constructed using ServiceClient.create_lora_training_client (which populates the necessary session and run IDs automatically). Returns a list of Checkpoint objects.

CheckpointFilesResponse

Return presigned URLs for every file in a checkpoint folder. If the checkpoint files live in S3, export S3_REGION to that bucket’s AWS region before calling this method. Same semantics as ServiceClient.get_checkpoint_archive_url.

SamplingClient

Return a SamplingClient bound to the run’s paired sampler, serving the weights at model_path (a bt://loops:<run_id>/sampler_weights/<checkpoint> URI) until the trainer publishes a newer version. The first call on a run provisions the paired sampler; later calls reuse it. Distinct from ServiceClient.create_sampling_client, which provisions a standalone sampler with no weight sync.

PreTrainedTokenizer

Return the Hugging Face PreTrainedTokenizer for the base model. Cached after the first load.

GetInfoResponse

Return the model configuration for this training session (base model name, LoRA rank, and max sequence length) without a server round-trip.

str | None

Property. The run ID this client is bound to. Use this when filtering checkpoints or making HTTP API calls against the same run.

int

Property. The number of optim_step calls applied to the trainer so far. Each access issues a GET /policy_version round-trip, so read it deliberately rather than in a tight loop.
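
Because each property access costs a round-trip, read the value once and reuse it. The sketch below counts accesses on a hypothetical ToyTrainingClient; the property name policy_version is assumed from the endpoint path and may differ in the SDK.

```python
class ToyTrainingClient:
    """Hypothetical stand-in that counts simulated GET /policy_version calls."""

    def __init__(self, version):
        self._version = version
        self.round_trips = 0

    @property
    def policy_version(self):
        self.round_trips += 1  # each access would be one HTTP request
        return self._version


def label_rollouts(client, rollouts):
    # Read the property once, outside the loop, and reuse the local value.
    version = client.policy_version
    return [(rollout, version) for rollout in rollouts]
```

Labeling any number of rollouts this way costs a single request instead of one per rollout.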

APIFuture[InitTrainerServerResponse]

Reset trainer state to a fresh LoRA adapter at lora_rank. Use it to start a new adapter on an existing trainer without reprovisioning.

None

Check the trainer’s /health endpoint. Returns None on success and raises if the trainer is unreachable or unhealthy.

None

Close the client’s HTTP connections and finish any active Weights & Biases run. In async code, call aclose() instead.

An open TrainingClient pings the trainer’s /health endpoint every 5 minutes from a background thread, holding the training session (and its GPUs) warm through long gaps between training calls, such as sampling phases in an RL loop. Calling close() stops the keepalive and lets the session be reclaimed as idle. The keepalive also stops when your process exits or after 20 consecutive failed pings.
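
The keepalive behavior can be approximated with a daemon thread and a stop event. This is a minimal sketch of the pattern, not the SDK's code: the Keepalive class and its ping callable are hypothetical, with the 5-minute interval and 20-failure limit taken from the description above.

```python
import threading


class Keepalive:
    """Hypothetical background pinger that stops on request or repeated failure."""

    def __init__(self, ping, interval_s=300.0, max_failures=20):
        self._ping = ping
        self._interval_s = interval_s
        self._max_failures = max_failures
        self._stop = threading.Event()
        # daemon=True so the thread never keeps the process alive on exit.
        self.thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self.thread.start()

    def _run(self):
        failures = 0
        # Event.wait returns True once stop() is called, which ends the loop.
        while not self._stop.wait(self._interval_s):
            try:
                self._ping()
                failures = 0  # any success resets the consecutive-failure count
            except Exception:
                failures += 1
                if failures >= self._max_failures:
                    return

    def stop(self):
        self._stop.set()
        if self.thread.is_alive():
            self.thread.join()
```

Waiting on the event rather than calling time.sleep lets stop() interrupt a long interval immediately instead of waiting out the remaining minutes.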

None

Async counterpart to close() for callers running their own event loop.
