SamplingClient

SamplingClient generates text completions from the model the sampler currently has loaded. There are two ways to create one:

ServiceClient.create_sampling_client provisions a standalone sampler that serves a base model or a fixed checkpoint, with no weight sync.

TrainingClient.save_weights_and_get_sampling_client returns a client on the run’s weight-synced paired sampler that serves at least the version you just published.

Both clients expose the same sample method.

For high-concurrency workloads, the SDK configures a large HTTP connection pool by default. Override it with BT_LOOPS_MAX_CONNECTIONS when you need a different limit, for example on a VM behind Cloud NAT with a tight port budget.

Generate from the current weights:

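from baseten.loops import ModelInput, SamplingParams

# sampling_client is a SamplingClient obtained by either method above;
# prompt is the prompt as a list of integer token IDs.
result = sampling_client.sample(
    prompt=ModelInput.from_ints(prompt),
    num_samples=1,
    sampling_params=SamplingParams(max_tokens=16),
)
# Token IDs of the first (and here only) generated sequence.
print(result.sequences[0].tokens)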
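
The connection-pool override described above can also be applied from Python rather than from the shell. This is a minimal sketch that assumes the SDK reads BT_LOOPS_MAX_CONNECTIONS when it builds its HTTP client, so the variable is set before any client is created. The value 64 is an arbitrary illustration, not a recommended limit.

```python
import os

# Assumption: the SDK reads this variable when it builds its HTTP pool,
# so set it before creating any ServiceClient or SamplingClient.
# 64 is an illustrative value only.
os.environ["BT_LOOPS_MAX_CONNECTIONS"] = "64"

# Read it back to confirm what the process will see.
max_connections = int(os.environ["BT_LOOPS_MAX_CONNECTIONS"])
print(max_connections)
```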

SampleResponse

Generate num_samples completions from prompt (a ModelInput). Pass a SamplingParams instance to control temperature, top-p, top-k, max tokens, seed, and stop sequences; omit it to use defaults.

Set include_prompt_logprobs=True to get per-token log-probabilities for the input tokens alongside the output, and set topk_prompt_logprobs above 0 to also return the top-k alternatives at each prompt position.

The sampler resolves which adapter or base model to serve from the version headers the client carries, so there is no per-call model override.

affinity_key (keyword-only) is sent as the X-Baseten-Session-ID header so that Baseten deployments with sticky routing enabled send related requests to the same sampler replica, keeping a rollout group’s shared prompt in one replica’s prefix cache. Reuse one key across every rollout and turn in a rollout group; don’t reuse a single key for an entire run. When set, it must be a non-empty string.
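
One way to follow the affinity_key guidance above is to derive the key from identifiers that are constant within a rollout group and different between groups. The helper below is a sketch only: rollout_group_key and its run_id, step, and group_index arguments are hypothetical names, and the key format is arbitrary because the SDK only requires a non-empty string.

```python
def rollout_group_key(run_id: str, step: int, group_index: int) -> str:
    """Return one stable key per rollout group (hypothetical naming scheme)."""
    if not run_id:
        raise ValueError("run_id must be a non-empty string")
    return f"{run_id}/step-{step}/group-{group_index}"


# Every rollout and turn in a group derives the same key,
# while different groups in the same run get different keys.
keys = [rollout_group_key("run-a", step=3, group_index=g) for g in range(2)]
print(keys)

# Each key would then be passed on every call for that group, e.g.
#   sampling_client.sample(..., affinity_key=keys[0])
```

Deriving the key deterministically means concurrent workers do not need to share state to agree on it.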

list[float | None]

Return the per-token log-probabilities for prompt without generating any new tokens. Index 0 is always None because the first token has no preceding context to score against. Other positions may also be None if the sampler can’t compute a log-probability for that token. affinity_key works exactly as in sample(); scoring a prompt is a full prefill, so passing the rollout group’s key routes the request to the replica whose prefix cache already holds the group’s shared prompt.
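
Because the returned list can contain None at index 0 and at other positions, any aggregate over it has to skip those entries. The sketch below shows one way to do that on a plain list[float | None]; summarize_prompt_logprobs is a hypothetical helper, not part of the SDK, and the input values are made up.

```python
import math
from typing import Optional, Sequence, Tuple


def summarize_prompt_logprobs(
    logprobs: Sequence[Optional[float]],
) -> Tuple[float, int, Optional[float]]:
    """Return (total log-prob, scored token count, perplexity) ignoring None."""
    scored = [lp for lp in logprobs if lp is not None]
    total = sum(scored)
    # Perplexity is undefined when no position could be scored.
    perplexity = math.exp(-total / len(scored)) if scored else None
    return total, len(scored), perplexity


# Made-up values: index 0 is None, and one later position is unscored.
total, count, perplexity = summarize_prompt_logprobs([None, -0.5, None, -1.5])
print(total, count, perplexity)
```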

str

Return the base model ID from the sampler’s /v1/models list, specifically the entry with no parent. Retries with backoff while the sampler is still deploying.

str | None

Return the currently registered LoRA adapter ID (the first /v1/models entry with a non-null parent), or None if no adapter is loaded.
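
The two lookups above apply the same rule to the /v1/models list: the base model is the entry with no parent, and the adapter is the first entry with a non-null parent. The sketch below applies that rule to an in-memory payload. The payload shape (a "data" list of entries with "id" and "parent" fields) is an assumption modeled on OpenAI-compatible servers, and split_models and the model IDs are hypothetical.

```python
from typing import Any, Dict, Optional, Tuple


def split_models(payload: Dict[str, Any]) -> Tuple[Optional[str], Optional[str]]:
    """Return (base model ID, adapter ID) from an assumed /v1/models payload."""
    entries = payload.get("data", [])
    # Base model: the entry whose parent is missing or null.
    base = next((e["id"] for e in entries if e.get("parent") is None), None)
    # Adapter: the first entry that names a parent; None if no adapter is loaded.
    adapter = next((e["id"] for e in entries if e.get("parent") is not None), None)
    return base, adapter


# Made-up payload for illustration.
payload = {
    "data": [
        {"id": "base-model", "parent": None},
        {"id": "adapter-v7", "parent": "base-model"},
    ]
}
base_id, adapter_id = split_models(payload)
print(base_id, adapter_id)
```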

str

Return the base model ID this sampling client was created with, without contacting the server.

PreTrainedTokenizer

Return the Hugging Face PreTrainedTokenizer for the base model this client was created with.

None

Block until the sampler’s deployment status is ACTIVE. A scaled-to-zero deployment triggers one wake; terminal-failure states raise. No-op for local deployments.

None

Class method. Block until deployment reports ready, using a throwaway SamplingClient so you can wait without holding one. Polls up to ready_timeout seconds and applies the same readiness semantics as ensure_ready.
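
The readiness semantics described above (return on ACTIVE, raise on terminal failure, give up after ready_timeout seconds) amount to a bounded polling loop. The sketch below illustrates that loop in isolation and is not the SDK's implementation: wait_until_active, get_status, poll_interval, and the status names in TERMINAL_FAILURES are all hypothetical, and only ACTIVE and ready_timeout come from the text above.

```python
import time
from typing import Callable

# Hypothetical terminal-failure status names, for illustration only.
TERMINAL_FAILURES = {"FAILED", "BUILD_FAILED"}


def wait_until_active(
    get_status: Callable[[], str],
    ready_timeout: float,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll get_status until it returns ACTIVE, a failure state, or time runs out."""
    deadline = time.monotonic() + ready_timeout
    while True:
        status = get_status()
        if status == "ACTIVE":
            return
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"deployment entered terminal state {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"deployment not ACTIVE after {ready_timeout}s")
        sleep(poll_interval)


# Simulated status sequence; sleep is stubbed out so the example runs instantly.
statuses = iter(["DEPLOYING", "DEPLOYING", "ACTIVE"])
polls = []


def fake_status() -> str:
    status = next(statuses)
    polls.append(status)
    return status


wait_until_active(fake_status, ready_timeout=30, sleep=lambda seconds: None)
print(polls)
```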
