The Baseten Performance Client is a high-performance library for interacting with model endpoints, supporting embeddings, reranking, and classification. It is available for both Python (pip) and Node.js (npm).
- `base_url`.
- `max_chars_per_request`, `batch_size`: Packs requests into batches by number of entries or by character count, whichever limit is reached first. Useful for optimal distribution across all your replicas on Baseten.
- `hedge_delay`: Sends a duplicate request after a delay (≥0.2s) to reduce p99.5 latency. Once `hedge_delay` (in seconds) has elapsed, your request is cloned once and races the original request. Limited by a 5% budget. Default: disabled.
- `timeout_s`: Timeout on each request. Raises a `request.TimeoutError` once a single request can't be retried. 429 and 5xx errors are always retried.
- `batch_post`: For sending POST requests to any URL.
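The packing rule above (close a batch when either the entry limit or the character limit would be exceeded, whichever comes first) can be sketched in plain Python. This is an illustrative helper, not the client's actual implementation:

```python
from typing import List

def pack_batches(texts: List[str], batch_size: int,
                 max_chars_per_request: int) -> List[List[str]]:
    """Group texts into batches, closing the current batch as soon as
    adding another text would exceed either the entry limit or the
    character limit (illustrative sketch of the packing rule)."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_chars = 0
    for text in texts:
        over_entries = len(current) >= batch_size
        over_chars = current and current_chars + len(text) > max_chars_per_request
        if over_entries or over_chars:
            batches.append(current)
            current, current_chars = [], 0
        current.append(text)
        current_chars += len(text)
    if current:
        batches.append(current)
    return batches
```

With `batch_size=2`, `["aa", "bb", "cc"]` splits after two entries; with `max_chars_per_request=5`, `["aaaa", "bb"]` splits because the second text would push the batch past five characters.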
- Built for benchmarks (p90/p95/p99 timings): useful for kicking off massive batch tasks or benchmarking the performance of individual requests while retaining capped concurrency.
- Releases the GIL during all calls, so you can do work in parallel without impacting performance.
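The `hedge_delay` behaviour described above can be modelled with a small `asyncio` sketch. This is illustrative only; the client's internal implementation and its 5% hedging budget are not shown, and `coro_factory` is a hypothetical stand-in for a request:

```python
import asyncio

async def hedged(coro_factory, hedge_delay: float):
    """Race a single cloned request against the original once
    hedge_delay seconds have elapsed (illustrative sketch)."""
    # Start the original request.
    first = asyncio.ensure_future(coro_factory())
    try:
        # If the original finishes within hedge_delay, no clone is sent.
        return await asyncio.wait_for(asyncio.shield(first), hedge_delay)
    except asyncio.TimeoutError:
        pass
    # Original is slow: clone it once and take whichever finishes first.
    second = asyncio.ensure_future(coro_factory())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    return done.pop().result()
```

If the original request responds before the delay elapses, only one request is ever sent; the duplicate exists solely to cut the latency tail.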