Built in Rust and integrated with Python and Node.js, the client is optimized for massive concurrent POST requests. It releases the Python GIL while executing requests, enabling simultaneous sync and async usage. In our benchmarks, the Performance Client reached 1200+ requests per second per client. It can be used when deploying on Baseten, as well as with third-party providers such as OpenAI and Mixedbread.

Installation

Python

pip install baseten_performance_client

Node.js

npm install baseten-performance-client

Getting Started

Python

from baseten_performance_client import PerformanceClient

client = PerformanceClient(
    base_url="https://model-yqv4yjjq.api.baseten.co/environments/production/sync",
    api_key="YOUR_API_KEY"
)

Node.js

const { PerformanceClient } = require("baseten-performance-client");

const client = new PerformanceClient(
  "https://model-yqv4yjjq.api.baseten.co/environments/production/sync",
  process.env.BASETEN_API_KEY
);
You can also use OpenAI or Mixedbread endpoints by replacing the base_url.

Embeddings

The client provides efficient embedding requests with configurable batching, concurrency, and latency optimizations.

Example (Python)

texts = ["Hello world", "Example text", "Another sample"] * 10

response = client.embed(
    input=texts,
    model="my_model",
    batch_size=16,
    max_concurrent_requests=256,
    max_chars_per_request=10000,
    hedge_delay=15,
    timeout_s=360
)
Advanced parameters
  • max_chars_per_request, batch_size: Packs requests by number of entries or by character count, whichever limit is reached first. Useful for optimal distribution across all your replicas on Baseten.
  • hedge_delay: Sends a duplicate request after a delay (≥ 0.2 s) to reduce p99.5 latency. Once hedge_delay seconds have elapsed, your request is cloned once and races the original request. Hedging is limited to a 5% budget. Default: disabled.
  • timeout_s: Timeout for each request. A timeout error is raised once a single request can no longer be retried; 429 and 5xx errors are always retried.
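To build intuition for the dual-limit batching above, here is a pure-Python sketch (an illustration, not the client's actual Rust implementation): a batch is closed as soon as adding another text would exceed either batch_size entries or max_chars_per_request characters.

```python
def pack_batches(texts, batch_size=16, max_chars_per_request=10000):
    """Greedy packing: close a batch when either limit would be exceeded."""
    batches, current, current_chars = [], [], 0
    for text in texts:
        # Start a new batch if adding this text would exceed either limit.
        if current and (len(current) >= batch_size
                        or current_chars + len(text) > max_chars_per_request):
            batches.append(current)
            current, current_chars = [], 0
        current.append(text)
        current_chars += len(text)
    if current:
        batches.append(current)
    return batches

# 40 short texts with batch_size=4 pack into 10 batches of 4.
print(len(pack_batches(["hello"] * 40, batch_size=4)))  # → 10
```

Whichever limit binds first wins: a few very long texts produce small batches, while many short texts fill up to batch_size.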
Async usage is also supported:
response = await client.async_embed(input=texts, model="my_model")
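The hedging behavior described above can also be sketched locally with asyncio (a simulation with sleeps, not the client's implementation): after hedge_delay elapses, a clone is launched and whichever request finishes first wins.

```python
import asyncio

async def slow_request(latency, label):
    # Stand-in for a network request with the given latency.
    await asyncio.sleep(latency)
    return label

async def hedged(latency_original, latency_hedge, hedge_delay):
    # Launch the original; if it hasn't finished after hedge_delay,
    # launch a clone and race the two.
    original = asyncio.ensure_future(slow_request(latency_original, "original"))
    try:
        return await asyncio.wait_for(asyncio.shield(original), hedge_delay)
    except asyncio.TimeoutError:
        hedge = asyncio.ensure_future(slow_request(latency_hedge, "hedge"))
        done, pending = await asyncio.wait(
            {original, hedge}, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()
        return done.pop().result()

# Original is slow (0.5 s); the hedge launched after 0.1 s takes 0.05 s and wins.
print(asyncio.run(hedged(0.5, 0.05, 0.1)))  # → hedge
```

This is why hedging trims tail latency: only the slowest ~0.5% of requests ever pay for a clone, and the client caps that cost with its 5% budget.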

Example (Node.js)

const response = await client.embed(
  ["Hello world", "Example text", "Another sample"],
  "my_model"
);

Batch POST

Use batch_post to send POST requests to any URL. It is built for benchmarking (p90/p95/p99 timings) and is useful for kicking off massive batch tasks or measuring the performance of individual requests, while keeping concurrency capped. The GIL is released during all calls, so you can do other work in parallel without impacting performance.

Example (Python) - completions/chat completions

# requires stream=false / non-SSE responses
payloads = [
    {"model": "my_model", "prompt": "Batch request 1", "stream": False},
    {"model": "my_model", "prompt": "Batch request 2", "stream": False},
] * 10

response = client.batch_post(
    url_path="/v1/completions",
    payloads=payloads,
    max_concurrent_requests=96,
    timeout_s=720,
    hedge_delay=30,
)
responses = response.data  # list with 20 dicts
# timings = response.individual_request_times  # per-request timings, one entry per request
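For example, tail-latency percentiles can be derived from that timings list with pure Python (a sketch; individual_request_times is assumed here to hold per-request durations in seconds):

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a list of timings."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[rank]

# Synthetic timings in seconds; two slow outliers dominate the tail.
timings = [0.10, 0.12, 0.11, 0.50, 0.13, 0.12, 0.14, 0.11, 0.13, 0.90]
print(percentile(timings, 90))  # → 0.5
print(percentile(timings, 99))  # → 0.9
```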

Example (Node.js)

const payloads = [
  { model: "my_model", input: ["Batch request 1"] },
  { model: "my_model", input: ["Batch request 2"] },
];

const response = await client.batchPost("/v1/embeddings", payloads, 96);

Reranking

Compatible with BEI and text-embeddings-inference.

Example (Python)

response = client.rerank(
    query="What is the best framework?",
    texts=["Doc 1", "Doc 2", "Doc 3"]
)

Classification

Supports classification endpoints such as BEI or text-embeddings-inference.

Example (Python)

response = client.classify(inputs=[
    "This is great!",
    "I did not like it.",
    "Neutral experience."
])

Error Handling

The client raises standard Python/Node.js errors:
  • HTTPError: Authentication failures, 4xx/5xx responses.
  • ValueError: Invalid inputs (e.g., empty list, invalid batch size).
Example:
import requests

try:
    response = client.embed(input=["Hello"], model="my_model")
except requests.exceptions.HTTPError as e:
    print("HTTP error:", e)

More examples, detailed usage, and how to contribute to the open-source library:

Check out the README in the GitHub truss repo: baseten-performance-client