
Built in Rust and integrated with Python and Node.js, the Performance Client handles concurrent POST requests efficiently. It releases the Python GIL while executing requests, enabling simultaneous sync and async usage. Benchmarks show the Performance Client reaching 1,200+ requests per second per client. Use it with Baseten deployments or with third-party providers such as OpenAI.

Installation

Python

pip install baseten_performance_client

Node.js

npm install baseten-performance-client

Rust

cargo add baseten_performance_client_core

Get started

Python

from baseten_performance_client import PerformanceClient

# client = PerformanceClient(base_url="https://api.baseten.co", api_key="YOUR_API_KEY")
# Also works with most third-party providers
client = PerformanceClient(
    base_url="https://model-yqv4yjjq.api.baseten.co/environments/production/sync", 
    api_key="YOUR_API_KEY"
)

Node.js

const { PerformanceClient } = require("baseten-performance-client");

// const client = new PerformanceClient("https://api.baseten.co", process.env.BASETEN_API_KEY);
// Also works with third-party providers
const client = new PerformanceClient(
  "https://model-yqv4yjjq.api.baseten.co/environments/production/sync",
  process.env.BASETEN_API_KEY
);

You can also use any OpenAI-compatible endpoint by replacing the base_url.
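
For example, a minimal sketch pointing the Python client at a non-Baseten, OpenAI-compatible API. The base URL and key here are illustrative; whether a path suffix (such as /v1) is needed depends on the provider's API layout.

from baseten_performance_client import PerformanceClient

# Illustrative only: any OpenAI-compatible base URL is configured the same way.
openai_client = PerformanceClient(
    base_url="https://api.openai.com",
    api_key="YOUR_OPENAI_API_KEY",
)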

Embeddings

The client provides efficient embedding requests with configurable batching, concurrency, and latency optimizations.

Example (Python)

texts = ["Hello world", "Example text", "Another sample"] * 10

response = client.embed(
    input=texts,
    model="my_model",
    batch_size=16,
    max_concurrent_requests=256,
    max_chars_per_request=10000,
    hedge_delay=0.5,
    timeout_s=360,
    total_timeout_s=600
)

# Access embedding data
numpy_array = response.numpy() # requires numpy
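
As a quick follow-up, assuming response.numpy() returns a (num_texts, embedding_dim) float array (an assumption, not documented above), you could compute pairwise cosine similarities directly:

import numpy as np

embeddings = response.numpy()  # assumed shape: (len(texts), embedding_dim)
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = normed @ normed.T  # cosine similarity matrix
print(similarity[0, 1])         # similarity between the first two texts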
Advanced parameters:
  • batch_size, max_chars_per_request: Packs requests into batches by number of entries or by character count, whichever limit is reached first. Useful for distributing work evenly across replicas; see the sketch after this list.
  • hedge_delay: Sends a duplicate request after the given delay (≥0.2s) to reduce p99.5 latency. After the delay, the request is cloned and raced against the original. Limited to a 5% budget. Default: disabled.
  • timeout_s: Timeout for each individual request. Raises TimeoutError when a single request can no longer be retried; 429 and 5xx errors are always retried.
  • total_timeout_s: Total timeout for the entire operation in seconds. Sets an upper bound on the total time for all batched requests combined.
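For intuition, here is a minimal, illustrative sketch of how inputs might be packed under both limits. It mirrors the description above and is not the client's actual implementation.

def pack(texts, batch_size=16, max_chars_per_request=10000):
    """Group texts into batches, closing a batch when either limit is reached."""
    batches, current, current_chars = [], [], 0
    for text in texts:
        if current and (len(current) >= batch_size
                        or current_chars + len(text) > max_chars_per_request):
            batches.append(current)
            current, current_chars = [], 0
        current.append(text)
        current_chars += len(text)
    if current:
        batches.append(current)
    return batches

# Each batch becomes one POST request, with up to
# max_concurrent_requests requests in flight at a time.
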
Async usage is also supported:
import asyncio

async def main():
    response = await client.async_embed(input=texts, model="my_model")
    print(response.data)

# asyncio.run(main())

Example (Node.js)

const texts = ["Hello world", "Example text", "Another sample"];
const response = await client.embed(
    texts,                      // input
    "my_model",                 // model
    null,                       // encodingFormat
    null,                       // dimensions
    null,                       // user
    32,                         // maxConcurrentRequests
    4,                          // batchSize
    360.0,                      // timeoutS
    10000,                      // maxCharsPerRequest
    0.5                         // hedgeDelay
);

// Accessing embedding data
console.log(`Model used: ${response.model}`);
console.log(`Total tokens used: ${response.usage.total_tokens}`);

Batch POST

Use batch_post to send POST requests to any URL path. It is built for benchmarks (p90/p95/p99 timings) and is useful for kicking off massive batch tasks or benchmarking individual requests while keeping concurrency capped. The GIL is released during all calls, so you can do other work in parallel without impacting performance.

Example (Python) - completions/chat completions

# requires stream=false / non-sse response.
payloads = [
    {"model": "my_model", "prompt": "Batch request 1", "stream": False},
    {"model": "my_model", "prompt": "Batch request 2", "stream": False}
] * 10

response = client.batch_post(
    url_path="/v1/completions",
    payloads=payloads,
    max_concurrent_requests=96,
    timeout_s=720,
    hedge_delay=30,
)
responses = response.data # array with 20 dicts
# timings = response.individual_request_times # array with the time.time() for each request
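
Since batch_post is built for benchmarking, you can derive percentile timings from the per-request times. A minimal sketch, assuming individual_request_times holds one duration (in seconds) per request:

import numpy as np

times = np.array(response.individual_request_times)  # assumed: one duration per request, in seconds
p50, p90, p95, p99 = np.percentile(times, [50, 90, 95, 99])
print(f"p50={p50:.3f}s p90={p90:.3f}s p95={p95:.3f}s p99={p99:.3f}s")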

Example (Node.js)

const payloads = [
  { model: "my_model", input: ["Batch request 1"] },
  { model: "my_model", input: ["Batch request 2"] },
];

const response = await client.batchPost(
    "/v1/embeddings",           // urlPath
    payloads,                   // payloads
    96,                         // maxConcurrentRequests
    360.0                       // timeoutS
);

Reranking

Compatible with BEI and text-embeddings-inference.

Example (Python)

response = client.rerank(
    query="What is the best framework?",
    texts=["Doc 1", "Doc 2", "Doc 3"],
    return_text=True,
    batch_size=2,
    max_concurrent_requests=16
)
for res in response.data:
    print(f"Index: {res.index} Score: {res.score}")

Classification

Supports classification endpoints such as BEI or text-embeddings-inference.

Example (Python)

response = client.classify(
    inputs=[
        "This is great!",
        "I did not like it.",
        "Neutral experience."
    ],
    batch_size=2,
    max_concurrent_requests=16
)
for group in response.data:
    for result in group:
        print(f"Label: {result.label}, Score: {result.score}")

Error handling

The client raises standard Python/Node.js errors:
  • HTTPError: Authentication failures, 4xx/5xx responses.
  • Timeout: Raised when a request or the total operation times out.
  • ValueError: Invalid inputs (e.g., empty list, invalid batch size).
Example:
import requests

try:
    response = client.embed(input=["Hello"], model="my_model")
except requests.exceptions.HTTPError as e:
    print(f"HTTP error: {e}, status code: {e.response.status_code}")
except requests.exceptions.Timeout as e:
    print(f"Timeout error: {e}")
except ValueError as e:
    print(f"Input error: {e}")

Further reading