Built in Rust and integrated with Python and Node.js, the client is optimized for massive concurrent POST requests. It releases the Python GIL while executing requests, enabling simultaneous sync and async usage. In our benchmarks, the Performance Client reached 1200+ requests per second per client. It can be used when deploying on Baseten, as well as with third-party providers such as OpenAI and Mixedbread.

Installation

Python

pip install baseten_performance_client

Node.js

npm install baseten-performance-client

Rust

cargo add baseten_performance_client_core

Getting Started

Python

from baseten_performance_client import PerformanceClient

# client = PerformanceClient(base_url="https://api.baseten.co", api_key="YOUR_API_KEY")
# Also works with most third-party providers
client = PerformanceClient(
    base_url="https://model-yqv4yjjq.api.baseten.co/environments/production/sync", 
    api_key="YOUR_API_KEY"
)

Node.js

const { PerformanceClient } = require("baseten-performance-client");

// const client = new PerformanceClient("https://api.baseten.co", process.env.BASETEN_API_KEY);
// Also works with third-party providers
const client = new PerformanceClient(
  "https://model-yqv4yjjq.api.baseten.co/environments/production/sync",
  process.env.BASETEN_API_KEY
);

You can also use any OpenAI-compatible or Mixedbread endpoint by replacing the base_url.
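
For example, a minimal sketch against an OpenAI endpoint (the base URL and model name below are illustrative placeholders, not values from this guide; check your provider's docs for the exact URL to target):

from baseten_performance_client import PerformanceClient

# Placeholder values: substitute your provider's OpenAI-compatible base URL and API key.
openai_client = PerformanceClient(
    base_url="https://api.openai.com/v1",
    api_key="YOUR_PROVIDER_API_KEY",
)
response = openai_client.embed(input=["Hello world"], model="text-embedding-3-small")
print(len(response.data))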

Embeddings

The client provides efficient embedding requests with configurable batching, concurrency, and latency optimizations.

Example (Python)

texts = ["Hello world", "Example text", "Another sample"] * 10

response = client.embed(
    input=texts,
    model="my_model",
    batch_size=16,
    max_concurrent_requests=256,
    max_chars_per_request=10000,
    hedge_delay=0.5,
    timeout_s=360,
    total_timeout_s=600
)

# Access embedding data
numpy_array = response.numpy() # requires numpy

Advanced parameters
  • max_chars_per_request, batch_size: Packs requests into batches by number of entries or by character count, whichever limit is reached first. Useful for optimal distribution across all your replicas on Baseten (see the sketch after this list).
  • hedge_delay: Sends a duplicate request after a delay (≥0.2s) to reduce p99.5 latency. Once hedge_delay seconds have elapsed, your request is cloned once and races the original request. Hedging is capped at a 5% budget. Default: disabled.
  • timeout_s: Timeout for each individual request. A requests.exceptions.Timeout is raised once a single request can no longer be retried. 429 and 5xx errors are always retried.
  • total_timeout_s: Total timeout for the entire operation in seconds. Sets an upper bound on the total time for all batched requests combined.
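
A rough illustration of that packing rule (a conceptual sketch, not the client's actual implementation; the function name and defaults are made up for illustration):

def pack_batches(texts, batch_size=16, max_chars_per_request=10000):
    """Close a batch when either the entry count or the character budget
    is reached, whichever comes first."""
    batches, current, current_chars = [], [], 0
    for text in texts:
        if current and (
            len(current) >= batch_size
            or current_chars + len(text) > max_chars_per_request
        ):
            batches.append(current)
            current, current_chars = [], 0
        current.append(text)
        current_chars += len(text)
    if current:
        batches.append(current)
    return batches

# e.g. 30 short texts with batch_size=16 -> two batches of 16 and 14 entries
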
Async usage is also supported:
import asyncio

async def main():
    response = await client.async_embed(input=texts, model="my_model")
    print(response.data)

# asyncio.run(main())
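
Because the client releases the GIL while requests execute, sync and async calls can run side by side. A minimal sketch reusing the client and texts from above (the thread-pool setup is illustrative, not required by the library):

import asyncio
from concurrent.futures import ThreadPoolExecutor

async def mixed_usage():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=1) as pool:
        # The sync embed call runs in a worker thread; since the GIL is released
        # during the request, it does not block the async call below.
        sync_future = loop.run_in_executor(
            pool, lambda: client.embed(input=texts, model="my_model")
        )
        async_response = await client.async_embed(input=texts, model="my_model")
        sync_response = await sync_future
    print(len(async_response.data), len(sync_response.data))

# asyncio.run(mixed_usage())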

Example (Node.js)

const texts = ["Hello world", "Example text", "Another sample"];
const response = await client.embed(
    texts,                      // input
    "my_model",                 // model
    null,                       // encodingFormat
    null,                       // dimensions
    null,                       // user
    32,                         // maxConcurrentRequests
    4,                          // batchSize
    360.0,                      // timeoutS
    10000,                      // maxCharsPerRequest
    0.5                         // hedgeDelay
);

// Accessing embedding data
console.log(`Model used: ${response.model}`);
console.log(`Total tokens used: ${response.usage.total_tokens}`);

Batch POST

Use batch_post to send POST requests to any URL path. It is built with benchmarking in mind (p90/p95/p99 timings) and is useful for kicking off massive batch tasks or for benchmarking individual request performance while keeping concurrency capped. The GIL is released during all calls, so you can do other work in parallel without impacting performance.

Example (Python) - completions/chat completions

# requires stream=False / non-SSE responses
payloads = [
    {"model": "my_model", "prompt": "Batch request 1", "stream": False},
    {"model": "my_model", "prompt": "Batch request 2", "stream": False}
] * 10

response = client.batch_post(
    url_path="/v1/completions",
    payloads=payloads,
    max_concurrent_requests=96,
    timeout_s=720,
    hedge_delay=30,
)
responses = response.data # array with 20 dicts
# timings = response.individual_request_times # array with the time.time() for each request
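
To get the p90/p95/p99 figures mentioned above, compute percentiles over those per-request timings. A small sketch, assuming individual_request_times holds one duration in seconds per request:

import statistics

# Assumption: response.individual_request_times holds one duration (seconds) per request.
times = sorted(response.individual_request_times)
cuts = statistics.quantiles(times, n=100)  # 99 cut points; index 89 ~ p90, 94 ~ p95, 98 ~ p99
print(f"p90={cuts[89]:.3f}s  p95={cuts[94]:.3f}s  p99={cuts[98]:.3f}s")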

Example (Node.js)

const payloads = [
  { model: "my_model", input: ["Batch request 1"] },
  { model: "my_model", input: ["Batch request 2"] },
];

const response = await client.batchPost(
    "/v1/embeddings",           // urlPath
    payloads,                   // payloads
    96,                         // maxConcurrentRequests
    360.0                       // timeoutS
);

Reranking

Compatible with BEI and text-embeddings-inference.

Example (Python)

response = client.rerank(
    query="What is the best framework?",
    texts=["Doc 1", "Doc 2", "Doc 3"],
    return_text=True,
    batch_size=2,
    max_concurrent_requests=16
)
for res in response.data:
    print(f"Index: {res.index} Score: {res.score}")

Classification

Supports classification endpoints such as BEI or text-embeddings-inference.

Example (Python)

response = client.classify(
    inputs=[
        "This is great!",
        "I did not like it.",
        "Neutral experience."
    ],
    batch_size=2,
    max_concurrent_requests=16
)
for group in response.data:
    for result in group:
        print(f"Label: {result.label}, Score: {result.score}")

Error Handling

The client raises standard Python/Node.js errors:
  • HTTPError: Authentication failures, 4xx/5xx responses.
  • Timeout: Raised when a request or the total operation times out.
  • ValueError: Invalid inputs (e.g., empty list, invalid batch size).
Example:
import requests

try:
    response = client.embed(input=["Hello"], model="my_model")
except requests.exceptions.HTTPError as e:
    print(f"HTTP error: {e}, status code: {e.response.status_code}")
except requests.exceptions.Timeout as e:
    print(f"Timeout error: {e}")
except ValueError as e:
    print(f"Input error: {e}")

More examples, detailed usage, and contributing to the open-source library:

Check out the README in the GitHub truss repo: baseten-performance-client