When calling Baseten at scale, HTTP client configuration directly affects reliability and throughput.
Misconfigured clients cause Connection refused and Client closed connection errors that look like platform issues but originate client-side.
This page covers the settings that matter.
For a drop-in solution, use the Performance Client, which handles connection pooling, retries, and concurrency automatically.
Reuse client sessions
Creating a new HTTP client per request is the most common misconfiguration. Each
new client opens a fresh TCP connection, performs a full TLS handshake, and then
discards the connection after a single use. Under load, this pattern quickly
exhausts available ports and produces Connection refused errors that appear
intermittent and difficult to diagnose.
A reused client maintains a pool of open connections that are ready for
subsequent requests. This eliminates per-request connection overhead and keeps
your throughput stable as concurrency increases.
Create a single client session and reuse it for all requests.
For example, in Python, set up the client once at the start of the script and reuse it for all requests.

```python
import httpx

# Correct: reuse a client session
client = httpx.Client(
    base_url=f"https://model-{model_id}.api.baseten.co",
    headers={"Authorization": f"Api-Key {api_key}"},
)

def predict(payload):
    response = client.post("/environments/production/predict", json=payload)
    return response.json()
```
Creating a new client session for each request opens a fresh TCP connection every time.

```python
# Anti-pattern: new client per request
def predict(payload):
    response = httpx.post(
        url, json=payload, headers=headers
    )  # New connection every time
    return response.json()
```
Choose an HTTP client
Your choice of HTTP client library determines which connection management
features are available to you. The httpx
library is recommended over
requests for Baseten workloads
because it provides built-in connection pooling, native async support, and
optional HTTP/2. The requests library can achieve connection reuse through its
Session object, but lacks async support and requires more manual
configuration.
The OpenAI Python SDK uses httpx internally, so if you’re already using it, you
benefit from httpx’s connection handling by default.
For example, here is how to create a basic httpx.Client:
```python
import httpx

client = httpx.Client(
    base_url=f"https://model-{model_id}.api.baseten.co",
    headers={"Authorization": f"Api-Key {api_key}"},
)
```
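If you call Baseten through the OpenAI SDK, you can pass a preconfigured httpx.Client via the SDK's http_client argument, so the pooling and timeout settings described on this page still apply. A minimal sketch; the base_url path below is an assumption for illustration, so substitute the OpenAI-compatible path documented for your model.

```python
import httpx
from openai import OpenAI

# Sketch: route the OpenAI SDK through your own httpx.Client.
# The base_url path is an assumption; use the OpenAI-compatible
# path documented for your Baseten model. Configure limits and
# timeouts on this client as shown in the sections below.
openai_client = OpenAI(
    api_key=api_key,
    base_url=f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
    http_client=httpx.Client(),
)
```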
Connection pooling keeps a set of open TCP connections ready for reuse. When
your client sends a request, it draws from this pool instead of opening a new
connection. This avoids the cost of repeated TCP handshakes and TLS
negotiations, which can add 50-100ms of latency per request.
The default httpx pool limits (100 total connections, 20 keep-alive connections) work for
moderate workloads, but high-throughput applications that send hundreds of
concurrent requests will exhaust these limits. When the pool is full, new
requests block until a connection becomes available, resulting in PoolTimeout
errors or increased latency.
Increase the pool limits based on your peak concurrency using httpx.Limits. The
max_keepalive_connections setting controls how many idle connections stay
open, and keepalive_expiry controls how long idle connections persist before
closing. Baseten keeps connections alive for 60-120 seconds, so set your
client's expiry below that 60-second minimum: the client then closes idle
connections before the server does and never tries to reuse a connection the
server has already dropped.
```python
import httpx

limits = httpx.Limits(
    max_connections=256,
    max_keepalive_connections=128,
    keepalive_expiry=30,
)

client = httpx.Client(
    base_url=f"https://model-{model_id}.api.baseten.co",
    headers={"Authorization": f"Api-Key {api_key}"},
    limits=limits,
)
```
Recommended values
| Setting | Default (httpx) | Recommended |
|---|---|---|
| Max connections | 100 | 256 |
| Max keepalive connections | 20 | 128 |
| Keep-alive idle timeout | 5s | 30s |
| Keep-alives | Enabled | Enabled |
These values apply when calling a single Baseten model endpoint.
If you call multiple models, increase max connections proportionally.
Keep-alives are always enabled on Baseten.
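As a rough illustration of scaling for multiple models, size the pool from your expected per-model concurrency. The numbers below are placeholders, not measured values.

```python
import httpx

# Sketch: scale pool limits with the number of Baseten models you call.
# per_model_concurrency and num_models are illustrative placeholders.
per_model_concurrency = 256
num_models = 3

limits = httpx.Limits(
    max_connections=per_model_concurrency * num_models,
    max_keepalive_connections=(per_model_concurrency * num_models) // 2,
    keepalive_expiry=30,
)
```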
Set timeouts
httpx applies a default 5-second timeout to all operations, which is too short
for most inference workloads. LLM generation, image processing, and other model
inference tasks routinely take tens of seconds to minutes. Without properly
configured timeouts, your client will close connections before the model
finishes processing.
Set client timeouts based on your model’s expected response time. Baseten’s
ingress proxy allows up to 10 minutes (600 seconds) for synchronous predict
requests, but your client-side timeouts should reflect your actual workload
rather than matching the server maximum.
httpx lets you configure four separate timeout values with httpx.Timeout. Separating connect and
read timeouts prevents slow network conditions from being confused with slow
model responses.
```python
import httpx

timeout = httpx.Timeout(
    connect=10.0,  # Time to establish connection
    read=600.0,    # Time to receive response
    write=30.0,    # Time to send request body
    pool=10.0,     # Time to acquire a connection from the pool
)

client = httpx.Client(
    base_url=f"https://model-{model_id}.api.baseten.co",
    headers={"Authorization": f"Api-Key {api_key}"},
    timeout=timeout,
)
```
Timeout guidance by use case
| Use case | Connect | Read | Notes |
|---|---|---|---|
| LLM inference (sync) | 10s | 600s | Long generation times |
| Embedding/classification | 10s | 60s | Faster response |
| Async predict (submit) | 10s | 30s | Just submitting the job |
| Streaming | 10s | 600s | Keep open for full stream |
For long-running requests that exceed sync timeouts, use async inference with polling.
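As a sketch of that pattern: submit the job, then poll for status. The endpoint paths, response fields, and status values below are illustrative assumptions, not the confirmed async API; check the async inference docs for the exact paths and schema.

```python
import time
import httpx

# Sketch of submit-then-poll. Endpoint paths, field names, and status
# values here are assumptions; consult the async inference docs.
def async_predict(client: httpx.Client, payload: dict) -> dict:
    # Submitting is fast, so a short read timeout is appropriate here.
    submitted = client.post(
        "/environments/production/async_predict",
        json=payload,
        timeout=httpx.Timeout(connect=10.0, read=30.0, write=30.0, pool=10.0),
    )
    submitted.raise_for_status()
    request_id = submitted.json()["request_id"]  # assumed field name

    while True:
        status = client.get(f"/async_request/{request_id}")  # assumed path
        status.raise_for_status()
        body = status.json()
        if body.get("status") in ("SUCCEEDED", "FAILED"):  # assumed states
            return body
        time.sleep(2)  # poll interval; tune for your workload
```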
Implement retries
Transient errors happen at scale and can negatively impact your application’s reliability and throughput.
Retry with exponential backoff using libraries like tenacity.
Only retry on transient errors. Retrying client errors like 400 or 401 wastes
time and can mask bugs in your request payload.
Retry on these status codes and connection errors:
- 429 (rate limited)
- 500 (internal server error)
- 502 (bad gateway)
- 503 (service unavailable)
- 504 (gateway timeout)
- Connection errors (ConnectError, ReadTimeout)
Don’t retry on these status codes:
- 400 (bad request)
- 401 (unauthorized)
- 403 (forbidden)
- 404 (not found)
- 422 (validation error)
The following example uses httpx with tenacity to retry failed requests with exponential backoff.
```python
import httpx
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception

def is_retryable(exception):
    if isinstance(exception, httpx.HTTPStatusError):
        return exception.response.status_code in (429, 500, 502, 503, 504)
    return isinstance(exception, (httpx.ConnectError, httpx.ReadTimeout))

@retry(
    retry=retry_if_exception(is_retryable),
    wait=wait_exponential(multiplier=1, min=1, max=30),
    stop=stop_after_attempt(5),
)
def predict(client, payload):
    response = client.post("/environments/production/predict", json=payload)
    response.raise_for_status()
    return response.json()
```
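With these settings, tenacity waits roughly 2, 4, 8, and 16 seconds between the five attempts (bounded between 1 and 30 seconds) and re-raises the last error once attempts are exhausted.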
Handle errors
Many errors that look like platform outages actually originate from client-side
misconfiguration. Before opening a support ticket, check whether your error
matches one of these common patterns. If you see PoolTimeout or
Connection refused under high concurrency, the issue is almost always your
client’s pool configuration, not Baseten’s servers.
| Error | Likely cause | Resolution |
|---|---|---|
| PoolTimeout | Connection pool exhausted | Increase pool size or reduce concurrency |
| ConnectTimeout | Network issue or server unavailable | Check network, then retry |
| ReadTimeout | Model taking longer than expected | Increase read timeout for your use case |
| Connection refused | Client-side port or pool exhaustion | Increase pool limits, check NAT config |
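Distinguishing these failure modes in code helps route each one to the right fix. httpx exposes each as a distinct exception class, so a sketch like the following can separate them:

```python
import httpx

def predict_with_diagnostics(client: httpx.Client, payload: dict) -> dict:
    """Map each httpx failure mode to the resolution in the table above."""
    try:
        response = client.post("/environments/production/predict", json=payload)
        response.raise_for_status()
        return response.json()
    except httpx.PoolTimeout:
        raise  # Pool exhausted: increase limits or reduce concurrency
    except httpx.ConnectTimeout:
        raise  # Network issue or server unreachable: check network, retry
    except httpx.ReadTimeout:
        raise  # Model slower than the read timeout: raise it for this workload
    except httpx.ConnectError:
        raise  # Includes "Connection refused": check pool limits and NAT ports
```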
Monitor connections
Connection problems tend to surface as intermittent failures rather than
complete outages, making them difficult to diagnose without proper monitoring. A
gradually exhausting connection pool won’t cause errors until it’s completely
full, at which point requests start failing unpredictably.
Watch for these signals:
- Rising p99 latency without changes to model performance, which often indicates pool contention.
- Sporadic Connection refused errors under load, which point to port or pool exhaustion.
- TCP retransmits increasing over time, which suggest connections are being dropped and recreated.
If you route traffic through a NAT gateway, monitor port utilization.
Each outbound connection consumes a port, and high-concurrency workloads can exhaust the available port range, causing intermittent connection failures that are difficult to distinguish from server-side issues.
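As a starting point, a thin wrapper that records per-request latency and connection errors is enough to spot pool contention before it becomes an outage. A minimal sketch using only the standard library; compute p99 from the recorded samples on whatever interval suits your monitoring stack.

```python
import time
import httpx

latencies: list[float] = []  # latency samples for p99 tracking
connection_errors = 0

def timed_predict(client: httpx.Client, payload: dict) -> dict:
    """Record latency and connection failures to surface pool contention."""
    global connection_errors
    start = time.perf_counter()
    try:
        response = client.post("/environments/production/predict", json=payload)
        response.raise_for_status()
        return response.json()
    except (httpx.ConnectError, httpx.PoolTimeout):
        connection_errors += 1  # rising counts point to pool or port exhaustion
        raise
    finally:
        latencies.append(time.perf_counter() - start)
```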
Use with proxies
Enterprise deployments often route traffic through HTTP proxies for security, logging, or network policy enforcement. httpx supports proxy configuration at the client level, so connection pooling and keep-alives continue to work through the proxy.
You may need to increase your pool limits when using a proxy, since the additional network hop increases per-request latency, which means connections are held open longer and the pool drains faster under the same concurrency.
```python
import httpx

client = httpx.Client(
    base_url=f"https://model-{model_id}.api.baseten.co",
    headers={"Authorization": f"Api-Key {api_key}"},
    proxy="http://corporate-proxy.example.com:8080",
    limits=httpx.Limits(max_connections=300),
)
```
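Note that httpx also honors the standard HTTP_PROXY and HTTPS_PROXY environment variables by default (trust_env=True), so an explicit proxy argument is only needed when the proxy is not configured in the environment or you want per-client control.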
Further reading