Most models on Baseten accept input data as a Base64-encoded string within a JSON payload. While this is convenient for many use cases, Base64 encoding inflates the payload by roughly 33% and adds encode/decode time, which may be unacceptable for latency-sensitive applications.

For scenarios where low latency is critical—such as real-time transcription using Whisper—you can reduce input data overhead by sending it directly as bytes. This approach minimizes serialization overhead, helping to lower end-to-end latency.
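To get a feel for the overhead, here is a quick sketch using a 1 MB buffer of dummy bytes as a stand-in for a real audio file:

```python
import base64
import json

# 1 MB of dummy audio bytes (placeholder for a real WAV file)
audio = bytes(1_000_000)

# Base64-in-JSON: the encoded string is ~33% larger than the raw bytes,
# and both ends pay the cost of encoding and decoding it
json_payload = json.dumps({"audio_bytes": base64.b64encode(audio).decode()}).encode()

print(len(audio))         # 1000000
print(len(json_payload))  # ~1.33 MB (about 33% larger)
```

Sending the raw bytes avoids both the size inflation and the encode/decode step.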

Below is a Python snippet demonstrating how to send audio data as bytes to a Whisper endpoint on Baseten:

```python
import requests
import msgpack

# Your Baseten model ID
model_id = ""

headers = {
    "Authorization": "Api-Key YOUR-BASETEN-API-KEY",
    "Content-Type": "application/octet-stream",
}

# Read the raw audio bytes; no Base64 encoding is needed
with open("/path/to/audio/file.wav", "rb") as f:
    audio_bytes = f.read()

body = {
    "whisper_input": {
        "audio": {
            "audio_bytes": audio_bytes,
        }
    }
}

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/environments/production/predict",
    headers=headers,
    data=msgpack.packb(body),
)
resp.raise_for_status()

# The response body is also msgpack-encoded
print(msgpack.unpackb(resp.content))
```

The two key details to note are:

  1. Setting the header `Content-Type: application/octet-stream`.
  2. Serializing the request body with msgpack, as in `data=msgpack.packb(body)`, instead of sending a standard JSON object.
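The second point matters because raw bytes are not JSON-serializable at all. A minimal sketch (with a few placeholder bytes standing in for audio) shows why msgpack is used for the body:

```python
import json
import msgpack

# A toy body with raw bytes, mirroring the request structure above
body = {"whisper_input": {"audio": {"audio_bytes": b"RIFF\x00\x01\x02"}}}

# json.dumps cannot serialize raw bytes...
try:
    json.dumps(body)
except TypeError as e:
    print("JSON failed:", e)

# ...while msgpack round-trips them untouched
restored = msgpack.unpackb(msgpack.packb(body))
assert restored["whisper_input"]["audio"]["audio_bytes"] == b"RIFF\x00\x01\x02"
```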