Mistral 7B Instruct from Baseten’s model library is a recent LLM with streaming support. Invocation is the same for any other model library LLM, as well as for any Truss that follows the same standard. Deploy Mistral 7B Instruct or a similar LLM to run the following examples.
When using a streaming endpoint with cURL, use the --no-buffer flag to stream output as it is received. As with all cURL invocations, you’ll need a model ID and API key.
curl -X POST https://app.baseten.co/models/MODEL_ID/predict \
  -H 'Authorization: Api-Key YOUR_API_KEY' \
  -d '{"prompt": "What is the Mistral wind?", "stream": true}' \
  --no-buffer
Let’s take things a step further and look at how to integrate streaming output with a Python application.
import requests
import json
import os

# Model ID for production deployment
model_id = ""

# Read secrets from environment variables
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Open session to enable streaming
s = requests.Session()
with s.post(
    # Endpoint for production deployment, see API reference for more
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    # Include "stream": True in the data dict so the model knows to stream
    data=json.dumps({
        "prompt": "What even is AGI?",
        "stream": True,
        "max_new_tokens": 4096
    }),
    # Include stream=True as an argument so the requests library knows to stream
    stream=True,
) as resp:
    # Print the generated tokens as they get streamed
    for content in resp.iter_content():
        print(content.decode("utf-8"), end="", flush=True)
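In a real application, you’ll often want to wrap the streaming call so the rest of your code can consume tokens as a plain Python iterator instead of dealing with the HTTP response directly. The sketch below shows one way to do that, assuming the same endpoint and payload shape as the example above; the stream_completion helper name is hypothetical, not part of any Baseten SDK.

import json
import os
from typing import Iterator

import requests


def stream_completion(prompt: str, model_id: str) -> Iterator[str]:
    # Hypothetical helper (not a Baseten API): yields decoded text chunks
    # as the model streams them back.
    baseten_api_key = os.environ["BASETEN_API_KEY"]
    with requests.post(
        f"https://model-{model_id}.api.baseten.co/production/predict",
        headers={"Authorization": f"Api-Key {baseten_api_key}"},
        data=json.dumps({"prompt": prompt, "stream": True, "max_new_tokens": 4096}),
        stream=True,
    ) as resp:
        resp.raise_for_status()
        # Fall back to UTF-8 if the response doesn't declare an encoding,
        # so decode_unicode below yields str rather than raw bytes
        resp.encoding = resp.encoding or "utf-8"
        # decode_unicode=True lets requests handle multi-byte characters
        # that may be split across chunk boundaries
        for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
            if chunk:
                yield chunk


# Usage: print tokens as they arrive while also keeping the full text
full_response = ""
for token in stream_completion("What even is AGI?", model_id="MODEL_ID"):
    print(token, end="", flush=True)
    full_response += token

Wrapping the stream in a generator keeps the HTTP details in one place, so downstream code, whether a CLI, a web handler, or a chat UI, only sees an iterator of strings.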