
Deploying models

Deploy any model to Baseten
This page guides you through deploying a model from the Baseten Python client. If you just want to deploy a popular foundation model as-is, you can do so from the model library.
In this doc, we'll deploy and publish WizardLM, an open-source LLM tuned for chat. But the code works for any model packaged as a Truss.

Setup

Make sure you've done three quick steps before deploying models:
  1. Install the Baseten Python client with pip install --upgrade baseten
  2. Create an API key for your Baseten account
  3. In your terminal, authenticate with:
baseten login

Deploying a draft

For this step, you'll need a model packaged as a Truss. We'll use WizardLM. You can download it with:
git clone https://github.com/basetenlabs/wizardlm-truss.git
You can also use these steps to deploy your own packaged model or a Truss from GitHub. With your Truss in hand, let's deploy.
To deploy your model, run:
import baseten
import truss

wizardlm = truss.load("./wizardlm-truss")
baseten.deploy(
    wizardlm,
    model_name="WizardLM",
    # is_trusted=True,  # Uncomment to give the model access to secrets
)
This will deploy your model in a draft state. Draft models differ from published models in three important ways:
  1. Draft models are mutable and are not versioned. This means you can change your model as a draft over and over again without incrementing versions or changing version IDs.
  2. Most updates are compatible with live reload, making testing changes 10X to 100X faster.
  3. Draft models are not suitable for production workloads.

Invoking draft models

You can invoke your draft model via the Baseten Python client with the model's draft version ID:
import baseten
model = baseten.deployed_model_version_id("qwerty12") # Replace with your actual version ID
model.predict({"prompt": "What is the difference between a wizard and a sorcerer?"})

Live reload

Draft models have live reload, which dramatically reduces the time it takes to test most changes by patching the running model server rather than re-deploying the entire model serving environment.
Right now, draft models support live reload for:
  • Changes to files and subdirectories in your Truss' model/ directory, such as ./wizardlm-truss/model/model.py
  • Changes to your required Python packages in ./wizardlm-truss/config.yaml
  • Changes to your required system packages in ./wizardlm-truss/config.yaml
In the future, draft models will also support live reload for environment variables and model binaries.
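Both kinds of dependency changes live in your Truss's config.yaml. A minimal sketch of the relevant keys (the specific package names are illustrative, not taken from the WizardLM Truss):

```yaml
# ./wizardlm-truss/config.yaml (excerpt)
requirements:        # Python packages, pip-style pins
  - transformers
  - torch==2.0.1
system_packages:     # system-level packages, installed via apt
  - ffmpeg
```

Editing either list and re-deploying the draft should trigger a live-reload patch rather than a full rebuild.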
After updating your Truss, just re-run:
wizardlm = truss.load("./wizardlm-truss")
baseten.deploy(
    wizardlm,
    model_name="WizardLM",  # model_name MUST stay the same between deployments of the same model
)
Your changes, if compatible with live reload, will be patched onto the running model server. Otherwise, the model serving environment will be rebuilt.

Draft model limitations

Draft models are intended for development and testing, not production workloads. As such, draft models have a few limitations:
  1. Draft models receive the instance type you specify but don't scale beyond one replica.
  2. Draft models scale to zero after twenty minutes of inactivity. Unlike published models, this behavior is not configurable.
  3. Requests to draft models may fail if they are sent while the model is updating.
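Because requests can fail mid-update, it can help to wrap invocations in a small client-side retry during development. A minimal sketch (the helper name and retry policy are illustrative, not part of the Baseten client; pass your model's predict method as `predict_fn`):

```python
import time


def predict_with_retry(predict_fn, payload, retries=3, delay=2.0):
    """Call a predict function, retrying transient failures.

    `predict_fn` stands in for a Baseten model's `predict` method;
    any exception is treated as potentially transient (e.g. the
    draft model is mid-update) and retried with linear backoff.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return predict_fn(payload)
        except Exception as error:
            last_error = error
            time.sleep(delay * (attempt + 1))
    raise last_error
```

For example, `predict_with_retry(model.predict, {"prompt": "..."})` retries a couple of times before surfacing the error.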

Publishing your model

Once you've tested your model and are ready to use it in production, it's time to publish your model.
What happens when you publish your model?
  • The model will be re-built onto production autoscaling infrastructure
  • The published model will have a new version ID, pointing to the new published version
  • The draft model will be deleted and its version ID will no longer be valid

Publishing via the Python client

You can publish your model via the Python client. Just add publish=True to your baseten.deploy() invocation:
wizardlm = truss.load("./wizardlm-truss")
baseten.deploy(
    wizardlm,
    model_name="WizardLM",  # model_name MUST stay the same between deployments of the same model
    # is_trusted=True,  # Uncomment to give the model access to secrets
    publish=True,
)
Your model will rebuild onto production infrastructure and you will receive an email when the process is complete.
You can use publish=True during your initial deployment to skip the draft model stage.

Publishing via the Baseten UI

You can also publish your draft from the Baseten UI. From the model's page, click on the three-dot menu next to "Draft" and select "Publish model version."
Your model will rebuild onto production infrastructure and you will receive an email when the process is complete.
If you do not want to publish your draft model to production, you can deactivate or delete it just like any other model version.

Invoking published models

You can invoke your published model the same way, using its new version ID:
import baseten
model = baseten.deployed_model_version_id("qwerty12") # Replace with your actual version ID
model.predict({"prompt": "What is the difference between a wizard and a sorcerer?"})