

This guide walks through deploying Mistral-7B, a powerful large language model (LLM), using Truss. You’ll configure the model, set up inference, allocate resources, and deploy it as an API endpoint.

1. Set up your model

Start by importing the necessary libraries:
model/model.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
Specify the Hugging Face model checkpoint:
model/model.py
CHECKPOINT = "mistralai/Mistral-7B-v0.1"

2. Define the model class

Create a Model class that loads Mistral-7B and its tokenizer when the server starts:
model/model.py
class Model:
    def __init__(self, **kwargs) -> None:
        self.tokenizer = None
        self.model = None

    def load(self):
        self.model = AutoModelForCausalLM.from_pretrained(
            CHECKPOINT, torch_dtype=torch.float16, device_map="auto"
        )
        self.tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

3. Implement inference

The predict function handles inference by tokenizing input, generating text, and decoding the output.
model/model.py
    def predict(self, request: dict):
        prompt = request.pop("prompt")
        generate_args = {
            "max_new_tokens": request.get("max_new_tokens", 128),
            "temperature": request.get("temperature", 1.0),
            "top_p": request.get("top_p", 0.95),
            "top_k": request.get("top_k", 50),
            "repetition_penalty": 1.0,
            "use_cache": True,
            "do_sample": True,
            "eos_token_id": self.tokenizer.eos_token_id,
            "pad_token_id": self.tokenizer.pad_token_id,
        }

        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids.to(
            self.model.device
        )

        with torch.no_grad():
            output = self.model.generate(input_ids=input_ids, **generate_args)
            return self.tokenizer.decode(output[0], skip_special_tokens=True)
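The request handling above can be sanity-checked without loading the model. Here is a minimal sketch (the helper name `build_generate_args` is ours, not part of Truss) that mirrors how predict splits a request dict into a prompt and generation arguments with the same defaults:

```python
def build_generate_args(request: dict):
    """Mirror of predict()'s request handling: pop the prompt,
    read sampling parameters with the same defaults."""
    prompt = request.pop("prompt")
    generate_args = {
        "max_new_tokens": request.get("max_new_tokens", 128),
        "temperature": request.get("temperature", 1.0),
        "top_p": request.get("top_p", 0.95),
        "top_k": request.get("top_k", 50),
    }
    return prompt, generate_args

prompt, args = build_generate_args({"prompt": "Hello", "temperature": 0.7})
print(prompt)                  # Hello
print(args["temperature"])     # 0.7
print(args["max_new_tokens"])  # 128
```

Callers that omit a parameter get the documented default, so the endpoint accepts anything from a bare `{"prompt": ...}` to a fully specified sampling configuration.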

4. Configure your deployment

Define dependencies

Specify the necessary Python packages in config.yaml:
config.yaml
model_name: Mistral 7B
python_version: py311
requirements:
  - transformers==4.42.3
  - sentencepiece==0.1.99
  - accelerate==0.23.0
  - torch==2.0.1
  - numpy==1.26.4

Allocate compute resources

Mistral-7B requires an NVIDIA A10G GPU for efficient inference:
config.yaml
resources:
  accelerator: A10G
  use_gpu: true
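Putting the two fragments together, the complete config.yaml for this deployment looks like this (all values taken from the snippets above):

```yaml
model_name: Mistral 7B
python_version: py311
requirements:
  - transformers==4.42.3
  - sentencepiece==0.1.99
  - accelerate==0.23.0
  - torch==2.0.1
  - numpy==1.26.4
resources:
  accelerator: A10G
  use_gpu: true
```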

5. Deploy the model

Push your Truss to Baseten:
$ truss push
Once deployed, call the model using the Truss CLI:
$ truss predict --published -d '{"prompt": "What is a large language model?"}'
Or send a request to the API endpoint:
import requests

response = requests.post(
    "https://model-{yourmodelid}.api.baseten.co/production/predict",
    headers={"Authorization": "Api-Key YOUR_API_KEY"},
    json={"prompt": "Explain quantum computing in simple terms"}
)

print(response.json())
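The endpoint URL is templated on your model ID and deployment environment. A small sketch (the helper name and the model ID `abc123` are ours, for illustration) of building it:

```python
def endpoint_url(model_id: str, deployment: str = "production") -> str:
    # Fill Baseten's templated predict endpoint with a model ID
    # and deployment name (e.g. "production").
    return f"https://model-{model_id}.api.baseten.co/{deployment}/predict"

print(endpoint_url("abc123"))
# https://model-abc123.api.baseten.co/production/predict
```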

6. Check for optimized engine support

For optimized performance, Baseten supports both open-source and Baseten-optimized inference engines, such as Baseten's TensorRT-LLM, Baseten-Embeddings-Inference, vLLM, and SGLang.