This guide walks through deploying Mistral-7B, a powerful large language model (LLM), using Truss. You’ll configure the model, set up inference, allocate resources, and deploy it as an API endpoint.
1. Set up your model
Start by importing the necessary libraries:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
Specify the Hugging Face model checkpoint:
CHECKPOINT = "mistralai/Mistral-7B-v0.1"
2. Define the model class
Create a Model class that loads Mistral-7B and its tokenizer when the server starts. Truss calls load() once, before the server accepts traffic, so the expensive weight download happens ahead of the first request:
class Model:
    def __init__(self, **kwargs) -> None:
        self.tokenizer = None
        self.model = None

    def load(self):
        self.model = AutoModelForCausalLM.from_pretrained(
            CHECKPOINT, torch_dtype=torch.float16, device_map="auto"
        )
        self.tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
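One caveat: Mistral's tokenizer ships without a padding token, so self.tokenizer.pad_token_id is None by default. The predict method below passes that value to generate(), which then falls back to the EOS token and logs a warning. A minimal, optional extension of load() that makes the fallback explicit (our suggestion, not required by Truss):

def load(self):
    self.model = AutoModelForCausalLM.from_pretrained(
        CHECKPOINT, torch_dtype=torch.float16, device_map="auto"
    )
    self.tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    # Mistral's tokenizer defines no pad token; reuse EOS so that generate()
    # receives a concrete pad_token_id instead of None.
    if self.tokenizer.pad_token_id is None:
        self.tokenizer.pad_token = self.tokenizer.eos_token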
3. Implement inference
The predict method, also part of the Model class, handles inference by tokenizing the input, generating text, and decoding the output.
def predict(self, request: dict):
    prompt = request.pop("prompt")
    generate_args = {
        "max_new_tokens": request.get("max_new_tokens", 128),
        "temperature": request.get("temperature", 1.0),
        "top_p": request.get("top_p", 0.95),
        "top_k": request.get("top_k", 50),
        "repetition_penalty": 1.0,
        "use_cache": True,
        "do_sample": True,
        "eos_token_id": self.tokenizer.eos_token_id,
        "pad_token_id": self.tokenizer.pad_token_id,
    }
    input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids.cuda()
    with torch.no_grad():
        output = self.model.generate(input_ids=input_ids, **generate_args)
    return self.tokenizer.decode(output[0])
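Before deploying, you can sanity-check the class locally. The snippet below is a hypothetical smoke test, not part of the Truss workflow; it assumes a CUDA-capable GPU and the packages from config.yaml installed in your environment:

if __name__ == "__main__":
    model = Model()
    model.load()  # Truss normally calls this once at server startup
    print(model.predict({"prompt": "What is a large language model?"}))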
4. Configure your deployment
Define dependencies
Specify the necessary Python packages in config.yaml:
model_name: Mistral 7B
python_version: py311
requirements:
  - transformers==4.42.3
  - sentencepiece==0.1.99
  - accelerate==0.23.0
  - torch==2.0.1
  - numpy==1.26.4
Allocate compute resources
A single NVIDIA A10G (24 GB of VRAM) comfortably fits Mistral-7B's roughly 14 GB of float16 weights, making it a good fit for efficient inference:
resources:
  accelerator: A10G
  use_gpu: true
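The resources block also accepts CPU and memory requests. A plausible fuller configuration (the cpu and memory values here are illustrative assumptions, not tuned recommendations):

resources:
  accelerator: A10G
  use_gpu: true
  cpu: "4"
  memory: 16Gi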
5. Deploy the model
Push your Truss to Baseten:

$ truss push
Once deployed, call the model using the Truss CLI:
$ truss predict --published -d '{"prompt": "What is a large language model?"}'
Or send a request to the API endpoint:
import requests

response = requests.post(
    "https://model-{yourmodelid}.api.baseten.co/production/predict",
    headers={"Authorization": "Api-Key YOUR_API_KEY"},
    json={"prompt": "Explain quantum computing in simple terms"},
)
print(response.json())
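Because predict() reads max_new_tokens, temperature, top_p, and top_k from the request dict, callers can tune generation per request without redeploying. For example:

response = requests.post(
    "https://model-{yourmodelid}.api.baseten.co/production/predict",
    headers={"Authorization": "Api-Key YOUR_API_KEY"},
    json={
        "prompt": "Explain quantum computing in simple terms",
        "max_new_tokens": 256,  # default is 128
        "temperature": 0.7,     # lower values make output more deterministic
    },
)
print(response.json())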
6. Check for optimized engine support
For higher performance, Baseten supports both open-source and Baseten-optimized inference engines, such as Baseten's TensorRT-LLM, Baseten-Embeddings-Inference, vLLM, and SGLang.