Deploy Llama 3.3 70B Instruct

Llama 3.3 70B Instruct is a highly capable large language model optimized for dialogue, reasoning, and instruction following. As one of the most capable open-weight models available, it offers performance comparable to proprietary models while giving you full control over deployment and data. On Baseten, you can deploy Llama 3.3 using the Engine-Builder-LLM for production-grade inference.

Deploy Llama 3.3 70B Instruct

Llama 3.3 70B requires significant GPU memory. We recommend an H100 or A100 node with at least two GPUs (using tensor parallelism) to serve the model efficiently with FP8 quantization.

Configuration

The following config.yaml uses the Engine-Builder-LLM to serve Llama 3.3. Note that this is a gated model; you must accept the license on Hugging Face and provide an hf_access_token secret in your Baseten workspace.
model_name: llama-3-3-70b
resources:
  accelerator: H100:2
  cpu: '1'
  memory: 24Gi
  use_gpu: true
secrets:
  hf_access_token: null
trt_llm:
  build:
    checkpoint_repository:
      repo: meta-llama/Llama-3.3-70B-Instruct
      source: HF
    num_builder_gpus: 4
    quantization_type: fp8_kv
    max_seq_len: 131072
    tensor_parallel_count: 2
  runtime:
    enable_chunked_context: true
    request_default_max_tokens: 131072
Deploy the model with:
truss push

Run inference

Llama 3.3 on Engine-Builder-LLM provides a full OpenAI-compatible API, including support for streaming and tool calling.
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-{MODEL_ID}.api.baseten.co/environments/production/sync/v1"
)

response = client.chat.completions.create(
    model="llama-3-3-70b",
    messages=[
        {"role": "user", "content": "Explain the significance of the Llama in Inca culture."}
    ],
    max_tokens=1024
)

print(response.choices[0].message.content)
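Streaming works through the same OpenAI client by passing stream=True. A minimal sketch, reusing the {MODEL_ID} placeholder and BASETEN_API_KEY environment variable from the example above (the prompt and max_tokens value here are illustrative):

```python
import os

# Sketch: stream tokens as they arrive instead of waiting for the
# full completion. The API call only runs if BASETEN_API_KEY is set.
messages = [{"role": "user", "content": "Write a haiku about llamas."}]

if os.environ.get("BASETEN_API_KEY"):
    from openai import OpenAI  # requires `pip install openai`

    client = OpenAI(
        api_key=os.environ["BASETEN_API_KEY"],
        base_url="https://model-{MODEL_ID}.api.baseten.co/environments/production/sync/v1",
    )
    stream = client.chat.completions.create(
        model="llama-3-3-70b",
        messages=messages,
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        # Each chunk carries an incremental delta; print it immediately.
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
```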

Configuration and tuning

Llama 3.3 70B is a massive model that benefits significantly from hardware-specific optimizations.

Hardware and tensor parallelism

For 70B parameter models, tensor parallelism is essential. By splitting the model across two H100 GPUs (tensor_parallel_count: 2), we can fit the model in memory while maintaining low latency. The fp8_kv quantization further optimizes memory usage by using 8-bit precision for both weights and the KV cache.
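Back-of-the-envelope arithmetic (a rough sketch that ignores activation memory and engine overhead; the figures are approximations, not exact engine sizes) shows why two 80 GB H100s work with FP8:

```python
# Rough memory estimate for Llama 3.3 70B with fp8_kv quantization.
params = 70e9
bytes_per_param_fp8 = 1                            # FP8 weights: 1 byte each
weights_gb = params * bytes_per_param_fp8 / 1e9    # ~70 GB total

gpu_mem_gb = 80                                    # one H100 has 80 GB of HBM
tensor_parallel = 2
per_gpu_weights_gb = weights_gb / tensor_parallel  # ~35 GB per GPU

# Remaining memory per GPU holds the FP8 KV cache, activations,
# and CUDA/TensorRT overhead.
headroom_gb = gpu_mem_gb - per_gpu_weights_gb      # ~45 GB per GPU

print(f"{weights_gb:.0f} GB weights, {per_gpu_weights_gb:.0f} GB/GPU, "
      f"{headroom_gb:.0f} GB headroom/GPU")
```

At 16-bit precision the same weights would need ~140 GB, which is why FP8 quantization plus tensor parallelism is the practical fit for a two-GPU H100 deployment.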

Gated models

Because Llama 3.3 is a gated model, your deployment will fail if the hf_access_token is missing or invalid. Ensure you’ve created a secret named hf_access_token in your Baseten dashboard before pushing the model.
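Before pushing, you can sanity-check locally that your token actually has access to the gated repository. A minimal sketch using huggingface_hub (the HF_TOKEN environment variable here is an assumption for local testing, not part of the Baseten config):

```python
import os

# Sketch: verify a Hugging Face token can read the gated Llama repo
# before deploying. Requires `pip install huggingface_hub`; the check
# only runs if an HF_TOKEN environment variable is set locally.
REPO = "meta-llama/Llama-3.3-70B-Instruct"

token = os.environ.get("HF_TOKEN")
if token:
    from huggingface_hub import HfApi
    from huggingface_hub.utils import GatedRepoError, RepositoryNotFoundError

    try:
        HfApi().model_info(REPO, token=token)
        print(f"Token can access {REPO}; safe to use as hf_access_token.")
    except (GatedRepoError, RepositoryNotFoundError) as err:
        print(f"Token cannot access {REPO}; "
              f"accept the license on Hugging Face first. ({err})")
```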