Deploy Llama 3.3 70B Instruct
Llama 3.3 70B requires significant GPU memory. We recommend an H100 or A100 node with at least two GPUs (using tensor parallelism) to serve the model efficiently with FP8 quantization.

Configuration
The following config.yaml uses the Engine-Builder-LLM to serve Llama 3.3. Note that this is a gated model; you must accept the license on Hugging Face and provide an hf_access_token secret in your Baseten workspace.
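A sketch of what such a config.yaml might look like. The exact field names (notably the trt_llm build section) are assumptions based on common Truss conventions; check the Engine-Builder-LLM documentation for the authoritative schema.

```yaml
# Hypothetical config.yaml sketch for Llama 3.3 70B Instruct.
# Field names may differ -- consult the Engine-Builder-LLM docs.
model_name: Llama 3.3 70B Instruct
resources:
  accelerator: H100:2          # two H100 GPUs for tensor parallelism
trt_llm:
  build:
    checkpoint_repository:
      source: HF
      repo: meta-llama/Llama-3.3-70B-Instruct
    tensor_parallel_count: 2   # split the model across both GPUs
    quantization_type: fp8_kv  # FP8 weights and FP8 KV cache
secrets:
  hf_access_token: null        # set the real token in your Baseten workspace
```

The secrets stanza declares that the deployment reads hf_access_token at build time; the actual token value lives in your workspace, never in the config file.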
Run inference
Llama 3.3 on Engine-Builder-LLM provides a full OpenAI-compatible API, including support for streaming and tool calling. You can call it with the OpenAI Python SDK or with cURL.
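A minimal Python sketch of calling the deployed model's OpenAI-compatible chat completions endpoint. The base URL and model ID below are placeholders, and the Api-Key authorization scheme is an assumption; substitute the values shown for your deployment in the Baseten dashboard.

```python
# Sketch: call the OpenAI-compatible endpoint of a deployed Llama 3.3 model.
# BASE_URL and the model ID are placeholders for your own deployment's values.
import json
import os
import urllib.request

BASE_URL = "https://model-xxxxxxxx.api.baseten.co/environments/production/sync/v1"

payload = {
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain tensor parallelism in one sentence."},
    ],
    "stream": False,
    "max_tokens": 256,
}

api_key = os.environ.get("BASETEN_API_KEY")
if api_key:  # only make the network call when credentials are available
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Api-Key {api_key}",  # assumed auth header format
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
        print(body["choices"][0]["message"]["content"])
```

Set stream to True to receive tokens incrementally; the response then arrives as server-sent events rather than a single JSON body.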
Configuration and tuning
Llama 3.3 70B is a massive model that benefits significantly from hardware-specific optimizations.

Hardware and tensor parallelism
For 70B parameter models, tensor parallelism is essential. By splitting the model across two H100 GPUs (tensor_parallel_count: 2), we can fit the model in memory while maintaining low latency. The fp8_kv quantization further optimizes memory usage by using 8-bit precision for both weights and the KV cache.
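The memory math above can be sketched with a back-of-envelope estimate. The numbers are approximate and assume 80 GB H100s and one byte per FP8 weight; real deployments also reserve memory for the KV cache, activations, and engine overhead.

```python
# Back-of-envelope memory estimate for Llama 3.3 70B served in FP8
# across two 80 GB GPUs with tensor parallelism (TP=2).
params = 70e9            # ~70 billion parameters
bytes_per_param = 1      # FP8 stores one byte per weight
weights_gb = params * bytes_per_param / 1e9

num_gpus = 2             # tensor_parallel_count: 2
gpu_memory_gb = 80       # assumed H100 80 GB
per_gpu_weights_gb = weights_gb / num_gpus

print(f"Total FP8 weights: {weights_gb:.0f} GB")
print(f"Per GPU (TP=2):    {per_gpu_weights_gb:.0f} GB of {gpu_memory_gb} GB")
# The headroom left on each GPU holds the FP8 KV cache, activations,
# and runtime overhead -- which is why fp8_kv quantization matters.
```

In FP16 the same weights would need roughly 140 GB, which would not fit on two 80 GB GPUs once the KV cache is accounted for; FP8 halves that footprint.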
Gated models
Because Llama 3.3 is a gated model, your deployment will fail if the hf_access_token is missing or invalid. Ensure you’ve created a secret named hf_access_token in your Baseten dashboard before pushing the model.
Related
- Model APIs — Instant access to Llama 3.3 via shared endpoints.
- Engine-Builder-LLM documentation — Details on the engine used for this model.
- Truss examples — Source code for this Truss.