Deploy Llama 3.1 8B Instruct
Llama 3.1 8B is small enough to run on a wide range of hardware. For production use, we recommend an H100 or A10G GPU. On an H100, the model can run at full precision (no_quant) while still delivering blazing fast inference.
Configuration
The following config.yaml serves Llama 3.1 8B using the Engine-Builder-LLM. Like the 70B variant, this model is gated and requires a Hugging Face access token.
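A minimal sketch of what that config.yaml could contain. The exact keys, particularly the trt_llm engine-builder section, are assumptions based on typical Truss configurations and may differ from the published example:

```yaml
model_name: Llama 3.1 8B Instruct
resources:
  accelerator: H100
  use_gpu: true
secrets:
  # Placeholder only; set the real token in your Baseten workspace secrets.
  hf_access_token: null
trt_llm:
  build:
    base_model: llama
    checkpoint_repository:
      repo: meta-llama/Llama-3.1-8B-Instruct
      source: HF
    # Full precision; see "Hardware and precision" below for the FP8 trade-off.
    quantization_type: no_quant
```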
Run inference
Llama 3.1 8B on Engine-Builder-LLM provides a full OpenAI-compatible API. Its smaller size makes it particularly responsive for streaming applications.
- Python SDK
- cURL
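A hedged Python sketch of a streaming chat completion call against the OpenAI-compatible endpoint. The base URL below is a placeholder (substitute your deployment's URL), and the request is only sent when a `BASETEN_API_KEY` environment variable is set:

```python
import json
import os
import urllib.request

# Placeholder; substitute your deployment's OpenAI-compatible base URL.
BASE_URL = "https://model-<id>.api.baseten.co/environments/production/sync/v1"


def build_chat_request(messages, max_tokens=512, stream=True):
    """Assemble an OpenAI-compatible chat completion payload."""
    return {
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": stream,
    }


payload = build_chat_request(
    [{"role": "user", "content": "What is the capital of France?"}]
)

api_key = os.environ.get("BASETEN_API_KEY")
if api_key:  # only send the request when credentials are configured
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Api-Key {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # stream=True yields server-sent event lines
            print(line.decode(), end="")
```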
Configuration and tuning
Despite its smaller size, Llama 3.1 8B supports a massive 128k context window, making it suitable for long-form document processing and retrieval-augmented generation (RAG).
Hardware and precision
Because the 8B model is relatively small, it can run in full BF16/FP16 precision (no_quant) on modern GPUs like the H100. This ensures maximum model accuracy. If you are deploying on hardware with less VRAM, such as an A10G, you may want to consider FP8 quantization to increase throughput and support larger batch sizes.
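As a back-of-envelope check of that trade-off, here is a rough sketch that counts weight memory only (ignoring KV cache and activation memory, which add to the real footprint):

```python
PARAMS = 8e9  # ~8 billion parameters


def weight_memory_gb(params, bytes_per_param):
    """Approximate GPU memory needed for the model weights alone."""
    return params * bytes_per_param / 1e9


bf16 = weight_memory_gb(PARAMS, 2)  # BF16/FP16: 2 bytes per weight
fp8 = weight_memory_gb(PARAMS, 1)   # FP8: 1 byte per weight

print(f"BF16 weights: ~{bf16:.0f} GB")  # fits comfortably on an 80 GB H100
print(f"FP8 weights:  ~{fp8:.0f} GB")   # leaves headroom on a 24 GB A10G
```

The freed VRAM under FP8 is what allows larger batch sizes and therefore higher throughput on smaller GPUs.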
Gated access
Remember to add your hf_access_token to your Baseten workspace secrets. This token is required to download the model weights from Hugging Face during the build process.
Related
- Model APIs — Instant access to Llama models via shared endpoints.
- Llama 3.3 70B — For more complex reasoning and dialogue tasks.
- Truss examples — Source code for this Truss.