
Deploy Llama 3.1 8B Instruct

Llama 3.1 8B Instruct is an exceptionally versatile model that offers a superior balance of performance and efficiency. It is the ideal choice for applications requiring low latency, such as real-time chatbots, or for deployments where minimizing hardware costs is a priority. On Baseten, you can serve Llama 3.1 8B using the Engine-Builder-LLM.

Llama 3.1 8B is small enough to run on a wide range of hardware. For production use, we recommend an H100 or A10G GPU. On an H100, the model can run at full precision (no_quant) while still delivering blazing fast inference.

Configuration

The following config.yaml serves Llama 3.1 8B using the Engine-Builder-LLM. Like the 70B variant, this model is gated and requires a Hugging Face access token.
```yaml
model_name: llama-3-1-8b
resources:
  accelerator: H100
  cpu: '1'
  memory: 24Gi
  use_gpu: true
secrets:
  hf_access_token: null
trt_llm:
  build:
    checkpoint_repository:
      repo: meta-llama/Llama-3.1-8B-Instruct
      source: HF
    num_builder_gpus: 1
    quantization_type: no_quant
    max_seq_len: 131072
    tensor_parallel_count: 1
  runtime:
    enable_chunked_context: true
    request_default_max_tokens: 131072
```
Deploy the model with:

```sh
truss push
```

Run inference

Llama 3.1 8B on Engine-Builder-LLM provides a full OpenAI-compatible API. Its smaller size makes it particularly responsive for streaming applications.
```python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    # Replace {MODEL_ID} with your deployed model's ID.
    base_url="https://model-{MODEL_ID}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="llama-3-1-8b",
    messages=[
        {"role": "user", "content": "What are three fun facts about Llamas?"}
    ],
    max_tokens=512,
)

print(response.choices[0].message.content)
```
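Because the model is responsive for streaming, a streaming variant of the call above is often a better fit for chat UIs. Below is a minimal sketch of a streaming helper; it assumes the same client object as above and iterates over the OpenAI SDK's streamed chunk objects (the `stream_chat` name is our own illustration, not part of the SDK).

```python
def stream_chat(client, prompt, model="llama-3-1-8b", max_tokens=512):
    """Yield text deltas from a streaming chat completion.

    `client` is any OpenAI-compatible client, such as the one
    constructed in the snippet above.
    """
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,  # ask the server to stream tokens as they are generated
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks (e.g. the final one) carry no content
            yield delta


# Usage with the client from the snippet above:
# for token in stream_chat(client, "What are three fun facts about Llamas?"):
#     print(token, end="", flush=True)
```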

Configuration and tuning

Despite its smaller size, Llama 3.1 8B supports a 128K-token context window (131,072 tokens, matching the max_seq_len and request_default_max_tokens settings above), making it suitable for long-form document processing and retrieval-augmented generation (RAG).
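The long context window makes simple context-stuffing RAG patterns practical: retrieved passages can be concatenated directly into the prompt rather than aggressively truncated. A minimal sketch follows; the `build_rag_messages` helper and its prompt format are our own illustration, not part of any SDK.

```python
def build_rag_messages(question, retrieved_chunks,
                       system_prompt="Answer using only the provided context."):
    """Assemble an OpenAI-style messages list that places retrieved
    passages into a single user turn. With a 128K-token window, many
    long passages usually fit without truncation."""
    context = "\n\n".join(retrieved_chunks)
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```

The returned list can be passed directly as the `messages` argument to `client.chat.completions.create`.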

Hardware and precision

Because the 8B model is relatively small, it can run in full BF16/FP16 precision (no_quant) on modern GPUs like the H100. This ensures maximum model accuracy. To increase throughput and support larger batch sizes, you can use FP8 quantization, which requires an FP8-capable GPU (Hopper or Ada Lovelace, e.g. H100 or L4). If you are deploying on hardware with less VRAM, such as an A10G, consider weight-only INT8 quantization instead, since the A10G's Ampere architecture does not support FP8.
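For example, an FP8 build on an H100 would only change the quantization setting in the config above. This fragment is a sketch based on that config; fp8 is assumed to be the corresponding quantization_type value in the engine builder.

```yaml
trt_llm:
  build:
    checkpoint_repository:
      repo: meta-llama/Llama-3.1-8B-Instruct
      source: HF
    quantization_type: fp8   # assumed value; replaces no_quant
    max_seq_len: 131072
    tensor_parallel_count: 1
```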

Gated access

Remember to add your hf_access_token to your Baseten workspace secrets. This token is required to download the model weights from Hugging Face during the build process.