# Llama 3.1

> Meta's Llama 3.1 8B instruction-tuned model. Runs NVIDIA's FP8 checkpoint on a single B200, with EAGLE3 speculative decoding for high concurrent throughput.

<div className="capability-pills">
  <a href="/examples/models/capabilities/tool-calling" className="capability-pill">Tool calling</a>
  <a href="/examples/models/capabilities/long-context" className="capability-pill">Long context</a>
</div>

## Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.

<Columns cols={2}>
  <Column>
    **Sign in to Baseten**

    ```sh theme={"system"}
    uvx truss login --browser
    ```
  </Column>

  <Column>
    **Install the OpenAI SDK**

    ```sh theme={"system"}
    uv pip install openai
    ```
  </Column>
</Columns>

[nvidia/Llama-3.1-8B-Instruct-FP8](https://huggingface.co/nvidia/Llama-3.1-8B-Instruct-FP8) is an 8B-parameter dense model with up to 128K context.

This preset serves Llama 3.1 8B Instruct on a single B200 through [Baseten Inference Stack](/engines/bis-llm/overview) (TensorRT-LLM) with FP8 weights, an FP8 KV cache, and EAGLE3 speculative decoding. It targets high concurrent throughput.

<CardGroup cols={4}>
  <Card title="Hardware" icon="microchip">B200</Card>
  <Card title="Engine" icon="server">TRT-LLM v2</Card>
  <Card title="Context" icon="ruler-horizontal">128K</Card>
  <Card title="Concurrency" icon="layer-group">512</Card>
</CardGroup>
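
For a rough sense of what the FP8 KV cache buys at a 128K context, the sketch below estimates cache size per token and per full-length sequence. The layer count, KV-head count, and head dimension are assumptions taken from the published Llama 3.1 8B architecture, not from this page, and the estimate ignores the EAGLE3 draft model's own cache and any engine overhead.

```python
# Assumed Llama 3.1 8B attention shape (from the published model config,
# not from this guide): 32 layers, 8 KV heads (GQA), head dimension 128.
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
MAX_SEQ_LEN = 131072  # the 128K context listed above


def kv_bytes_per_token(bytes_per_value: int) -> int:
    """Bytes of KV cache one token occupies across all layers (keys + values)."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_value


def kv_gib_per_sequence(seq_len: int, bytes_per_value: int) -> float:
    """GiB of KV cache for one sequence of seq_len tokens."""
    return seq_len * kv_bytes_per_token(bytes_per_value) / 2**30


print(kv_bytes_per_token(1))                # 65536 bytes (64 KiB) per token at FP8
print(kv_gib_per_sequence(MAX_SEQ_LEN, 1))  # 8.0 GiB for one full-length sequence at FP8
print(kv_gib_per_sequence(MAX_SEQ_LEN, 2))  # 16.0 GiB with a 16-bit cache
```

Under these assumptions, halving the bytes per cached value halves the footprint, so the same GPU memory holds roughly twice as many cached tokens.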

## Write the config

Create and move into the project directory:

```sh theme={"system"}
mkdir llama-3.1-8b-instruct-throughput && cd llama-3.1-8b-instruct-throughput
```

Then create a file named `config.yaml` and paste the following:

```yaml config.yaml theme={"system"}
model_name: "model:llama-3.1-8b-instruct preset:throughput"
model_metadata:
  example_model_input:
    messages:
      - role: user
        content: "Write FizzBuzz in Python"
    stream: true
    model: "nvidia/Llama-3.1-8B-Instruct-FP8"
    max_tokens: 512
    temperature: 0.5
  tags:
    - openai-compatible

resources:
  accelerator: B200
  cpu: "1"
  memory: 10Gi
  use_gpu: true

weights:
  - source: "hf://nvidia/Llama-3.1-8B-Instruct-FP8@main"
    mount_location: "/app/model_cache/trt_model"
    auth_secret_name: "hf_access_token"
  - source: "hf://yuhuili/EAGLE3-LLaMA3.1-Instruct-8B@main"
    mount_location: "/app/model_cache/eagle3_draft"
    auth_secret_name: "hf_access_token"

secrets:
  hf_access_token: null

trt_llm:
  build:
    checkpoint_repository:
      repo: michaelfeil/empty-model
      revision: main
      source: HF
  inference_stack: v2
  runtime:
    enable_chunked_prefill: true
    max_batch_size: 512
    max_num_tokens: 16384
    max_seq_len: 131072
    tensor_parallel_size: 1
    served_model_name: nvidia/Llama-3.1-8B-Instruct-FP8
    patch_kwargs:
      model_path: /app/model_cache/trt_model
      backend: pytorch
      sampler_type: TorchSampler
      guided_decoding_backend: xgrammar
      max_beam_width: 1
      max_input_len: 131072
      trust_remote_code: 1
      cuda_graph_config:
        enable_padding: true
        max_batch_size: 512
      kv_cache_config:
        dtype: fp8
        enable_block_reuse: true
        free_gpu_memory_fraction: 0.9
      speculative_config:
        decoding_type: Eagle
        max_draft_len: 3
        speculative_model_dir: /app/model_cache/eagle3_draft
        eagle3_one_model: true
  version_overrides:
    v2_llm_version: null

runtime:
  predict_concurrency: 512
```

This config tells Baseten to serve Llama 3.1 8B Instruct with TensorRT-LLM on a single B200, pulling FP8 weights from `nvidia/Llama-3.1-8B-Instruct-FP8` and an EAGLE3 draft model from `yuhuili/EAGLE3-LLaMA3.1-Instruct-8B`. Both weight sources reference the `hf_access_token` secret. The runtime is tuned for high concurrent throughput: up to 512 in-flight requests, chunked prefill, an FP8 KV cache with block reuse, and CUDA graphs captured up to the same 512-request batch ceiling.
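
The 512-request ceiling appears in three places in the config, and they are meant to agree. The following is a minimal sketch of a sanity check you could run before pushing. The helper names are hypothetical and not part of Truss, and the config subset is written as a Python dict to avoid a YAML dependency.

```python
# Subset of config.yaml, copied by hand into a dict for illustration.
config = {
    "trt_llm": {
        "runtime": {
            "max_batch_size": 512,
            "patch_kwargs": {"cuda_graph_config": {"max_batch_size": 512}},
        }
    },
    "runtime": {"predict_concurrency": 512},
}


def concurrency_limits(cfg: dict) -> dict:
    """Collect the three settings that cap in-flight requests (hypothetical helper)."""
    engine = cfg["trt_llm"]["runtime"]
    return {
        "trt_llm.runtime.max_batch_size": engine["max_batch_size"],
        "cuda_graph_config.max_batch_size": engine["patch_kwargs"]["cuda_graph_config"]["max_batch_size"],
        "runtime.predict_concurrency": cfg["runtime"]["predict_concurrency"],
    }


def limits_agree(cfg: dict) -> bool:
    """True when all three ceilings hold the same value."""
    return len(set(concurrency_limits(cfg).values())) == 1


print(limits_agree(config))  # True
```

If you change the batch ceiling for your own workload, a check like this flags the case where one of the three values was left behind.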

## Key parameters

[Baseten Inference Stack](/engines/bis-llm/overview) (BIS) reads these fields from the `trt_llm` block. `inference_stack` sits directly under `trt_llm`; the others sit under `trt_llm.runtime`:

| Parameter            | Config key               | Value                              |
| -------------------- | ------------------------ | ---------------------------------- |
| Tensor parallel size | `tensor_parallel_size`   | `1`                                |
| Max sequence length  | `max_seq_len`            | `131072`                           |
| Max batch size       | `max_batch_size`         | `512`                              |
| Max batched tokens   | `max_num_tokens`         | `16384`                            |
| Chunked prefill      | `enable_chunked_prefill` | `true`                             |
| Inference stack      | `inference_stack`        | `v2`                               |
| Served model name    | `served_model_name`      | `nvidia/Llama-3.1-8B-Instruct-FP8` |
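
As a rough illustration of how `max_num_tokens` and chunked prefill interact, the sketch below computes the minimum number of prefill passes a single prompt needs. It assumes each scheduling step processes at most `max_num_tokens` tokens in total and that the prompt has the whole budget to itself. The real scheduler shares that budget across requests, so treat the result as a lower bound.

```python
import math

MAX_NUM_TOKENS = 16384  # max_num_tokens from the config
MAX_SEQ_LEN = 131072    # max_seq_len from the config


def min_prefill_chunks(prompt_tokens: int, budget: int = MAX_NUM_TOKENS) -> int:
    """Lower bound on prefill passes for one prompt under a per-step token budget."""
    if not 1 <= prompt_tokens <= MAX_SEQ_LEN:
        raise ValueError(f"prompt_tokens must be between 1 and {MAX_SEQ_LEN}")
    return math.ceil(prompt_tokens / budget)


print(min_prefill_chunks(2_000))    # 1
print(min_prefill_chunks(20_000))   # 2
print(min_prefill_chunks(131_072))  # 8
```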

## Deploy

Push the config to Baseten:

```sh theme={"system"}
uvx truss push
```

You should see output similar to:

```text theme={"system"}
✨ Model llama-3.1-8b-instruct-throughput was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
```

Your **model ID** is the string after `/models/` in the logs URL (`abcd1234` in the example). Use it wherever you see `{model_id}` in the next section.
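
If you script your deployments, the same extraction can be done in code. This is a minimal sketch that assumes the logs URL always has the shape shown in the example output; both helper names are hypothetical, and the endpoint format is the one used in the next section.

```python
from urllib.parse import urlparse


def model_id_from_logs_url(logs_url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    parts = urlparse(logs_url).path.strip("/").split("/")
    # Expected path shape: models/<model_id>/logs/<...>
    if len(parts) < 2 or parts[0] != "models" or not parts[1]:
        raise ValueError(f"no model ID found in {logs_url!r}")
    return parts[1]


def base_url_for(model_id: str) -> str:
    """Fill the model ID into the production endpoint used in this guide."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
model_id = model_id_from_logs_url(logs_url)
print(model_id)                # abcd1234
print(base_url_for(model_id))  # https://model-abcd1234.api.baseten.co/environments/production/sync/v1
```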

## Call the model

Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID and make sure the `BASETEN_API_KEY` environment variable is set to your Baseten API key.

Now call your deployment to run inference:

<Tabs>
  <Tab title="Python">
    ```python main.py theme={"system"}
    import os
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["BASETEN_API_KEY"],
        base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
    )

    response = client.chat.completions.create(
        model="nvidia/Llama-3.1-8B-Instruct-FP8",
        messages=[
            {"role": "user", "content": "What is machine learning?"}
        ],
    )

    print(response.choices[0].message.content)
    ```
  </Tab>

  <Tab title="cURL">
    ```sh theme={"system"}
    curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $BASETEN_API_KEY" \
      -d '{
        "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
        "messages": [
          {"role": "user", "content": "What is machine learning?"}
        ]
      }'
    ```
  </Tab>
</Tabs>
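
The example input in `config.yaml` sets `stream: true`, so you may want tokens as they are generated rather than one final message. The following is a minimal sketch of streaming with the OpenAI SDK. `stream_text` is a hypothetical helper, and it expects a `client` built the same way as in the Python tab above.

```python
from typing import Iterator


def stream_text(client, prompt: str, max_tokens: int = 512) -> Iterator[str]:
    """Yield text fragments from a streamed chat completion as they arrive."""
    stream = client.chat.completions.create(
        model="nvidia/Llama-3.1-8B-Instruct-FP8",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content


# Usage, with `client` created as in the Python tab:
#
#     for piece in stream_text(client, "Write FizzBuzz in Python"):
#         print(piece, end="", flush=True)
#     print()
```

Taking the client as an argument keeps the helper independent of how the API key and base URL are supplied.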
