# Llama 4

> Meta's Llama 4 Scout is a mixture-of-experts (MoE) model with 17B active parameters, native multimodal support, and a 10M-token context window.

<div className="capability-pills">
  <a href="/examples/models/capabilities/tool-calling" className="capability-pill">Tool calling</a>
  <a href="/examples/models/capabilities/long-context" className="capability-pill">Long context</a>
  <a href="/examples/models/capabilities/multimodal-image" className="capability-pill">Multimodal (image)</a>
</div>

## Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.

<Columns cols={2}>
  <Column>
    **Sign in to Baseten**

    ```sh theme={"system"}
    uvx truss login --browser
    ```
  </Column>

  <Column>
    **Install the OpenAI SDK**

    ```sh theme={"system"}
    uv pip install openai
    ```
  </Column>
</Columns>

[meta-llama/Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct) is a 109B-parameter MoE model with 17B parameters active per token and a context window of up to 10M tokens.

This preset serves Llama 4 Scout on four H100 GPUs (`H100:4`) with a 128K-token serving context (`--max-model-len 131072`) and native multimodal support.

<CardGroup cols={4}>
  <Card title="Hardware" icon="microchip">H100 × 4</Card>
  <Card title="Engine" icon="server">vLLM (latest build)</Card>
  <Card title="Context" icon="ruler-horizontal">128K</Card>
  <Card title="Concurrency" icon="layer-group">256</Card>
</CardGroup>

## Write the config

Create and move into the project directory:

```sh theme={"system"}
mkdir llama-4-scout-latency && cd llama-4-scout-latency
```

Then create a file named `config.yaml` and paste the following:

```yaml config.yaml theme={"system"}
base_image:
  image: vllm/vllm-openai:latest
build_commands:
  - pip install hf-xet
model_metadata:
  repo_id: meta-llama/Llama-4-Scout-17B-16E-Instruct
  example_model_input: {
    "model": "llama",
    "messages": [
      {
      "role": "user",
      "content": "Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target. You may assume that each input would have exactly one solution, and you may not use the same element twice. You can return the answer in any order. class Solution: def twoSum(self, nums: List[int], target: int) -> List[int]:"
      }
    ],
    "stream": true,
    "max_tokens": 512,
    "temperature": 0.5
  }
  tags:
  - openai-compatible
docker_server:
  start_command: >-
    sh -c "HF_TOKEN=$(cat /secrets/hf_access_token) vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct
    --served-model-name llama
    --max-model-len 131072
    --tensor-parallel-size 4
    --distributed-executor-backend mp
    --gpu-memory-utilization 0.95
    --kv-cache-dtype fp8
    --limit-mm-per-prompt '{\"image\": 10}'
    --override-generation-config='{\"attn_temperature_tuning\": true}'
    --host 0.0.0.0
    --port 8000"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
environment_variables:
  hf_access_token: null
resources:
  accelerator: H100:4
  use_gpu: true
secrets:
  hf_access_token: null
runtime:
  predict_concurrency: 256
model_name: "model:llama-4-scout preset:latency"
```

## Flags

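The `start_command` passes these flags to `vllm serve`. `--served-model-name llama` sets the name that clients send in the `model` field, and `--host` and `--port` bind the server to the `server_port` declared under `docker_server`. The remaining flags control runtime and serving behavior: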

| Flag                             | Value                               | What it does                                                                                                         |
| -------------------------------- | ----------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| `--max-model-len`                | `131072`                            | Maximum context length (tokens) the server accepts per request.                                                      |
| `--tensor-parallel-size`         | `4`                                 | Number of GPUs to shard the model across.                                                                            |
| `--distributed-executor-backend` | `mp`                                | How vLLM coordinates tensor-parallel workers across processes. **mp:** Python multiprocessing (single-node default). |
| `--gpu-memory-utilization`       | `0.95`                              | Fraction of GPU memory vLLM may use for weights and KV cache.                                                        |
| `--kv-cache-dtype`               | `fp8`                               | KV cache numeric precision. **fp8:** \~2× KV cache density with negligible quality impact on most models.            |
| `--limit-mm-per-prompt`          | `{"image": 10}`                     | Max multimodal inputs accepted per prompt (JSON object keyed by modality).                                           |
| `--override-generation-config`   | `{"attn_temperature_tuning": true}` | JSON overrides applied on top of the model's default generation config.                                              |
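
To make the `--limit-mm-per-prompt` cap concrete, here is a minimal sketch of a user message that mixes text and images, assuming the OpenAI chat format for image parts (`image_url` content entries). The helper name `build_image_message` and the constant `MAX_IMAGES_PER_PROMPT` are hypothetical and exist only for illustration. The check covers a single message, while the server applies the limit to the whole prompt.

```python
from typing import Any

# Mirrors --limit-mm-per-prompt '{"image": 10}' from config.yaml.
# Hypothetical constant: keep it in sync with the flag by hand.
MAX_IMAGES_PER_PROMPT = 10


def build_image_message(prompt: str, image_urls: list[str]) -> dict[str, Any]:
    """Build one user message with a text part followed by image parts."""
    if len(image_urls) > MAX_IMAGES_PER_PROMPT:
        raise ValueError(
            f"{len(image_urls)} images exceeds the configured limit "
            f"of {MAX_IMAGES_PER_PROMPT}"
        )
    # The text part comes first, then one image_url part per image.
    content: list[dict[str, Any]] = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}


message = build_image_message(
    "What do these two charts have in common?",
    ["https://example.com/chart-a.png", "https://example.com/chart-b.png"],
)
print(len(message["content"]))  # 3: one text part plus two image parts
```

Once the model is deployed, a message built this way can be passed in the `messages` list of a chat completion request. Failing early on the client avoids a round trip for a request the server is configured to refuse.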

## Deploy

Push the config to Baseten:

```sh theme={"system"}
uvx truss push
```

You should see output similar to:

```text theme={"system"}
✨ Model llama-4-scout-latency was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
```

Your **model ID** is the string after `/models/` in the logs URL (`abcd1234` in the example). Use it wherever you see `{model_id}` in the next section.
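
If you script your deployments, the same extraction can be done in code. This is a minimal sketch, assuming the logs URL has the shape shown in the example output. The function names `model_id_from_logs_url` and `base_url_for` are hypothetical helpers, not part of Truss or the Baseten SDK.

```python
import re

# Captures the path segment that follows /models/ in a logs URL.
MODEL_ID_PATTERN = re.compile(r"/models/([^/?#]+)")


def model_id_from_logs_url(logs_url: str) -> str:
    """Return the model ID embedded in a deployment logs URL."""
    match = MODEL_ID_PATTERN.search(logs_url)
    if match is None:
        raise ValueError(f"no model ID found in {logs_url!r}")
    return match.group(1)


def base_url_for(model_id: str) -> str:
    """Build the OpenAI-compatible base URL for the production environment."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
model_id = model_id_from_logs_url(logs_url)
print(model_id)                # abcd1234
print(base_url_for(model_id))  # https://model-abcd1234.api.baseten.co/environments/production/sync/v1
```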

## Call the model

Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID, make sure the `BASETEN_API_KEY` environment variable is set, and pass `llama` as the `model` (the name set by `--served-model-name`). Then call your deployment to run inference:

<Tabs>
  <Tab title="Python">
    ```python main.py theme={"system"}
    import os
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["BASETEN_API_KEY"],
        base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
    )

    response = client.chat.completions.create(
        model="llama",
        messages=[
            {"role": "user", "content": "What is machine learning?"}
        ],
    )

    print(response.choices[0].message.content)
    ```
  </Tab>

  <Tab title="cURL">
    ```sh theme={"system"}
    curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $BASETEN_API_KEY" \
      -d '{
        "model": "llama",
        "messages": [
          {"role": "user", "content": "What is machine learning?"}
        ]
      }'
    ```
  </Tab>
</Tabs>
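
The `example_model_input` in `config.yaml` sets `"stream": true`, so a streaming variant of the call may be useful. The following is a sketch rather than a reference implementation: `stream_text` and `main` are hypothetical names, and the code assumes the OpenAI SDK's standard streaming interface, where each chunk carries an incremental `delta`.

```python
import os
from typing import Any, Iterator


def stream_text(client: Any, prompt: str, max_tokens: int = 512) -> Iterator[str]:
    """Yield text fragments from a streamed chat completion as they arrive."""
    stream = client.chat.completions.create(
        model="llama",  # matches --served-model-name in config.yaml
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=max_tokens,
    )
    for chunk in stream:
        # Some chunks have no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content


def main() -> None:
    # Imported here so stream_text can be exercised without the SDK installed.
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["BASETEN_API_KEY"],
        # Replace {model_id} with your model ID before running.
        base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
    )
    for piece in stream_text(client, "What is machine learning?"):
        print(piece, end="", flush=True)
    print()
```

Call `main()` after substituting your model ID. Printing each fragment with `flush=True` shows tokens as they are generated instead of waiting for the full response.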
