# Holo 3.1

> H Company's Holo 3.1 is a 35B-parameter MoE vision-language model with 3B active parameters per token, built on the Qwen3.6-35B-A3B base. It accepts image input and returns reasoning output and native tool calls.

<div className="capability-pills">
  <a href="/examples/models/capabilities/agentic" className="capability-pill">Agentic</a>
  <a href="/examples/models/capabilities/multimodal-image" className="capability-pill">Multimodal (image)</a>
  <a href="/examples/models/capabilities/reasoning" className="capability-pill">Reasoning</a>
  <a href="/examples/models/capabilities/tool-calling" className="capability-pill">Tool calling</a>
  <a href="/examples/models/capabilities/long-context" className="capability-pill">Long context</a>
</div>

## Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.

<Columns cols={2}>
  <Column>
    **Sign in to Baseten**

    ```sh theme={"system"}
    uvx truss login --browser
    ```
  </Column>

  <Column>
    **Install the OpenAI SDK**

    ```sh theme={"system"}
    uv pip install openai
    ```
  </Column>
</Columns>

[Hcompany/Holo-3.1-35B-A3B-FP8](https://huggingface.co/Hcompany/Holo-3.1-35B-A3B-FP8) is a 35B-parameter MoE model (3B active per token) with up to 256K context.

This preset serves Holo 3.1 35B-A3B on H100 GPUs with FP8 weights, optimized for high-concurrency throughput on agent and vision workloads.

<CardGroup cols={4}>
  <Card title="Hardware" icon="microchip">H100</Card>
  <Card title="Engine" icon="server">vLLM (0.20.2-cu129 build)</Card>
  <Card title="Context" icon="ruler-horizontal">256K</Card>
  <Card title="Concurrency" icon="layer-group">1000</Card>
</CardGroup>

## Write the config

Create and move into the project directory:

```sh theme={"system"}
mkdir holo-3.1-35b-a3b-throughput && cd holo-3.1-35b-a3b-throughput
```

Then create a file named `config.yaml` and paste the following:

```yaml config.yaml theme={"system"}
model_name: "model:holo-3.1-35b-a3b preset:throughput"
model_metadata:
  description: >-
    Holo-3.1-35B-A3B (FP8), H Company's computer-use / GUI-agent vision-language
    model built on the Qwen3.6-35B-A3B MoE base. OpenAI-compatible multimodal chat
    with image input and native function calling, served via vLLM.
  repo_id: Hcompany/Holo-3.1-35B-A3B-FP8
  example_model_input:
    model: Hcompany/Holo-3.1-35B-A3B-FP8
    messages:
      - role: user
        content:
          - type: text
            text: "Describe this image in one sentence."
          - type: image_url
            image_url:
              url: "https://picsum.photos/id/237/200/300"
    stream: true
    max_tokens: 512
    temperature: 1.0
  tags:
    - openai-compatible
    - vllm
    - holo3.1
    - fp8
    - h100
    - multimodal
base_image:
  image: vllm/vllm-openai:v0.20.2-cu129
weights:
  - source: "hf://Hcompany/Holo-3.1-35B-A3B-FP8@main"
    mount_location: "/app/checkpoint/model"
    auth_secret_name: "hf_access_token"
build_commands: []
environment_variables:
  PYTORCH_ALLOC_CONF: "expandable_segments:True"
  VLLM_LOGGING_LEVEL: WARNING
  VLLM_ENGINE_READY_TIMEOUT_S: "3600"
docker_server:
  start_command: >-
    sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/model
    --tensor-parallel-size $GPU_COUNT
    --served-model-name Hcompany/Holo-3.1-35B-A3B-FP8
    --host 0.0.0.0
    --port 8000
    --gpu-memory-utilization 0.95
    --max-model-len 262144
    --max-num-batched-tokens 32768
    --dtype auto
    --enable-chunked-prefill
    --enable-prefix-caching
    --max-num-seqs 512
    --limit-mm-per-prompt.image 2
    --reasoning-parser qwen3
    --enable-auto-tool-choice
    --tool-call-parser qwen3_coder
    --trust-remote-code"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
runtime:
  predict_concurrency: 1000
  health_checks:
    restart_check_delay_seconds: 1500
    restart_threshold_seconds: 1500
    stop_traffic_threshold_seconds: 120
resources:
  accelerator: H100
  use_gpu: true
secrets:
  hf_access_token: null
```

## Flags

The `start_command` passes these flags to vLLM. Each one sets a batching, memory, or response-parsing behavior:

| Flag                          | Value         | What it does                                                                                                                                                |
| ----------------------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `--tensor-parallel-size`      | `$GPU_COUNT`  | Number of GPUs to shard the model across. The start command sets `$GPU_COUNT` to the number of GPUs that `nvidia-smi --list-gpus` reports.                 |
| `--gpu-memory-utilization`    | `0.95`        | Fraction of GPU memory vLLM may use for weights and KV cache.                                                                                               |
| `--max-model-len`             | `262144`      | Maximum context length (tokens) the server accepts per request.                                                                                             |
| `--max-num-batched-tokens`    | `32768`       | Maximum total tokens processed per scheduler step.                                                                                                          |
| `--dtype`                     | `auto`        | Weight precision loaded at runtime. **auto:** Match the model's checkpoint dtype (default).                                                                 |
| `--enable-chunked-prefill`    | (no value)    | Process long prompts in chunks so decode requests keep running.                                                                                             |
| `--enable-prefix-caching`     | (no value)    | Reuse KV cache across requests that share a prefix.                                                                                                         |
| `--max-num-seqs`              | `512`         | Maximum number of concurrent sequences in the batch.                                                                                                        |
| `--limit-mm-per-prompt.image` | `2`           | Maximum number of image inputs per prompt.                                                                                                                  |
| `--reasoning-parser`          | `qwen3`       | Server-side parser that separates reasoning output into `reasoning_content`. **qwen3:** Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
| `--enable-auto-tool-choice`   | (no value)    | Let the model choose when to call tools without requiring `tool_choice: "required"`.                                                                        |
| `--tool-call-parser`          | `qwen3_coder` | Server-side parser that emits structured `tool_calls` on the response. **qwen3\_coder:** Qwen3-Coder tool format.                                           |
| `--trust-remote-code`         | (no value)    | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).                                                  |
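
To make the batching flags concrete, here is a rough back-of-the-envelope sketch in Python. It assumes one request maps to one sequence, that a prompt gets the whole per-step token budget to itself, and that there are no prefix-cache hits. None of that holds exactly under real load, so treat the numbers as lower bounds rather than measurements. The function names are illustrative and are not part of vLLM.

```python
import math

# Values copied from config.yaml above.
MAX_MODEL_LEN = 262_144           # --max-model-len
MAX_NUM_BATCHED_TOKENS = 32_768   # --max-num-batched-tokens
MAX_NUM_SEQS = 512                # --max-num-seqs
PREDICT_CONCURRENCY = 1000        # runtime.predict_concurrency


def min_prefill_steps(prompt_tokens: int, budget: int = MAX_NUM_BATCHED_TOKENS) -> int:
    """Lower bound on scheduler steps needed to prefill one prompt in chunks."""
    if not 0 <= prompt_tokens <= MAX_MODEL_LEN:
        raise ValueError(f"prompt_tokens must be between 0 and {MAX_MODEL_LEN}")
    return math.ceil(prompt_tokens / budget)


def waiting_requests(in_flight: int, max_num_seqs: int = MAX_NUM_SEQS) -> int:
    """Requests that would wait in the engine queue instead of joining the batch."""
    return max(0, in_flight - max_num_seqs)


print(min_prefill_steps(MAX_MODEL_LEN))       # 8
print(waiting_requests(PREDICT_CONCURRENCY))  # 488
```

Under these assumptions, a full-length prompt needs at least 8 scheduler steps to prefill, and at the maximum `predict_concurrency` roughly 488 requests would queue inside the engine while 512 are batched.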

## Deploy

Push the config to Baseten:

```sh theme={"system"}
uvx truss push
```

You should see output similar to:

```output theme={"system"}
✨ Model holo-3.1-35b-a3b-throughput was successfully pushed ✨

   Model ID:      abc1d2ef
   Deployment ID: xyz123
   Endpoint:      model-abc1d2ef.api.baseten.co
   Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123
```

Your **model ID** is printed in the `truss push` output (`abc1d2ef` in the example). Use it wherever you see `{model_id}` in the next section.
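
If you prefer not to edit the URL by hand, a small helper can fill in the placeholder. This is only a sketch: `BASETEN_MODEL_ID` is a hypothetical environment variable name chosen for this example, and the fallback value is the sample ID from the output above.

```python
import os


def baseten_base_url(model_id: str) -> str:
    """Build the OpenAI-compatible base URL for the production environment."""
    if not model_id or "{" in model_id:
        raise ValueError("Pass a real model ID, not the {model_id} placeholder.")
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


# BASETEN_MODEL_ID is a hypothetical variable name used only in this sketch.
model_id = os.environ.get("BASETEN_MODEL_ID", "abc1d2ef")
print(baseten_base_url(model_id))
```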

## Call the model

Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.

Now call your deployment to run inference:

<Tabs>
  <Tab title="Python">
    ```python main.py theme={"system"}
    import os
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["BASETEN_API_KEY"],
        base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
    )

    response = client.chat.completions.create(
        model="Hcompany/Holo-3.1-35B-A3B-FP8",
        messages=[
            {"role": "user", "content": "What is machine learning?"}
        ],
    )

    print(response.choices[0].message.content)
    ```
  </Tab>

  <Tab title="cURL">
    ```sh theme={"system"}
    curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $BASETEN_API_KEY" \
      -d '{
        "model": "Hcompany/Holo-3.1-35B-A3B-FP8",
        "messages": [
          {"role": "user", "content": "What is machine learning?"}
        ]
      }'
    ```
  </Tab>
</Tabs>
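
Because Holo 3.1 accepts image input, a request can also carry `image_url` content parts, in the same shape as `example_model_input` in `config.yaml`. The sketch below builds such a message and caps it at two images to match `--limit-mm-per-prompt.image`. The helper names are illustrative. Sending local screenshots as base64 data URLs is assumed to work through the OpenAI-compatible endpoint, so verify that against your deployment.

```python
import base64

MODEL = "Hcompany/Holo-3.1-35B-A3B-FP8"
MAX_IMAGES = 2  # mirrors --limit-mm-per-prompt.image in config.yaml


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes (for example a screenshot) as a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_message(text: str, image_urls: list[str]) -> dict:
    """Build one user message: a text part followed by image parts."""
    if len(image_urls) > MAX_IMAGES:
        raise ValueError(f"This preset accepts at most {MAX_IMAGES} images per prompt.")
    content = [{"type": "text", "text": text}]
    content += [{"type": "image_url", "image_url": {"url": url}} for url in image_urls]
    return {"role": "user", "content": content}


def describe_image(client, image_url: str) -> str:
    """Send one image to the deployment; `client` is the OpenAI client from main.py."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[build_image_message("Describe this image in one sentence.", [image_url])],
        max_tokens=512,
    )
    return response.choices[0].message.content
```

With the `client` from `main.py`, `describe_image(client, "https://picsum.photos/id/237/200/300")` sends the same sample image that the config uses.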

To access the model's chain of thought, enable thinking mode. The server parses the reasoning output into a separate `reasoning_content` field on the response:

```python theme={"system"}
# Reuses the `client` created in main.py above.
response = client.chat.completions.create(
    model="Hcompany/Holo-3.1-35B-A3B-FP8",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)            # final answer
```
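
The `reasoning_content` field may be missing or `None`, for example when a request does not enable thinking, and in that case direct attribute access can raise. A defensive way to read both fields might look like the sketch below. `split_reasoning` is a hypothetical helper, and the `SimpleNamespace` objects only stand in for `response.choices[0].message`.

```python
from types import SimpleNamespace


def split_reasoning(message) -> tuple[str, str]:
    """Return (reasoning, answer), using empty strings for missing parts."""
    reasoning = getattr(message, "reasoning_content", None) or ""
    answer = getattr(message, "content", None) or ""
    return reasoning.strip(), answer.strip()


# Stand-in objects shaped like response.choices[0].message, for illustration only.
with_thinking = SimpleNamespace(reasoning_content="Count each letter.\n", content="Three.")
without_thinking = SimpleNamespace(content="Three.")

print(split_reasoning(with_thinking))     # ('Count each letter.', 'Three.')
print(split_reasoning(without_thinking))  # ('', 'Three.')
```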

To let the model call tools, pass a `tools` array. The server returns structured `tool_calls` on the response:

```python theme={"system"}
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

# Reuses the `client` created in main.py above.
response = client.chat.completions.create(
    model="Hcompany/Holo-3.1-35B-A3B-FP8",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
# `tool_calls` is None when the model answers without calling a tool.
print(response.choices[0].message.tool_calls)
```
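
The response only names the tool and its arguments. Running the tool and returning the result is up to the caller. Here is a minimal sketch of that step, assuming the usual OpenAI-style `tool` role messages. `get_weather`, `TOOL_REGISTRY`, and `run_tool_calls` are hypothetical names for illustration, and the weather data is canned.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> dict:
    """Hypothetical local implementation; returns canned data for the sketch."""
    return {"location": location, "forecast": "unknown"}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls) -> list[dict]:
    """Execute each tool call and return `tool` role messages with the results."""
    results = []
    for call in tool_calls or []:
        handler = TOOL_REGISTRY.get(call.function.name)
        if handler is None:
            output = {"error": f"unknown tool {call.function.name}"}
        else:
            try:
                # Arguments arrive as a JSON string, not a dict.
                output = handler(**json.loads(call.function.arguments))
            except (json.JSONDecodeError, TypeError) as exc:
                output = {"error": str(exc)}
        results.append(
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(output)}
        )
    return results


# Stand-in shaped like one entry of response.choices[0].message.tool_calls.
fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
)
print(run_tool_calls([fake_call]))
```

To continue the conversation, append the assistant message that contained the tool calls and then these `tool` messages to `messages`, and call `client.chat.completions.create` again with the same `tools` array.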
