Laguna

Agentic, Reasoning, Tool calling, Long context

Poolside’s Laguna M.1 is a Mixture-of-Experts reasoning model tuned for agentic coding and extended reasoning, served from an FP8 checkpoint.

Setup

Sign in to Baseten

uvx truss login --browser

Install the OpenAI SDK

uv pip install openai

This preset serves Laguna M.1 on H100:4 with FP8 weights, optimized for low time-to-first-token on interactive reasoning and coding workloads.

Hardware

H100 × 4

Engine

vLLM 0.21.0

Context

256K

Concurrency

64

Write the config

Create and move into the project directory:

mkdir laguna-m.1-latency && cd laguna-m.1-latency

Then create a file named config.yaml and paste the following:

config.yaml

model_name: "model:laguna-m.1 preset:latency"

model_metadata:
  description: >-
    Laguna M.1 FP8 MoE reasoning model from Poolside, served with vLLM (H100 TP=4),
    OpenAI-compatible chat with tool calling and extended reasoning support.
    Latency-optimized: low max-num-seqs to minimize head-of-line blocking from long thinking traces.
  repo_id: poolside/Laguna-M.1-FP8
  trust_remote_code: true
  tags:
    - openai-compatible
    - vllm
    - moe
    - reasoning
    - agentic-coding
    - fp8
  example_model_input:
    model: poolside/laguna-m.1
    messages:
      - role: user
        content: "Write a Python retry wrapper with exponential backoff."
    stream: true
    temperature: 1.0
    top_k: 20

# ---------------------------------------------------------------------------
# Base image — vLLM with Laguna support (requires vLLM >= 0.21.0)
# ---------------------------------------------------------------------------
base_image:
  image: vllm/vllm-openai:v0.21.0
  python_executable_path: /usr/bin/python3

# ---------------------------------------------------------------------------
# Weights — FP8 quantized checkpoint (~225 GB, fits in 4× H100 / 320 GB)
# Quantization is detected automatically from the checkpoint's
# quantization_config — no extra vLLM flags needed.
# ---------------------------------------------------------------------------
weights:
  - source: "hf://poolside/Laguna-M.1-FP8"
    mount_location: "/models/laguna-m1"

environment_variables:
  VLLM_LOGGING_LEVEL: WARNING
  VLLM_ENGINE_READY_TIMEOUT_S: "3600"

# ---------------------------------------------------------------------------
# Docker server — vLLM OpenAI-compatible endpoint
# ---------------------------------------------------------------------------
docker_server:
  start_command: >
    vllm serve /models/laguna-m1
    --served-model-name poolside/laguna-m.1
    --host 0.0.0.0
    --port 8000
    --tool-call-parser poolside_v1
    --reasoning-parser poolside_v1
    --enable-auto-tool-choice
    --default-chat-template-kwargs '{"enable_thinking": true}'
    --tensor-parallel-size 4
    --max-model-len 262144
    --max-num-seqs 64
    --gpu-memory-utilization 0.95
    --trust-remote-code
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000

# ---------------------------------------------------------------------------
# Resources
# FP8 ~225 GB → 4× H100 (320 GB total VRAM) with comfortable headroom
# ---------------------------------------------------------------------------
resources:
  accelerator: H100:4
  cpu: "8"
  memory: 32Gi
  use_gpu: true

# ---------------------------------------------------------------------------
# Runtime
# ---------------------------------------------------------------------------
runtime:
  predict_concurrency: 64
  health_checks:
    restart_check_delay_seconds: 1800
    restart_threshold_seconds: 600
    stop_traffic_threshold_seconds: 180
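
The config sets a concurrency limit in two places: `--max-num-seqs` on the vLLM command line and `predict_concurrency` under `runtime`. The sketch below is one way to check that they agree before deploying. It assumes the start command is pasted in as a string, and `parse_flags` is a hypothetical helper, not part of Truss or vLLM. Treating the two values as something to keep equal is also an assumption drawn from this preset, where both are 64.

```python
import json
import shlex

# Copied from the start_command in config.yaml.
START_COMMAND = """
vllm serve /models/laguna-m1
--served-model-name poolside/laguna-m.1
--host 0.0.0.0
--port 8000
--tool-call-parser poolside_v1
--reasoning-parser poolside_v1
--enable-auto-tool-choice
--default-chat-template-kwargs '{"enable_thinking": true}'
--tensor-parallel-size 4
--max-model-len 262144
--max-num-seqs 64
--gpu-memory-utilization 0.95
--trust-remote-code
"""

PREDICT_CONCURRENCY = 64  # runtime.predict_concurrency in config.yaml


def parse_flags(command: str) -> dict:
    """Map each --flag to its value, or to True for switches with no value."""
    tokens = shlex.split(command)
    flags = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("--"):
            has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("--")
            flags[token] = tokens[i + 1] if has_value else True
            i += 2 if has_value else 1
        else:
            i += 1  # positional tokens: "vllm", "serve", and the model path
    return flags


flags = parse_flags(START_COMMAND)

# Assumption: the engine batch limit and the Truss concurrency limit should match.
print("limits agree:", int(flags["--max-num-seqs"]) == PREDICT_CONCURRENCY)

# The chat template kwargs must be valid JSON once the shell quotes are removed.
print("template kwargs:", json.loads(flags["--default-chat-template-kwargs"]))
```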

Flags

The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

Flag	Value	What it does
`--tool-call-parser`	`poolside_v1`	Server-side parser that emits structured `tool_calls` on the response.
`--reasoning-parser`	`poolside_v1`	Server-side parser that separates reasoning output into `reasoning_content`.
`--enable-auto-tool-choice`	(no value)	Let the model choose when to call tools without requiring `tool_choice: "required"`.
`--default-chat-template-kwargs`	`{"enable_thinking": true}`	Default keyword arguments passed to the chat template. Here, `enable_thinking: true` turns reasoning on by default.
`--tensor-parallel-size`	`4`	Number of GPUs to shard the model across.
`--max-model-len`	`262144`	Maximum context length (tokens) the server accepts per request.
`--max-num-seqs`	`64`	Maximum number of concurrent sequences in the batch.
`--gpu-memory-utilization`	`0.95`	Fraction of GPU memory vLLM may use for weights and KV cache.
`--trust-remote-code`	(no value)	Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).
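
Taken together, `--tensor-parallel-size`, `--gpu-memory-utilization`, and the approximate checkpoint size in the config comments imply a rough memory budget. The arithmetic below is a back-of-the-envelope sketch, assuming 80 GB per H100 (which matches the 320 GB total quoted in the config). It ignores activation memory and other runtime overhead, so the last figure is an upper bound on what is left for KV cache, not a measurement.

```python
GPU_COUNT = 4                  # --tensor-parallel-size
VRAM_PER_GPU_GB = 80           # assumption: 80 GB H100, giving the 320 GB total in the config
GPU_MEMORY_UTILIZATION = 0.95  # --gpu-memory-utilization
WEIGHTS_GB = 225               # approximate FP8 checkpoint size from the config comments

total_vram_gb = GPU_COUNT * VRAM_PER_GPU_GB
vllm_budget_gb = total_vram_gb * GPU_MEMORY_UTILIZATION
remaining_gb = vllm_budget_gb - WEIGHTS_GB

print(f"Total VRAM:             {total_vram_gb} GB")
print(f"vLLM budget:            {vllm_budget_gb:.0f} GB")
print(f"Left after weights:     {remaining_gb:.0f} GB (upper bound for KV cache)")
print(f"Per GPU, after weights: {remaining_gb / GPU_COUNT:.2f} GB")
```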

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model laguna-m.1-latency was successfully pushed ✨

   Model ID:      abc1d2ef
   Deployment ID: xyz123
   Endpoint:      model-abc1d2ef.api.baseten.co
   Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123

truss push prints your model ID (abc1d2ef in the example). In the examples below, replace {model_id} with that ID and set the BASETEN_API_KEY environment variable to your API key.
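
Since {model_id} is a placeholder rather than something Python fills in for you, one option is to build the base URL from the ID in code. The helper below is a minimal sketch: `baseten_base_url` and the `BASETEN_MODEL_ID` variable name are hypothetical, and the `environment` parameter assumes that other environments follow the same URL pattern as `production`.

```python
import os


def baseten_base_url(model_id: str, environment: str = "production") -> str:
    """Build the OpenAI-compatible base URL for a deployed model."""
    return (
        f"https://model-{model_id}.api.baseten.co"
        f"/environments/{environment}/sync/v1"
    )


# BASETEN_MODEL_ID is a hypothetical variable name; the fallback is the
# example ID from the truss push output, not a real deployment.
model_id = os.environ.get("BASETEN_MODEL_ID", "abc1d2ef")
print(baseten_base_url(model_id))
```

The returned string can be passed as `base_url` when constructing the OpenAI client.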

Call the model

Your deployment serves an OpenAI-compatible API, so you can call it with the OpenAI SDK or with plain HTTP:

Python

main.py

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="poolside/laguna-m.1",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

cURL

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "poolside/laguna-m.1",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
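
The `example_model_input` in config.yaml also sets `temperature` and `top_k`. The OpenAI SDK has no `top_k` argument, so a common approach with vLLM servers is to send it through `extra_body`. The sketch below builds the request arguments only and makes no network call. `build_request` is a hypothetical helper, and the per-request `chat_template_kwargs` override is an assumption: it relies on vLLM accepting that field and on Laguna's chat template honoring `enable_thinking`.

```python
def build_request(prompt: str, thinking: bool = True) -> dict:
    """Return keyword arguments for client.chat.completions.create()."""
    return {
        "model": "poolside/laguna-m.1",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        # Fields the OpenAI SDK does not know about travel in extra_body.
        "extra_body": {
            "top_k": 20,
            # Assumption: overrides the server default set by
            # --default-chat-template-kwargs for this request only.
            "chat_template_kwargs": {"enable_thinking": thinking},
        },
    }


request = build_request("Write a Python retry wrapper with exponential backoff.")
print(request["extra_body"])
```

With the client from main.py, the call would be `client.chat.completions.create(**request)`.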

The server parses the model’s chain of thought into a separate reasoning_content field on the response. Read it alongside the final answer:

response = client.chat.completions.create(
    model="poolside/laguna-m.1",
    messages=[
        {"role": "user", "content": "How many r's in strawberry?"}
    ],
)
print(response.choices[0].message.reasoning_content)  # chain of thought
print(response.choices[0].message.content)            # final answer
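
When streaming (the `example_model_input` in config.yaml sets `stream: true`), reasoning and answer text arrive as separate deltas. The sketch below shows one way to reassemble them. It assumes each streamed delta may carry a `reasoning_content` attribute alongside `content`, mirroring the non-streaming field; `collect_stream` and `fake_chunk` are hypothetical names, and the demo chunks are hand-made stand-ins so the example runs without a deployment.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Split a chat completion stream into (reasoning, answer) strings."""
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some servers send chunks with no choices, e.g. usage only
        delta = chunk.choices[0].delta
        # reasoning_content is a server-side extension, so read it defensively.
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            answer_parts.append(content)
    return "".join(reasoning_parts), "".join(answer_parts)


def fake_chunk(**delta_fields):
    """Build a stand-in object shaped like a streamed chunk (demo only)."""
    delta = SimpleNamespace(**delta_fields)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_stream = [
    fake_chunk(reasoning_content="Count the letters. ", content=None),
    fake_chunk(reasoning_content="s-t-r-a-w-b-e-r-r-y.", content=None),
    fake_chunk(reasoning_content=None, content="Three."),
]

reasoning, answer = collect_stream(demo_stream)
print("reasoning:", reasoning)
print("answer:", answer)
```

Against a real deployment, the same function would take the iterator returned by `client.chat.completions.create(..., stream=True)`.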

To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="poolside/laguna-m.1",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
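
Printing `tool_calls` only shows what the model asked for; your code still has to run the tool and send the result back. The sketch below covers that step under the usual OpenAI chat conventions, where each call carries an `id`, a function `name`, and JSON-encoded `arguments`. `get_weather`, `TOOL_REGISTRY`, and `run_tool_calls` are hypothetical names, and the demo call is a hand-made stand-in for a real response.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> str:
    """Hypothetical stand-in; a real tool would query a weather service."""
    return f"Weather lookup for {location} is not implemented in this sketch."


# Maps tool names from the tools array to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls):
    """Execute each requested tool and return the matching 'tool' messages."""
    messages = []
    for call in tool_calls or []:  # tool_calls is None when no tool was requested
        handler = TOOL_REGISTRY[call.function.name]
        arguments = json.loads(call.function.arguments)  # arguments is a JSON string
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": handler(**arguments),
        })
    return messages


demo_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
)
tool_messages = run_tool_calls([demo_call])
print(tool_messages)
```

To continue the conversation, append the assistant message that contained the tool calls, then these tool messages, to `messages` and call the model again with the same `tools` array.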
