Setup
To get started, sign into Baseten with Truss and then install the OpenAI SDK.
Sign in to Baseten:
uvx truss login --browser
nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 is a 120B-parameter MoE model with 12B active per token.
This preset serves Nemotron 3 Super 120B A12B on B200:4 through the Baseten Inference Stack (TensorRT-LLM) with NVFP4 weights, attention data parallelism, expert parallelism, and MTP speculative decoding. It targets high-throughput reasoning.
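As a rough back-of-the-envelope check on what those parameter counts imply, the sketch below estimates the weight footprint and the fraction of parameters active per token. It assumes every weight is stored at exactly 4 bits and ignores quantization scale factors, any layers kept at higher precision, the KV cache, and activations, so treat the result as a lower bound rather than a measured figure.

```python
# Rough sizing sketch; the 4-bits-per-weight figure is a simplifying assumption.
TOTAL_PARAMS = 120e9     # total parameters (from the model description)
ACTIVE_PARAMS = 12e9     # parameters active per token (from the model description)
BITS_PER_WEIGHT = 4      # assumption: all weights stored as 4-bit values


def weight_gigabytes(num_params: float, bits_per_weight: float) -> float:
    """Return the approximate weight size in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


weights_gb = weight_gigabytes(TOTAL_PARAMS, BITS_PER_WEIGHT)
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"approx. weight footprint: {weights_gb:.0f} GB")
print(f"active per token: {active_fraction:.0%} of parameters")
```

The gap between total and active parameters is what makes a MoE model attractive for throughput: all experts must be resident in GPU memory, but each token only pays the compute cost of the active subset.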
Write the config
Create and move into the project directory:
mkdir nemotron-3-super-120b-a12b-throughput && cd nemotron-3-super-120b-a12b-throughput
Then create a file named config.yaml and paste the following:
model_name: model:nemotron-3-super-120b-a12b preset:throughput
model_metadata:
  example_model_input:
    model: "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"
    max_tokens: 512
    messages:
      - role: user
        content: Tell me everything you know about optimized inference.
    stream: true
    temperature: 0.5
  tags:
    - openai-compatible
resources:
  accelerator: B200:4
  cpu: "1"
  memory: 10Gi
  use_gpu: true
environment_variables:
  PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"
  TRTLLM_ENABLE_PDL: "1"
  BAD_TOKEN_ID_SEQ_CHECK_ENABLED: "1"
  ENABLE_B10_LOOKAHEAD: "0"
secrets:
  hf_access_token: null
trt_llm:
  inference_stack: v2
  build:
    checkpoint_repository:
      repo: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
      revision: main
      source: HF
      runtime_secret_name: hf_access_token
  runtime:
    enable_chunked_prefill: true
    max_batch_size: 32
    max_num_tokens: 16384
    max_seq_len: 131072
    tensor_parallel_size: 4
    served_model_name: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
    patch_kwargs:
      reasoning_parser: nemotron3
      tool_call_parser: qwen3_coder
      tokenizer_limit_length: 131072
      arguments_as_json: true
      engine_config:
        backend: pytorch
        enable_chunked_prefill: true
        enable_iter_perf_stats: true
        max_batch_size: 32
        max_beam_width: 1
        max_input_len: 131072
        max_num_tokens: 16384
        max_seq_len: 131072
        trust_remote_code: true
        moe_expert_parallel_size: 4
        cuda_graph_config:
          enable_padding: true
          max_batch_size: 32
        kv_cache_config:
          dtype: fp8
          enable_block_reuse: false
          free_gpu_memory_fraction: 0.8
          mamba_ssm_cache_dtype: float32
        moe_config:
          backend: TRTLLM
        speculative_config:
          decoding_type: MTP
          num_nextn_predict_layers: 3
          allow_advanced_sampling: true
This config tells Baseten to compile a TensorRT-LLM engine for Nemotron 3 Super 120B A12B on four B200 GPUs with NVFP4-quantized weights, wiring the Qwen3-coder tool-call parser and the Nemotron 3 reasoning parser into the engine config. Attention data parallelism, expert parallelism across all four GPUs, MTP speculative decoding with three draft tokens, and chunked prefill together target high reasoning throughput within the runtime’s 131,072-token sequence cap.
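Because the GPU count and the sequence limits each appear in several places in the config, it can help to cross-check them before pushing. The sketch below does that on values copied by hand from config.yaml. The `check_config` helper and the rules it applies (tensor parallel size equal to the GPU count, expert parallel size no larger than the GPU count, input length within the sequence length) are illustrative assumptions, not validation that Baseten or TensorRT-LLM is known to perform.

```python
# Consistency sketch over values copied from config.yaml.
# check_config is a hypothetical helper; its rules are assumptions.
config = {
    "accelerator": "B200:4",
    "tensor_parallel_size": 4,
    "moe_expert_parallel_size": 4,
    "max_seq_len": 131072,
    "max_input_len": 131072,
    "max_num_tokens": 16384,
    "enable_chunked_prefill": True,
}


def check_config(cfg: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []

    # "B200:4" -> ("B200", ":", "4"); a bare "B200" is treated as one GPU.
    _gpu_type, _, count = cfg["accelerator"].partition(":")
    gpu_count = int(count) if count else 1

    if cfg["tensor_parallel_size"] != gpu_count:
        problems.append(
            f"tensor_parallel_size={cfg['tensor_parallel_size']} "
            f"but accelerator provides {gpu_count} GPU(s)"
        )
    if cfg["moe_expert_parallel_size"] > gpu_count:
        problems.append(
            f"moe_expert_parallel_size={cfg['moe_expert_parallel_size']} "
            f"exceeds {gpu_count} GPU(s)"
        )
    if cfg["max_input_len"] > cfg["max_seq_len"]:
        problems.append("max_input_len is larger than max_seq_len")
    # Without chunked prefill, a full-length prompt must fit in one step.
    if not cfg["enable_chunked_prefill"] and cfg["max_num_tokens"] < cfg["max_input_len"]:
        problems.append("max_num_tokens < max_input_len with chunked prefill disabled")

    return problems


print(check_config(config))  # prints []
```

Changing `accelerator` to a smaller GPU count without also lowering the two parallelism sizes would be flagged by the first two checks.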
Key parameters
Baseten Inference Stack (BIS) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:
| Parameter | Value |
|---|---|
| Tensor parallel size | 4 |
| Max sequence length | 131072 |
| Max batch size | 32 |
| Max batched tokens | 16384 |
| Chunked prefill | enabled |
| Inference stack | v2 |
| Served model name | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 |
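To see how the token budget and chunked prefill interact, the short calculation below counts how many prefill passes a long prompt would need. It assumes, as a simplification, that each scheduling step processes at most `max_num_tokens` prompt tokens and that the prompt is the only request in flight; the real scheduler also interleaves decode tokens from other requests, so actual step counts can be higher.

```python
import math

# Values taken from the table above.
MAX_SEQ_LEN = 131072      # max sequence length (prompt plus generated tokens)
MAX_NUM_TOKENS = 16384    # max batched tokens per step


def prefill_chunks(prompt_tokens: int, max_num_tokens: int = MAX_NUM_TOKENS) -> int:
    """Number of prefill passes needed under the simplified model above."""
    return math.ceil(prompt_tokens / max_num_tokens)


print(prefill_chunks(20_000))       # prints 2
print(prefill_chunks(MAX_SEQ_LEN))  # prints 8
```

Under this model, a prompt that fits in one step is prefilled in a single pass, while a prompt near the sequence limit is spread across eight passes instead of being rejected for exceeding the per-step budget.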
Deploy
Push the config to Baseten from the project directory:
uvx truss push
You should see output similar to:
✨ Model nemotron-3-super-120b-a12b-throughput was successfully pushed ✨
🪵 View logs for your deployment at https://app.baseten.co/models/abcd1234/logs/wxyz5678
Your model ID is the string after /models/ in the logs URL (abcd1234 in the example). Use it wherever you see {model_id} in the next section.
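If you capture the push output in a script, the model ID can be pulled out of the logs URL programmatically. The helper below is a small sketch; `model_id_from_logs_url` is a hypothetical name, and it assumes the URL keeps the `/models/<id>/logs/...` shape shown in the example output.

```python
import re
from urllib.parse import urlparse


def model_id_from_logs_url(url: str) -> str:
    """Return the path segment that follows /models/ in a logs URL."""
    path = urlparse(url).path
    match = re.search(r"/models/([^/]+)", path)
    if match is None:
        raise ValueError(f"no /models/<id> segment in {url!r}")
    return match.group(1)


logs_url = "https://app.baseten.co/models/abcd1234/logs/wxyz5678"
print(model_id_from_logs_url(logs_url))  # prints abcd1234
```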
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

The same request with curl:
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
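For environments where the OpenAI SDK is not installed, the same call can be sketched with only the Python standard library. The example below fills in `{model_id}` explicitly and mirrors the URL, headers, and body of the curl request. The `build_chat_request` and `send` helpers are hypothetical names, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` shape.

```python
import json
import urllib.request

MODEL = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"


def build_chat_request(model_id: str, api_key: str, prompt: str,
                       max_tokens: int = 512) -> urllib.request.Request:
    """Build (but do not send) a chat completions request for a deployment."""
    url = (
        f"https://model-{model_id}.api.baseten.co"
        "/environments/production/sync/v1/chat/completions"
    )
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Api-Key {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    # Assumption: OpenAI-compatible response shape.
    return body["choices"][0]["message"]["content"]


# Usage (needs a live deployment and BASETEN_API_KEY in the environment):
#   import os
#   req = build_chat_request("abcd1234", os.environ["BASETEN_API_KEY"],
#                            "What is machine learning?")
#   print(send(req))
```

Separating request construction from sending keeps the URL and header logic easy to inspect without making a network call.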