Structured output (JSON mode)
Enforce an output schema on LLM inference
Structured output requires an LLM deployed using the TensorRT-LLM Engine Builder.
If you want to try this structured output example code for yourself, deploy this implementation of Llama 3.1 8B.
To generate structured outputs:
- Define an object schema with Pydantic.
- Pass the schema to the LLM with the `response_format` argument.
- Receive output that is guaranteed to match the provided schema, including types and validations like `max_length`.
With structured output, you should observe tokens-per-second output speeds approximately equivalent to an ordinary call to the model, after an initial delay for schema processing. If you're interested in the mechanisms behind structured output, check out this engineering deep dive on our blog.
Schema generation with Pydantic
Pydantic is an industry standard Python library for data validation. With Pydantic, we’ll build precise schemas for LLM output to match.
For example, here's a schema for a basic `Person` object. Structured output supports multiple data types, required and optional fields, and additional validations like `max_length`.
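A minimal sketch of such a schema (the `Person` fields shown here are illustrative, not prescribed by the API):

```python
from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    # Required string fields with a length validation
    first_name: str = Field(max_length=50)
    last_name: str = Field(max_length=50)
    # Required integer field
    age: int
    # Optional field with a default
    email: Optional[str] = None
```

Pydantic's `model_json_schema()` method turns this class into the JSON schema that is passed to the model.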
Add response format to LLM call
The first time you pass a given schema to the model, it can take a minute for the schema to be processed and cached. Subsequent calls with the same schema run at normal speeds.
Once your object is defined, you can add it as a parameter to your LLM call with the `response_format` field:
The response may include an end-of-sequence token, which must be removed before the JSON can be parsed.
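A hedged sketch of building the request payload and stripping the end-of-sequence token (the payload shape, the `Person` fields, and the `</s>` token are assumptions; the exact values depend on your deployment and model):

```python
import json

from pydantic import BaseModel


class Person(BaseModel):
    first_name: str
    last_name: str
    age: int


# Build the request body; the exact payload shape depends on your deployment.
payload = {
    "messages": [{"role": "user", "content": "Make up a person."}],
    "max_tokens": 256,
    # Pass the Pydantic-generated JSON schema via response_format
    "response_format": {
        "type": "json_schema",
        "json_schema": {"schema": Person.model_json_schema()},
    },
}

# Simulated raw model output; the actual end-of-sequence token varies by model.
raw_output = '{"first_name": "Ada", "last_name": "Lovelace", "age": 36}</s>'
clean = raw_output.removesuffix("</s>")
person = Person(**json.loads(clean))
```

Send `payload` to your deployed model's endpoint with your usual HTTP client, then apply the same cleanup to the real response before parsing.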
Parsing LLM output
From the LLM, we expect output in the following format:
This example output is valid, which you can double-check with:
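For instance, a minimal validation sketch (the example JSON string and `Person` fields are illustrative):

```python
from pydantic import BaseModel, Field, ValidationError


class Person(BaseModel):
    first_name: str = Field(max_length=50)
    last_name: str = Field(max_length=50)
    age: int


# Illustrative LLM output that should conform to the schema
llm_output = '{"first_name": "Grace", "last_name": "Hopper", "age": 85}'

try:
    # model_validate_json parses the string and enforces types and validations
    person = Person.model_validate_json(llm_output)
except ValidationError as err:
    # Raised if the output violates the schema (wrong type, string too long, ...)
    raise SystemExit(f"Schema violation: {err}")
```

`model_validate_json` both parses and validates in one step, so a type mismatch or a `max_length` violation surfaces as a `ValidationError` rather than silently producing a malformed object.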