The Model class in model/model.py is the imperative surface you reach for when config.yaml alone can’t express your logic. It gives you a Python class with three lifecycle methods (__init__, load, and predict) that control how your model initializes, loads weights, and handles each request. When you need custom preprocessing, postprocessing, or response shaping, or want to run an architecture that Baseten’s built-in engines don’t support, you write that logic here.
When to write a Model class
Most deployments don’t need custom Python. If you’re deploying a supported open-source model, the config-only approach in Build your first model is faster. Write a custom Model class when you need to:
- Run a model architecture that Baseten’s engines don’t support.
- Add custom preprocessing or postprocessing around inference.
- Combine multiple models or libraries in a single endpoint.
- Control the HTTP response directly, including status codes and streaming.
model/model.py file. The simplest project structure is:
The class skeleton
model.py must contain a class with three methods:
model.py
- __init__ runs when the class is created. Read configuration parameters and runtime information here.
- load runs once at startup, before any requests. Download model weights or load them onto a GPU here. Separating this from __init__ keeps expensive operations out of the request path.
- predict runs on every API request. Process input, run inference, and return the response.
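The three methods described above can be sketched as follows. This is a minimal illustration, not a real model: the uppercase function is a stand-in for inference, and the "text" input key is an assumption made for the example.

```python
class Model:
    def __init__(self, **kwargs):
        # Runs when the class is created. Keep this cheap.
        self._model = None

    def load(self):
        # Runs once at startup. A real model would load weights here;
        # this stand-in just uppercases text.
        self._model = lambda text: text.upper()

    def predict(self, model_input):
        # Runs on every request. The "text" key is a hypothetical input field.
        return {"output": self._model(model_input["text"])}
```

Calling load before predict mirrors the order the lifecycle describes: construction, one-time loading, then per-request inference.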
__init__
The __init__ method initializes the Model class. Use it to read configuration parameters and runtime information.
The simplest signature accepts nothing:
model.py
To receive configuration and runtime information, define __init__ to accept these parameters:
model.py
- config: A dictionary containing the config.yaml for the model.
- data_dir: A string containing the path to the data directory for the model.
- secrets: A dictionary containing the secrets for the model. At runtime, these are populated with the actual values stored on Baseten.
- environment: A dictionary containing the environment for the model, if the model has been deployed to an environment. None otherwise.
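As a sketch of how the parameters above might be used, the example below stores them in __init__ and then constructs the class locally with stand-in values. The "model_name" config key, the secret name, and every value passed in are assumptions made for illustration.

```python
class Model:
    def __init__(self, config, data_dir, secrets, environment):
        # Keep references only; heavy work belongs in load().
        self._model_name = config.get("model_name", "unnamed")  # assumed key
        self._data_dir = data_dir
        self._secrets = secrets
        # environment is None when the model isn't deployed to an environment.
        self._in_environment = environment is not None


# Local stand-in for what the runtime passes in (all values hypothetical).
model = Model(
    config={"model_name": "demo"},
    data_dir="/app/data",
    secrets={"hf_access_token": "placeholder"},
    environment=None,
)
```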
model.py
If you don’t need all of them, accept **kwargs and pull out only what you need:
model.py
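A minimal sketch of the **kwargs form, assuming the model only cares about secrets and data_dir. Indexing with kwargs["secrets"] fails loudly if the value is missing, while kwargs.get returns None instead; which one you want is a design choice.

```python
class Model:
    def __init__(self, **kwargs):
        # Required: raise KeyError early if secrets are not provided.
        self._secrets = kwargs["secrets"]
        # Optional: fall back to None when data_dir is absent.
        self._data_dir = kwargs.get("data_dir")
```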
load
The load method initializes the model. This might include downloading model weights or loading them onto the GPU. Unlike the other methods, load accepts no parameters:
model.py
A deployment becomes ready only after load completes successfully. This step has a 30-minute timeout; if load hasn’t completed by then, the deployment is marked as failed.
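The sketch below illustrates the split between __init__ and load: load takes no parameters and stores whatever it prepares on self so predict can reuse it. The hard-coded weight list is a stand-in for reading real weights from disk or moving them onto a GPU, and the "inputs" key is hypothetical.

```python
class Model:
    def __init__(self, **kwargs):
        self._weights = None  # nothing expensive happens here

    def load(self):
        # Stand-in for an expensive one-time step such as loading weights.
        self._weights = [0.5, 2.0]

    def predict(self, model_input):
        if self._weights is None:
            raise RuntimeError("load() must run before predict()")
        # Toy inference: dot product of the weights and the inputs.
        inputs = model_input["inputs"]
        return {"score": sum(w * x for w, x in zip(self._weights, inputs))}
```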
predict
The predict method runs inference. The simplest signature returns a value directly:
model.py
The value returned by predict must be JSON-serializable, so it can be a dict, list, or str. See Response objects for stricter typing and direct control over the HTTP response.
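One way to check the JSON-serializable requirement locally is to pass the return value through json.dumps, as in this sketch. The word-count logic and the "text" input key are placeholders for real inference.

```python
import json


class Model:
    def predict(self, model_input):
        words = model_input["text"].split()
        # Plain dicts, lists, strings, and numbers serialize cleanly.
        return {"word_count": len(words), "words": words}


result = Model().predict({"text": "hello truss world"})
body = json.dumps(result)  # raises TypeError if result is not serializable
```

A value such as a Python set would make json.dumps raise TypeError, so convert it to a list before returning it.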
Async vs. sync
The predict method is synchronous by default. If your inference depends on APIs that require asyncio, write predict as a coroutine:
model.py
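A minimal sketch of a coroutine predict. The asyncio.sleep call stands in for any awaitable dependency, such as an async HTTP client, and the final asyncio.run line is only there so the example can be exercised outside the model server.

```python
import asyncio


class Model:
    async def predict(self, model_input):
        # Stand-in for awaiting an asyncio-based API.
        await asyncio.sleep(0.01)
        return {"echo": model_input["text"]}


# Outside the model server, drive the coroutine with asyncio.run.
result = asyncio.run(Model().predict({"text": "ping"}))
```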
Pre/post-processing
To separate I/O from inference and maximize throughput, define optional preprocess and postprocess methods alongside predict. Tasks like downloading images or formatting responses then run without blocking GPU or CPU execution:
model.py
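The following sketch shows one possible shape for the three stages, assuming the output of preprocess is passed to predict and the output of predict is passed to postprocess. The doubling function is a stand-in model, and run_pipeline is a hypothetical local driver used only to exercise the chain.

```python
class Model:
    def load(self):
        # Stand-in model: doubles every value.
        self._model = lambda values: [v * 2 for v in values]

    def preprocess(self, model_input):
        # I/O-style work: parse or fetch inputs before inference.
        return [float(x) for x in model_input["values"].split(",")]

    def predict(self, values):
        # Inference only; no parsing or formatting here.
        return self._model(values)

    def postprocess(self, outputs):
        # Shape the response after inference has finished.
        return {"doubled": outputs}


def run_pipeline(model, request):
    # Hypothetical local driver that mimics the stage order.
    return model.postprocess(model.predict(model.preprocess(request)))
```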
Cap how many requests run predict at once to prevent GPU or CPU overload:
config.yaml
If 10 requests arrive with predict_concurrency: 5, all 10 start preprocessing concurrently, but only 5 run inference at a time. The rest wait until a slot frees up.
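The behavior described above can be modeled with an asyncio.Semaphore. This is a local simulation of the semantics, not how Truss implements them: the handle and main functions and the stats dictionary are hypothetical, and the sleeps stand in for preprocessing and inference.

```python
import asyncio


async def handle(slots, stats):
    stats["started"] += 1          # every request starts preprocessing at once
    await asyncio.sleep(0)         # stand-in for preprocessing I/O
    async with slots:              # wait for a free inference slot
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        await asyncio.sleep(0.01)  # stand-in for inference
        stats["running"] -= 1


async def main(requests=10, predict_concurrency=5):
    slots = asyncio.Semaphore(predict_concurrency)
    stats = {"started": 0, "running": 0, "peak": 0}
    await asyncio.gather(*(handle(slots, stats) for _ in range(requests)))
    return stats


stats = asyncio.run(main())
```

After the run, stats["started"] counts all 10 requests, while stats["peak"] shows that no more than 5 were inside the inference section at the same time.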
Streaming
Truss also supports streaming output incrementally instead of waiting for the full response. For the full pattern, see Streaming output and endpoints.
Response objects
By default, Truss wraps prediction results into an HTTP response. For advanced use cases, create response objects manually to:
- Control HTTP status codes.
- Use server-sent events (SSEs) for streaming responses.
For stricter typing than a dict, list, or str, return a Pydantic model:
model.py
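A sketch of returning a Pydantic model from predict, assuming pydantic is installed in the environment. The Prediction schema, its fields, and the length-based labeling are hypothetical placeholders for a real response type and real inference.

```python
from pydantic import BaseModel


class Prediction(BaseModel):
    # Hypothetical response schema.
    label: str
    score: float


class Model:
    def predict(self, model_input) -> Prediction:
        text = model_input["text"]
        # Stand-in for real inference.
        label = "long" if len(text) > 10 else "short"
        return Prediction(label=label, score=1.0)
```

Declaring the fields up front means a missing or mistyped value fails at construction time rather than reaching the caller.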
To control the HTTP response directly, return a starlette.responses.Response:
model.py
For streaming, return a StreamingResponse. See Streaming output and endpoints for a complete SSE example.
To handle raw incoming requests, see Using request objects.
Bundled data
Most models need additional files at runtime, such as weights, tokenizers, configs, or reference datasets. For local files under ~1 GB total, bundle them in your Truss’s data/ directory. The contents are copied into your container image at build time and mounted at /app/data at runtime.
Access them from model.py through kwargs["data_dir"]:
model.py
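As a sketch of reading a bundled file through data_dir, the example below loads a JSON list of labels in load. The labels.json file name and the "class_id" input key are hypothetical; substitute whatever files you place under data/.

```python
import json
from pathlib import Path


class Model:
    def __init__(self, **kwargs):
        # data_dir points at the bundled data directory.
        self._data_dir = Path(kwargs["data_dir"])
        self._labels = None

    def load(self):
        # labels.json is a hypothetical file bundled under data/.
        with open(self._data_dir / "labels.json") as f:
            self._labels = json.load(f)

    def predict(self, model_input):
        return {"label": self._labels[model_input["class_id"]]}
```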
Lay your data/ directory out like this Stable Diffusion 2.1 example:
Download files at runtime
Use this pattern when you need fine-grained control over the download, such as decrypting files on the fly or lazily fetching a subset of a larger dataset. The example below loads weights from a private S3 bucket using boto3.
To load private S3 weights at deploy time, prefer BDN with IAM credentials. BDN mirrors the weights once and serves them from a multi-tier cache; the pattern below re-downloads on every cold start unless you add caching.
config.yaml:
model.py, then deploy with truss push --watch:
model.py
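A minimal sketch of the runtime download, assuming boto3 is listed in the model's requirements and the AWS credentials are stored as secrets. The bucket name, object key, secret names, and download directory are all hypothetical. The existence check only avoids repeat downloads within one running container; it does not persist across cold starts.

```python
import tempfile
from pathlib import Path

BUCKET = "my-private-bucket"  # hypothetical bucket name
KEY = "weights/model.bin"     # hypothetical object key


class Model:
    # Hypothetical local directory for downloaded files.
    DOWNLOAD_DIR = Path(tempfile.gettempdir()) / "weights"

    def __init__(self, **kwargs):
        self._secrets = kwargs["secrets"]
        self._target = self.DOWNLOAD_DIR / "model.bin"
        self._weights = None

    def _s3_client(self):
        # Imported lazily so this module can be imported without boto3.
        import boto3

        return boto3.client(
            "s3",
            aws_access_key_id=self._secrets["aws_access_key_id"],
            aws_secret_access_key=self._secrets["aws_secret_access_key"],
        )

    def load(self):
        # Download only if the file is not already on local disk.
        if not self._target.exists():
            self._target.parent.mkdir(parents=True, exist_ok=True)
            self._s3_client().download_file(BUCKET, KEY, str(self._target))
        self._weights = self._target.read_bytes()
```

Keeping client creation in its own method makes it straightforward to swap in a fake client when testing load locally.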
Next steps
- HTTP endpoints: Add chat_completions, completions, embeddings, messages, or responses to serve matching /v1/* routes.
- Streaming output and endpoints: Return generated output incrementally.
- Custom health checks: Define readiness and liveness behavior.
- Configuration: Full reference for config.yaml options.
- Model weights: Fetch large weights through BDN instead of bundling them, and cache runtime-written files with runtime caching.