Pre/post-processing
Deploy a model that makes use of pre- and post-processing
Out of the box, Truss limits the number of concurrent predictions that happen on a single container. This ensures that the CPU, and for many models the GPU, do not get overloaded, and that the model can continue to respond to requests in periods of high load.
However, many models, in addition to having compute components, also have IO requirements. For example, a model that classifies images may need to download the image from a URL before it can classify it.
Truss provides a way to separate the IO component from the compute component, to ensure that any IO does not prevent utilization of the compute on your pod.
To do this, you can use the pre- and post-process methods on a Truss. These methods can be defined like this:
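Here is a minimal sketch of a `model/model.py`, assuming the standard Truss `Model` class layout; the `image_url` input key and the stand-in classifier are illustrative placeholders, not a real model:

```python
import requests


class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # Runs once per container; load real weights here.
        # Stand-in classifier so this sketch is self-contained.
        self._model = lambda image_bytes: {"label": "cat", "size": len(image_bytes)}

    def preprocess(self, model_input):
        # IO-bound work: download the image before predict() runs.
        # "image_url" is an assumed input key for this example.
        response = requests.get(model_input["image_url"], timeout=10)
        response.raise_for_status()
        return {"image_bytes": response.content}

    def predict(self, model_input):
        # Compute-bound work: the only step subject to the concurrency limit.
        return self._model(model_input["image_bytes"])

    def postprocess(self, model_output):
        # Light work after prediction, e.g. formatting the response.
        return {"prediction": model_output["label"], "image_size_bytes": model_output["size"]}
```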
When the model is invoked, any logic defined in the pre- or post-process methods runs on a separate thread and is not subject to the same concurrency limits as predict. So let's say you have a model that can handle 5 concurrent requests:

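In Truss, this limit is set in `config.yaml`; a sketch, assuming the `runtime.predict_concurrency` setting:

```yaml
runtime:
  predict_concurrency: 5
```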
If you hit it with 10 requests, all 10 will begin pre-processing, but when the 6th request is ready to begin the predict method, it will have to wait for one of the first 5 requests to finish. This keeps the GPU from being overloaded while preventing IO from blocking the compute logic, so you can achieve maximum throughput.
If predict returns a generator (e.g. for streaming LLM outputs), the model must not have a postprocess method. Postprocessing can only be used when the prediction result is instantly available as a whole. In case of streaming, move any postprocessing logic into predict or apply it client-side.
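A sketch of folding postprocessing into a streaming predict; the `_stream_tokens` helper and the `prompt` input key are hypothetical stand-ins for a real LLM token stream:

```python
from typing import Generator


class Model:
    def load(self):
        pass

    def _stream_tokens(self, prompt: str) -> Generator[str, None, None]:
        # Hypothetical stand-in for an LLM's token stream.
        for token in prompt.split():
            yield token

    def predict(self, model_input) -> Generator[str, None, None]:
        # No postprocess method: transform each chunk inline instead,
        # since the full result never exists as a whole on the server.
        for token in self._stream_tokens(model_input["prompt"]):
            yield token.upper() + " "
```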