Truss processes client requests by extracting and validating payloads. For advanced use cases, you can access the raw request object to:
- Customize payload deserialization (e.g., binary protocol buffers).
- Handle client disconnections and cancel long-running predictions.
You can mix request objects with standard inputs, or use requests exclusively to skip payload parsing for better performance.
Using Request Objects in Truss
You can add a `fastapi.Request` argument to `preprocess`, `predict`, and `postprocess`:
```python
import fastapi


class Model:
    def preprocess(self, request: fastapi.Request):
        ...

    def predict(self, inputs, request: fastapi.Request):
        ...

    def postprocess(self, inputs, request: fastapi.Request):
        ...
```
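As a sketch of custom payload deserialization, a request-only `predict` can read and parse the raw body itself. The binary layout below (packed little-endian float32 values) is a hypothetical placeholder:

```python
import struct

import fastapi


class Model:
    async def predict(self, request: fastapi.Request) -> dict:
        # Truss performs no payload extraction when only the request is
        # used, so the raw bytes arrive untouched.
        raw = await request.body()
        # Hypothetical wire format: consecutive little-endian float32 values.
        values = list(struct.unpack(f"<{len(raw) // 4}f", raw))
        return {"values": values}
```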
Rules for Using Requests
- The request must be type-annotated as `fastapi.Request`.
- If only requests are used, Truss skips payload extraction for better performance.
- If both request objects and standard inputs are used:
  - The request must be the second argument.
  - Preprocessing transforms the inputs, but the request object is passed through unchanged.
  - `postprocess` cannot use only the request; it must receive the model's output.
- If `predict` only uses the request, `preprocess` cannot be used.
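The example below polls `request.is_disconnected()` both before and during streaming, so a long-running prediction can stop as soon as the client goes away: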
```python
import asyncio
import logging

import fastapi


class Model:
    async def predict(self, inputs, request: fastapi.Request):
        await asyncio.sleep(1)
        if await request.is_disconnected():
            logging.warning("Cancelled before generation.")
            return  # Cancel the request on the model engine here.

        for i in range(5):
            await asyncio.sleep(1.0)
            logging.warning(i)
            yield str(i)  # Streaming response.
            if await request.is_disconnected():
                logging.warning("Cancelled during generation.")
                return  # Cancel the request on the model engine here.
```
You must implement request cancellation at the model level, which varies by framework.
Cancelling Requests in Specific Frameworks
TRT-LLM (Polling-Based Cancellation)
For TensorRT-LLM, use `response_iterator.cancel()` to terminate a streaming request. Here, `is_cancelled_fn` stands in for a disconnect check such as `request.is_disconnected`:
```python
async for request_output in response_iterator:
    if await is_cancelled_fn():
        logging.info("Request cancelled. Cancelling Triton request.")
        response_iterator.cancel()
        return
```
vLLM (Abort API)
For vLLM, use `engine.abort()` to stop processing:
```python
async for request_output in results_generator:
    if await request.is_disconnected():
        await engine.abort(request_id)
        return
```
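For context, here is a fuller sketch of how `engine`, `results_generator`, and `request_id` could fit together, assuming vLLM's `AsyncLLMEngine` API; the model name and sampling parameters are placeholders:

```python
import uuid

import fastapi
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams


class Model:
    def load(self):
        # Placeholder model; configure the engine args for your deployment.
        self._engine = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(model="facebook/opt-125m")
        )

    async def predict(self, inputs, request: fastapi.Request):
        request_id = str(uuid.uuid4())
        results_generator = self._engine.generate(
            inputs["prompt"], SamplingParams(max_tokens=64), request_id
        )
        num_sent_chars = 0
        async for request_output in results_generator:
            if await request.is_disconnected():
                # Free the engine's resources for this request.
                await self._engine.abort(request_id)
                return
            # The output text is cumulative; yield only the new suffix.
            text = request_output.outputs[0].text
            yield text[num_sent_chars:]
            num_sent_chars = len(text)
```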
Unsupported Request Features
- Streaming file uploads: use URLs instead of embedding large data in the request (see the sketch below).
- Client-side headers: most headers are stripped; include any necessary metadata in the payload itself.
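For example, a minimal sketch of the URL pattern, assuming the payload carries a `url` field and `httpx` is available in the environment:

```python
import httpx


class Model:
    async def preprocess(self, inputs):
        # Download the large file referenced by URL instead of receiving
        # it inline in the request body.
        async with httpx.AsyncClient() as client:
            response = await client.get(inputs["url"])
            response.raise_for_status()
        return {"file_bytes": response.content}
```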