torch.compile artifacts, so other replicas and deployments can reuse them. It’s the supported path for runtime-written files that benefit from sharing. For read-only weights known at deploy time, use BDN instead.
How b10cache works
Deployments sometimes produce files that are useful to other replicas. Using torch.compile, for example, produces a cache that speeds up later torch.compile calls on the same function, which reduces cold start time for other replicas.
b10cache stores these files. It’s a volume mounted over the network onto each of your pods, with two scopes:
Organization scope: /cache/org/
Shared across every pod you deploy in your organization. Move a file into this directory and any pod can read it.
Deployment scope: /cache/model/
Shared across every pod within a single deployment. Use this scope to keep deployment filesystems isolated.
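A minimal sketch of how a pod might use either scope: copy the file from the shared directory if it is there, otherwise build it locally and publish it for other pods. The function name `fetch_or_build`, its parameters, and the temp-file naming are hypothetical and not part of b10cache. The rename step assumes the temp file and the final file are in the same directory, so the rename happens within one filesystem.

```python
import os
import shutil
import uuid
from pathlib import Path
from typing import Callable


def fetch_or_build(name: str, cache_dir: Path, local_dir: Path,
                   build: Callable[[Path], None]) -> Path:
    """Return a local copy of `name`, building it if the cache has none."""
    local_dir.mkdir(parents=True, exist_ok=True)
    local_path = local_dir / name
    cached_path = cache_dir / name

    if cached_path.is_file():
        # Cache hit: copy from the shared volume to local disk.
        shutil.copy2(cached_path, local_path)
        return local_path

    # Cache miss (for example, the first replica): build locally.
    build(local_path)

    # Publish for other pods. Copy to a temporary name in the same
    # directory, then rename, so readers never see a half-written file.
    tmp_path = cache_dir / f".{name}.{uuid.uuid4().hex}.tmp"
    try:
        cache_dir.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(local_path, tmp_path)
        os.replace(tmp_path, cached_path)
    except OSError:
        # Best effort: a failed publish should not fail the pod.
        if tmp_path.exists():
            tmp_path.unlink()
    return local_path
```

In a deployment, `cache_dir` would be a directory under /cache/org/ or /cache/model/, depending on which pods should see the file.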
Not persistent object storage
b10cache is reliable, but treat it as a cache, not a database. Always have a fallback path that runs if the file isn’t there yet. For example, the first replica of a new deployment writes to b10cache rather than reading from it.
Torch compile caching
PyTorch’s torch.compile can cut inference time by up to 40%, but compiling the model adds latency to cold starts: the model must compile before it serves its first request.
This overhead compounds in production, where:
- Models scale up and down with demand.
- New pods spawn to handle traffic spikes.
- Each new pod repeats the compilation from scratch.
Implementation options
Two deployment patterns benefit from torch compile caching: Truss models, which call the library from model.py, and vLLM servers, which use a CLI tool.
Truss models (model.py)
API reference
We expose two API calls that return an OperationStatus object to help you control program flow based on the result.
load_compile_cache()
If you have previously saved a compilation cache for this model, load it to speed up compilation on this pod.
Returns:
- OperationStatus.SUCCESS → successful load
- OperationStatus.SKIPPED → torch compilation artifacts already exist on the pod
- OperationStatus.ERROR → general catch-all errors
- OperationStatus.DOES_NOT_EXIST → no cache file was found
save_compile_cache()
Save your model’s torch compilation cache for future use. Call this after running prompts that warm up your model and trigger compilation.
Returns:
- OperationStatus.SUCCESS → successful save
- OperationStatus.SKIPPED → a compile cache already exists in the shared directory
- OperationStatus.ERROR → general catch-all errors
Implementation example
Here is an example of compile caching for Flux, an image generation model. Note how the result of load_compile_cache is stored and then used to decide whether to call save_compile_cache.
Update config.yaml
Under requirements, add b10-transfer:
Update model.py
Import the library and use the two functions to speed up torch compilation time:
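The control flow can be sketched as follows. This is not the Flux example itself: the helper `compile_with_cache` is hypothetical, the `OperationStatus` enum is a local stand-in that only mirrors the member names listed in the API reference, and the library functions are passed in as arguments so the sketch runs without b10-transfer or torch installed. In a real model.py you would import `load_compile_cache`, `save_compile_cache`, and `OperationStatus` from the library and run this logic while the model loads.

```python
from enum import Enum, auto
from typing import Callable, Optional, Tuple


class OperationStatus(Enum):
    # Stand-in for this sketch; the real enum comes from b10-transfer.
    SUCCESS = auto()
    SKIPPED = auto()
    ERROR = auto()
    DOES_NOT_EXIST = auto()


def compile_with_cache(
    load_compile_cache: Callable[[], OperationStatus],
    save_compile_cache: Callable[[], OperationStatus],
    compile_and_warm_up: Callable[[], None],
) -> Tuple[OperationStatus, Optional[OperationStatus]]:
    """Load the cache, compile, and save only if nothing was loaded."""
    # 1. Try to pull a previously saved cache before compiling.
    load_status = load_compile_cache()

    # 2. Compile and run warm-up prompts. With a loaded cache this is
    #    fast; without one it is the fallback path that does the work.
    compile_and_warm_up()

    # 3. A successful load means the shared cache already has these
    #    artifacts, so there is nothing new to save.
    if load_status is OperationStatus.SUCCESS:
        return load_status, None

    # Otherwise publish the fresh artifacts for future replicas.
    return load_status, save_compile_cache()
```

Saving whenever the load was anything other than SUCCESS is one reasonable policy. Because save_compile_cache returns SKIPPED when the shared directory already holds a cache, an unnecessary call is cheap.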
See the full example.
vLLM servers (CLI tool)
Use this whenever you enable compile options with vLLM (compiling is the default on vLLM V1). The CLI tool runs automatically: it loads the compile cache if you’ve saved one before, and saves it otherwise. Make two changes in config.yaml:
Add requirements
Under requirements, add b10-transfer:
Update start command
Under start command, add b10-compile-cache & right before the vllm serve call:
See the full example.
Advanced configuration
Parameter overrides
The torch compile caching library supports several environment variables for fine-tuning behavior in production environments:
Cache directory configuration
TORCHINDUCTOR_CACHE_DIR (optional)
- Default: /tmp/torchinductor_<username>
- Description: Directory where PyTorch stores compilation artifacts locally
- Allowed prefixes: /tmp/, /cache/, ~/.cache
- Usage: Set this if you need to customize where torch compilation artifacts are stored on the local filesystem
B10FS_CACHE_DIR (optional)
- Default: Derived from b10cache mount point + /compile_cache
- Description: Directory in b10cache where compilation artifacts are persisted across deployments
- Usage: Typically doesn’t need to be changed as it’s automatically configured based on your b10cache setup
LOCAL_WORK_DIR (optional)
- Default: /app
- Description: Local working directory for temporary operations
- Allowed prefixes: /app/, /tmp/, /cache/
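To check a candidate directory against the allowed prefixes before deploying, something like the sketch below can help. It illustrates what a prefix rule usually means and is not the library's own validation code; the function name `is_under_allowed_prefix` is hypothetical.

```python
import os
from pathlib import Path
from typing import Iterable


def is_under_allowed_prefix(path: str, allowed_prefixes: Iterable[str]) -> bool:
    """Return True if `path` equals or sits below one of the prefixes."""
    # Expand "~" and collapse ".." so "/tmp/../etc" is not accepted.
    resolved = Path(os.path.abspath(os.path.expanduser(path)))
    for prefix in allowed_prefixes:
        root = Path(os.path.abspath(os.path.expanduser(prefix)))
        # Compare whole path components, so "/tmpfoo" does not match "/tmp".
        if resolved == root or root in resolved.parents:
            return True
    return False


# Prefixes listed above for TORCHINDUCTOR_CACHE_DIR.
TORCHINDUCTOR_PREFIXES = ("/tmp/", "/cache/", "~/.cache")
```

For example, `is_under_allowed_prefix("/cache/torchinductor", TORCHINDUCTOR_PREFIXES)` is true, while a path under /var is rejected.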
Performance and resource limits
MAX_CACHE_SIZE_MB (optional)
- Default: 1024 (1 GB)
- Cap: Limited by MAX_CACHE_SIZE_CAP_MB for safety
- Description: Maximum size of a single cache archive in megabytes
- Usage: Increase for larger models with extensive compilation artifacts, decrease to save storage
MAX_CONCURRENT_SAVES (optional)
- Default: 50
- Cap: Limited by MAX_CONCURRENT_SAVES_CAP for safety
- Description: Maximum number of concurrent save operations allowed
- Usage: Tune based on your deployment’s concurrency requirements and storage performance
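One way to picture how a default and a cap interact is the sketch below. It assumes that "limited by" means an oversized value is clamped down to the cap rather than rejected, which this page does not state. The helper `read_capped_int` is hypothetical, and the cap values are not documented here, so the caller must supply one.

```python
import os
from typing import Mapping, Optional


def read_capped_int(name: str, default: int, cap: int,
                    env: Optional[Mapping[str, str]] = None) -> int:
    """Read an integer setting, falling back to `default` and clamping to `cap`."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return min(default, cap)
    try:
        value = int(raw)
    except ValueError:
        # Unparseable input: ignore it and use the default.
        return min(default, cap)
    if value <= 0:
        # Non-positive sizes and counts make no sense here.
        return min(default, cap)
    return min(value, cap)
```

With a hypothetical cap of 4096, setting MAX_CACHE_SIZE_MB to 2048 yields 2048, while 99999 is clamped to 4096.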
Cleanup and maintenance
CLEANUP_LOCK_TIMEOUT_SECONDS (optional)
- Default: 30
- Cap: Limited by LOCK_TIMEOUT_CAP_SECONDS
- Description: Timeout for cleaning up stale lock files, to prevent deadlocks. Stale locks may occur when a replica holding the lock crashes.
- Usage: Decrease if you’re experiencing deadlocks in high-load scenarios
CLEANUP_INCOMPLETE_TIMEOUT_SECONDS (optional)
- Default: 60
- Cap: Limited by INCOMPLETE_TIMEOUT_CAP_SECONDS
- Description: Timeout for cleaning up incomplete cache files
- Usage: Increase for slower storage systems or larger cache files
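The idea behind both timeouts can be sketched as age-based cleanup: a file that has not been modified for longer than the timeout is assumed to belong to a crashed or abandoned operation and is removed. This is an illustration only. The helper `remove_stale_files` and the `.lock` suffix in the usage note are hypothetical, and the library's actual file naming and cleanup logic are not shown on this page.

```python
import time
from pathlib import Path
from typing import List, Optional


def remove_stale_files(directory: Path, suffix: str, timeout_seconds: float,
                       now: Optional[float] = None) -> List[Path]:
    """Delete files ending in `suffix` whose mtime is older than the timeout."""
    now = time.time() if now is None else now
    removed = []
    for path in sorted(Path(directory).glob(f"*{suffix}")):
        try:
            age = now - path.stat().st_mtime
            if age > timeout_seconds:
                path.unlink()
                removed.append(path)
        except FileNotFoundError:
            # Another replica removed the file first; nothing to do.
            continue
    return removed
```

Calling `remove_stale_files(cache_dir, ".lock", 30)` would drop locks untouched for more than 30 seconds. A shorter timeout recovers from crashed replicas sooner, but risks removing the lock of a replica that is merely slow.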
Example configuration
The defaults suit most workloads. Tune them if a model needs a larger cache archive or hits contention on concurrent saves.