torch.compile artifacts, so other replicas and deployments can reuse them. It’s the supported path for runtime-written files that benefit from sharing. For read-only weights known at deploy time, use BDN instead.
How b10cache works
Deployments sometimes produce files that are useful to other replicas. Using torch.compile, for example, produces a cache that can speed up future torch.compile calls on the same function, reducing cold start time for other replicas.
b10cache stores these files. It’s a volume mounted over the network onto each of your pods, with two scopes:
Organization scope: /cache/org/
Shared across every pod you deploy in your organization. Move a file into this directory and any pod can read it.
Deployment scope: /cache/model/
Shared across every pod within a single deployment. Use this scope to keep deployment filesystems isolated.
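Because both scopes are mounted as ordinary directories, standard file operations apply. The following is a minimal sketch, not part of b10cache itself: `publish` and `read_or_build` are hypothetical helper names, and the write-then-rename step assumes the network mount supports atomic renames within a directory. It reads a shared file if one exists and otherwise builds and publishes it, so the first pod to arrive does the work once.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable

# Mount points named in the text above.
ORG_SCOPE = Path("/cache/org")
DEPLOYMENT_SCOPE = Path("/cache/model")


def publish(scope_dir: Path, name: str, data: bytes) -> Path:
    """Write to a temp file in the scope directory, then rename it into place.

    Assumption: a rename within one directory is atomic on the mount, so
    other pods see either the complete file or no file at all.
    """
    scope_dir.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=scope_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
        final_path = scope_dir / name
        os.replace(tmp_name, final_path)
    except BaseException:
        # Do not leave a partial temp file behind on failure.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return final_path


def read_or_build(scope_dir: Path, name: str, build: Callable[[], bytes]) -> bytes:
    """Return the shared file's bytes, building and publishing it on a miss."""
    try:
        return (scope_dir / name).read_bytes()
    except FileNotFoundError:
        data = build()
        publish(scope_dir, name, data)
        return data
```

Passing `ORG_SCOPE` shares the file with every pod in the organization, while `DEPLOYMENT_SCOPE` keeps it within one deployment. If two pods miss at the same time, both build and the last rename wins, which is acceptable when the builds produce equivalent files.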
Not persistent object storage
b10cache is reliable, but treat it as a cache, not a database. Always have a fallback path that runs if the file isn’t there yet. For example, the first replica of a new deployment writes to b10cache rather than reading from it.
Torch compile caching
PyTorch’s torch.compile can cut inference time by up to 40%, but compiling the model adds latency to cold starts: it must compile before serving its first request.
This overhead compounds in production, where:
- Models scale up and down with demand.
- New pods spawn to handle traffic spikes.
- Each new pod repeats the compilation from scratch.
Implementation options
Two deployment patterns benefit from torch compile caching:
- Truss models: a model.py that calls torch.compile. See Truss models.
- vLLM servers: a vLLM custom server. See vLLM servers.
Truss models (model.py)
API reference
We expose two API calls that return an OperationStatus object to help you control program flow based on the result.
load_compile_cache()
If you previously saved a compilation cache for this model, load it to speed up compilation on this pod.
Returns:
- OperationStatus.SUCCESS → successful load
- OperationStatus.SKIPPED → torch compilation artifacts already exist on the pod
- OperationStatus.ERROR → general catch-all errors
- OperationStatus.DOES_NOT_EXIST → no cache file was found
save_compile_cache()
Save your model’s torch compilation cache for future use. Call this after running prompts that warm up your model and trigger compilation.
Returns:
- OperationStatus.SUCCESS → successful save
- OperationStatus.SKIPPED → a compile cache already exists in the shared directory
- OperationStatus.ERROR → general catch-all errors
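To make the return values concrete, here is a small sketch of how a caller might interpret a load result. The `OperationStatus` enum below is a local stand-in that only mirrors the status names listed above (it is not imported from the library), and `load_outcome` is a hypothetical helper.

```python
from enum import Enum, auto


class OperationStatus(Enum):
    """Local stand-in mirroring the status names in the API reference."""
    SUCCESS = auto()
    SKIPPED = auto()
    ERROR = auto()
    DOES_NOT_EXIST = auto()


def load_outcome(status: OperationStatus) -> str:
    """Summarize what a load status implies for the upcoming compilation."""
    if status is OperationStatus.SUCCESS:
        return "warm: shared cache restored to this pod"
    if status is OperationStatus.SKIPPED:
        return "warm: compilation artifacts already on this pod"
    if status is OperationStatus.DOES_NOT_EXIST:
        return "cold: no shared cache saved yet"
    # ERROR is the catch-all, so treat it as a cold start worth logging.
    return "cold: load failed, compiling from scratch"
```

Treating `ERROR` as a cold start rather than a fatal failure matches the idea that the cache is an optimization: the pod can still compile from scratch.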
Implementation example
Here is an example of compile caching for Flux, an image generation model. Note how the result of load_compile_cache is kept and then used to decide whether to call save_compile_cache.
Update config.yaml
Under requirements, add b10-transfer:
config.yaml
Update model.py
Import the library and use the two functions to speed up torch compilation time:
model.py
See the full example.
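The load, warm up, then save ordering described in this section can be sketched independently of any particular model. This is an illustration under assumptions, not the linked example: the enum is a local stand-in for the library's `OperationStatus`, the three callables are injected so the sketch runs without torch or b10-transfer, and the rule "save unless the load succeeded" is one reasonable policy rather than a documented requirement.

```python
from enum import Enum, auto
from typing import Callable, List


class OperationStatus(Enum):
    """Local stand-in mirroring the status names in the API reference."""
    SUCCESS = auto()
    SKIPPED = auto()
    ERROR = auto()
    DOES_NOT_EXIST = auto()


def start_with_compile_cache(
    load_cache: Callable[[], OperationStatus],
    warm_up: Callable[[], None],
    save_cache: Callable[[], OperationStatus],
) -> List[str]:
    """Run the startup sequence and return a log of the steps taken."""
    events: List[str] = []

    # 1. Try to restore a previously saved cache before compiling.
    load_status = load_cache()
    events.append(f"load:{load_status.name}")

    # 2. Warm up the model; this is what triggers compilation.
    warm_up()
    events.append("warm_up")

    # 3. Save only if the load did not succeed (assumed policy).
    if load_status is not OperationStatus.SUCCESS:
        save_status = save_cache()
        events.append(f"save:{save_status.name}")
    return events
```

In a real model.py, `load_cache` and `save_cache` would be the library's `load_compile_cache` and `save_compile_cache`, and `warm_up` would run a few representative prompts through the compiled model.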
vLLM servers (CLI tool)
Use this whenever you enable compile options with vLLM (compiling is the default on vLLM V1). The CLI tool runs automatically: it loads the compile cache if you’ve saved one before, and saves it otherwise. Make two changes in config.yaml:
Add requirements
Under requirements, add b10-transfer:
config.yaml
Update start command
Under start command, add b10-compile-cache & right before the vllm serve call:
config.yaml
See the full example.
Advanced configuration
Parameter overrides
The torch compile caching library supports several environment variables for fine-tuning behavior in production environments:
Cache directory configuration
TORCHINDUCTOR_CACHE_DIR (optional)
- Default: /tmp/torchinductor_<username>
- Description: Directory where PyTorch stores compilation artifacts locally
- Allowed prefixes: /tmp/, /cache/, ~/.cache
- Usage: Set this if you need to customize where torch compilation artifacts are stored on the local filesystem
B10FS_CACHE_DIR (optional)
- Default: Derived from b10cache mount point + /compile_cache
- Description: Directory in b10cache where compilation artifacts are persisted across deployments
- Usage: Typically doesn’t need to be changed as it’s automatically configured based on your b10cache setup
LOCAL_WORK_DIR (optional)
- Default: /app
- Description: Local working directory for temporary operations
- Allowed prefixes: /app/, /tmp/, /cache/
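The allowed-prefix rule can be checked before a deployment sets one of these variables. The sketch below shows one plausible reading of that rule; `is_allowed_dir` is a hypothetical helper, and the library's own validation may differ in detail.

```python
import os
from typing import Iterable

# Prefix lists copied from the parameter descriptions above.
TORCHINDUCTOR_ALLOWED = ("/tmp/", "/cache/", "~/.cache")
LOCAL_WORK_ALLOWED = ("/app/", "/tmp/", "/cache/")


def is_allowed_dir(path: str, allowed_prefixes: Iterable[str]) -> bool:
    """Return True if the normalized path is at or under an allowed prefix."""
    # Expand "~" and collapse ".." so "/tmp/../etc" cannot slip through.
    candidate = os.path.normpath(os.path.expanduser(path))
    for prefix in allowed_prefixes:
        root = os.path.normpath(os.path.expanduser(prefix))
        if candidate == root or candidate.startswith(root + os.sep):
            return True
    return False
```

Comparing against `root + os.sep` rather than the bare prefix keeps a sibling directory such as `/tmpfoo` from matching `/tmp`.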
Performance and resource limits
MAX_CACHE_SIZE_MB (optional)
- Default: 1024 (1 GB)
- Cap: Limited by MAX_CACHE_SIZE_CAP_MB for safety
- Description: Maximum size of a single cache archive in megabytes
- Usage: Increase for larger models with extensive compilation artifacts, decrease to save storage
MAX_CONCURRENT_SAVES (optional)
- Default: 50
- Cap: Limited by MAX_CONCURRENT_SAVES_CAP for safety
- Description: Maximum number of concurrent save operations allowed
- Usage: Tune based on your deployment’s concurrency requirements and storage performance
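The relationship between an override and its cap can be pictured as a clamp. This sketch assumes that behavior; the text names the caps but does not state their values, so `EXAMPLE_CAP_MB` is a made-up number and `resolve_capped_int` is a hypothetical helper, not the library's code.

```python
import os
from typing import Mapping

# Hypothetical cap for illustration only; the real cap value is not given above.
EXAMPLE_CAP_MB = 4096


def resolve_capped_int(
    name: str,
    default: int,
    cap: int,
    env: Mapping[str, str] = os.environ,
) -> int:
    """Read an integer override from the environment and clamp it to the cap."""
    raw = env.get(name)
    if raw is None:
        return min(default, cap)
    try:
        value = int(raw)
    except ValueError:
        # Unparseable overrides fall back to the default.
        return min(default, cap)
    if value < 1:
        # Zero or negative limits make no sense here; use the default.
        return min(default, cap)
    return min(value, cap)
```

For example, `resolve_capped_int("MAX_CACHE_SIZE_MB", 1024, EXAMPLE_CAP_MB)` yields the default when the variable is unset and never exceeds the cap when it is set.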
Cleanup and maintenance
CLEANUP_LOCK_TIMEOUT_SECONDS (optional)
- Default: 30
- Cap: Limited by LOCK_TIMEOUT_CAP_SECONDS
- Description: Timeout for cleaning up stale lock files, to prevent deadlocks. Stale locks can occur when a replica crashes while holding the lock.
- Usage: Decrease if you’re experiencing deadlocks in high-load scenarios
CLEANUP_INCOMPLETE_TIMEOUT_SECONDS (optional)
- Default: 60
- Cap: Limited by INCOMPLETE_TIMEOUT_CAP_SECONDS
- Description: Timeout for cleaning up incomplete cache files
- Usage: Increase for slower storage systems or larger cache files
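To illustrate what a cleanup timeout means in practice, the sketch below treats a file as stale once its modification time is older than the timeout. This is an assumed mechanism for illustration; the library's actual staleness test is not described in the text, and `is_stale` and `remove_if_stale` are hypothetical names.

```python
import os
import time
from typing import Optional


def is_stale(path: str, timeout_seconds: float, now: Optional[float] = None) -> bool:
    """Return True if the file exists and was last modified before the timeout."""
    current = time.time() if now is None else now
    try:
        age = current - os.path.getmtime(path)
    except FileNotFoundError:
        # A file that is already gone is not stale.
        return False
    return age > timeout_seconds


def remove_if_stale(path: str, timeout_seconds: float, now: Optional[float] = None) -> bool:
    """Delete a stale lock or incomplete file; return True if it was removed."""
    if not is_stale(path, timeout_seconds, now):
        return False
    try:
        os.remove(path)
    except FileNotFoundError:
        # Another replica cleaned it up first.
        return False
    return True
```

Under this reading, a shorter timeout reclaims a crashed replica's lock sooner, while a longer one avoids deleting files that a slow but healthy replica is still writing.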
Example configuration
config.yaml
The defaults suit most workloads. Tune them if a model needs a larger cache archive or hits contention on concurrent saves.
Next steps
BDN
Cache read-only weights known at deploy time
Performance optimization
Reduce latency and cold starts across your deployment