Runtime caching

b10cache stores files your model writes at runtime, such as torch.compile artifacts, so other replicas and deployments can reuse them. It’s the supported path for runtime-written files that benefit from sharing. For read-only weights known at deploy time, use BDN instead.

How b10cache works

Deployments sometimes produce files that are useful to other replicas. Using torch.compile, for example, produces a cache that can speed up future torch.compile calls on the same function, reducing cold start time for other replicas. b10cache stores these files. It’s a volume mounted over the network onto each of your replicas, with two scopes:

Organization scope: `/cache/org/`

Shared across every replica you deploy in your organization. Move a file into this directory and any replica can read it.

Deployment scope: `/cache/model/`

Shared across every replica within a single deployment. Use this scope to keep deployment filesystems isolated.

Not persistent object storage

b10cache is reliable, but treat it as a cache, not a database. Always have a fallback path that runs if the file isn’t there yet. For example, the first replica of a new deployment writes to b10cache rather than reading from it.

Torch compile caching

PyTorch’s torch.compile can cut inference time by up to 40%, but compiling the model adds latency to cold starts: it must compile before serving its first request. This overhead compounds in production, where:

Models scale up and down with demand.
New replicas spawn to handle traffic spikes.
Each new replica repeats the compilation from scratch.

Torch compile caching persists compilation artifacts across deployments and replica restarts in b10cache, so a new replica loads them instead of recompiling. The library handles large scale-ups, managing race conditions and staying fault-tolerant on the shared cache. In practice, this strategy reduces compilation latencies to roughly 5 to 20 seconds, depending on the model.

Implementation options

There are two different deployment patterns that benefit from torch compile caching:

Truss models: a model.py that calls torch.compile. See Truss models.
vLLM servers: a vLLM custom server. See vLLM servers.

Truss models (`model.py`)

API reference

We expose two API calls that return an OperationStatus object to help you control program flow based on the result.

load_compile_cache()

If you have previously saved compilation cache for this model, load it to speed up the compilation for the model on this replica.Returns:

OperationStatus.SUCCESS → successful load
OperationStatus.SKIPPED → if torch compilation artifacts already exist on the replica
OperationStatus.ERROR → general catch-all errors
OperationStatus.DOES_NOT_EXIST → if no cache file was found

save_compile_cache()

Save your model’s torch compilation cache for future use. This should be called after running prompts to warm up your model and trigger compilation.Returns:

OperationStatus.SUCCESS → successful save
OperationStatus.SKIPPED → skipped because compile cache already exists in shared directory
OperationStatus.ERROR → general catch-all errors

Implementation example

Here is an example of compile caching for Flux, an image generation model. Note how we save the result of load_compile_cache to inform on whether to save_compile_cache.

Update `config.yaml`

Under requirements, add b10-transfer:

config.yaml

requirements:
  - b10-transfer

Update `model.py`

Import the library and use the two functions to speed up torch compilation time:

model.py

from b10_transfer import load_compile_cache, save_compile_cache, OperationStatus

class Model:
    def load(self):
        self.pipe = FluxPipeline.from_pretrained(
            self.model_name, torch_dtype=torch.bfloat16, token=self.hf_access_token
        ).to("cuda")

        # Try to load compile cache
        cache_loaded: OperationStatus = load_compile_cache()

        if cache_loaded == OperationStatus.ERROR:
            logging.info("Run in eager mode, skipping torch compile")
        else:
            logging.info("Compiling the model for performance optimization")
            self.pipe.transformer = torch.compile(
                self.pipe.transformer, mode="max-autotune-no-cudagraphs", dynamic=False
            )

            self.pipe.vae.decode = torch.compile(
                self.pipe.vae.decode, mode="max-autotune-no-cudagraphs", dynamic=False
            )

            seed = random.randint(0, MAX_SEED)
            generator = torch.Generator().manual_seed(seed)
            start_time = time.time()
            # Warmup the model with dummy prompts, also triggering compilation
            self.pipe(
                prompt="dummy prompt",
                prompt_2=None,
                guidance_scale=0.0,
                max_sequence_length=256,
                num_inference_steps=4,
                width=1024,
                height=1024,
                output_type="pil",
                generator=generator
            )

            end_time = time.time()

            logging.info(
                f"Warmup completed in {(end_time - start_time)} seconds. "
                "This is expected to take a few minutes on the first run."
            )

            if cache_loaded != OperationStatus.SUCCESS:
                # Save compile cache for future runs
                outcome: OperationStatus = save_compile_cache()

See the full example.

vLLM servers (CLI tool)

Use this whenever you enable compile options with vLLM (compiling is the default on vLLM V1). The CLI tool runs automatically: it loads the compile cache if you’ve saved one before, and saves it otherwise. Make two changes in config.yaml:

Add requirements

Under requirements, add b10-transfer:

config.yaml

requirements:
  - b10-transfer

Update start command

Under start command, add b10-compile-cache & right before the vllm serve call:

config.yaml

start_command: "... b10-compile-cache & vllm serve ..."

See the full example.

Advanced configuration

Parameter overrides

The torch compile caching library supports several environment variables for fine-tuning behavior in production environments:

Cache directory configuration

TORCHINDUCTOR_CACHE_DIR (optional)

Default: /tmp/torchinductor_<username>
Description: Directory where PyTorch stores compilation artifacts locally
Allowed prefixes: /tmp/, /cache/, ~/.cache
Usage: Set this if you need to customize where torch compilation artifacts are stored on the local filesystem

B10FS_CACHE_DIR (optional)

Default: Derived from b10cache mount point + /compile_cache
Description: Directory in b10cache where compilation artifacts are persisted across deployments
Usage: Typically doesn’t need to be changed as it’s automatically configured based on your b10cache setup

LOCAL_WORK_DIR (optional)

Default: /app
Description: Local working directory for temporary operations
Allowed prefixes: /app/, /tmp/, /cache/

Performance and resource limits

MAX_CACHE_SIZE_MB (optional)

Default: 1024 (1GB)
Cap: Limited by MAX_CACHE_SIZE_CAP_MB for safety
Description: Maximum size of a single cache archive in megabytes
Usage: Increase for larger models with extensive compilation artifacts, decrease to save storage

MAX_CONCURRENT_SAVES (optional)

Default: 50
Cap: Limited by MAX_CONCURRENT_SAVES_CAP for safety
Description: Maximum number of concurrent save operations allowed
Usage: Tune based on your deployment’s concurrency requirements and storage performance

Cleanup and maintenance

CLEANUP_LOCK_TIMEOUT_SECONDS (optional)

Default: 30
Cap: Limited by LOCK_TIMEOUT_CAP_SECONDS
Description: Timeout for cleaning up stale lock files, to prevent deadlocks. They may occur when a replica holding the lock crashes.
Usage: Decrease if you’re experiencing deadlocks in high-load scenarios

CLEANUP_INCOMPLETE_TIMEOUT_SECONDS (optional)

Default: 60
Cap: Limited by INCOMPLETE_TIMEOUT_CAP_SECONDS
Description: Timeout for cleaning up incomplete cache files
Usage: Increase for slower storage systems or larger cache files

Example configuration

config.yaml

environment_variables:
  MAX_CACHE_SIZE_MB: "2048"
  MAX_CONCURRENT_SAVES: "25"
  CLEANUP_LOCK_TIMEOUT_SECONDS: "45"

The defaults suit most workloads. Tune them if a model needs a larger cache archive or hits contention on concurrent saves.

To understand implementation details, read the PyTorch torch compile caching tutorial.

Overview

Get started

Model APIs

Inference

Development

Deployment

Engines

Frontier Gateway

Training

Organization

Observability

Troubleshooting

Runtime caching

How b10cache works

Organization scope: `/cache/org/`

Deployment scope: `/cache/model/`

Not persistent object storage

Torch compile caching

Implementation options

Truss models (`model.py`)

API reference

Implementation example

Update `config.yaml`

Update `model.py`

vLLM servers (CLI tool)

Add requirements

Update start command

Advanced configuration

Cache directory configuration

Performance and resource limits

Cleanup and maintenance

Example configuration

Next steps

BDN

Performance optimization

​How b10cache works

​Organization scope: /cache/org/

​Deployment scope: /cache/model/

​Not persistent object storage

​Torch compile caching

​Implementation options

​Truss models (model.py)

​API reference

​Implementation example

Update config.yaml

Update model.py

​vLLM servers (CLI tool)

​Add requirements

​Update start command

​Advanced configuration

​Cache directory configuration

​Performance and resource limits

​Cleanup and maintenance

​Example configuration

​Next steps

BDN

Performance optimization

How b10cache works

Organization scope: `/cache/org/`

Deployment scope: `/cache/model/`

Not persistent object storage

Torch compile caching

Implementation options

Truss models (`model.py`)

API reference

Implementation example

Update `config.yaml`

Update `model.py`

vLLM servers (CLI tool)

Add requirements

Update start command

Advanced configuration

Cache directory configuration

Performance and resource limits

Cleanup and maintenance

Example configuration

Next steps