When a Baseten Training job completes, Baseten automatically saves your checkpoints to Baseten storage. You can deploy any of them to an inference engine without downloading or re-uploading anything. Engine-Builder-LLM, BEI, and BIS-LLM all support this workflow.
For deploying weights from external cloud storage (GCS, S3, Azure), see Deploy from cloud storage.

Checkpoint reference

The repo and revision fields in checkpoint_repository specify which training project and checkpoint to deploy.
  • repo: Your Baseten Training project name.
  • revision: Which job and checkpoint to target. The following formats are supported:
revision value              Deploys
<job_id>/<checkpoint_name>  A specific checkpoint from a specific job (for example, abc123/checkpoint-100)
<job_id>                    The latest checkpoint from a specific job
latest or omitted           The latest checkpoint from the latest job
To look up checkpoint names for a job, run:
truss train get_checkpoint_urls --job-id=YOUR_TRAINING_JOB_ID
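The three revision formats look like this inside checkpoint_repository (the job ID abc123 and checkpoint name checkpoint-100 are illustrative placeholders):

```yaml
# A specific checkpoint from a specific job
revision: abc123/checkpoint-100

# The latest checkpoint from job abc123
revision: abc123

# The latest checkpoint from the latest job
# (equivalent to omitting the revision field entirely)
revision: latest
```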

LLM deployment

Use Engine-Builder-LLM or BIS-LLM to deploy a fine-tuned language model. Set base_model to decoder:
config.yaml
model_name: My Fine-Tuned LLM
resources:
  accelerator: H100
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: BASETEN_TRAINING
      repo: YOUR_TRAINING_PROJECT_NAME
      revision: YOUR_TRAINING_JOB_ID/checkpoint-100
Once deployed, call the model using the OpenAI-compatible chat completions endpoint:
curl -X POST https://model-YOUR_MODEL_ID.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "1", "messages": [{"role": "user", "content": "Hello"}]}'
See Call your model for full inference options including streaming and the OpenAI SDK.
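Because the endpoint is OpenAI-compatible, any OpenAI-style client can call it. As a minimal standard-library sketch, the curl request above maps to the following URL and JSON body (the model ID is a placeholder; the helper function is illustrative, not part of any SDK):

```python
import json


def chat_request(model_id: str, prompt: str) -> tuple[str, bytes]:
    """Build the URL and JSON body for Baseten's OpenAI-compatible
    chat completions endpoint, mirroring the curl example above."""
    url = (
        f"https://model-{model_id}.api.baseten.co"
        "/environments/production/sync/v1/chat/completions"
    )
    body = json.dumps({
        "model": "1",  # same placeholder value as in the curl example
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return url, body


url, body = chat_request("YOUR_MODEL_ID", "Hello")
```

Send `body` as the POST payload with the `Authorization: Api-Key ...` header, exactly as in the curl example.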

Embeddings deployment

Use BEI to deploy a fine-tuned embedding or reranker model. Use encoder_bert for BERT-based models (sentence-transformers, rerankers, classifiers) or encoder for causal embedding models:
config.yaml
model_name: My Fine-Tuned Embeddings
resources:
  accelerator: A10G
  use_gpu: true
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: BASETEN_TRAINING
      repo: YOUR_TRAINING_PROJECT_NAME
      revision: YOUR_TRAINING_JOB_ID/checkpoint-100
    max_num_tokens: 16384
  runtime:
    webserver_default_route: /v1/embeddings
Encoder models have specific requirements:
  • No tensor parallelism: Omit tensor_parallel_count or set it to 1.
  • Fast tokenizer required: Your checkpoint must include a tokenizer.json file. Models using only the legacy vocab.txt format aren’t supported.
  • Embedding model files: For sentence-transformer models, include modules.json and 1_Pooling/config.json in your checkpoint.
The webserver_default_route field sets the inference endpoint path:
  • /v1/embeddings: For embedding models.
  • /rerank: For rerankers.
  • /predict: For classifiers.
  • /predict_tokens: For token-level prediction.
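For example, a reranker deployment uses the same build block as above with only the default route swapped (project, job, and checkpoint names are placeholders):

```yaml
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: BASETEN_TRAINING
      repo: YOUR_TRAINING_PROJECT_NAME
      revision: YOUR_TRAINING_JOB_ID/checkpoint-100
  runtime:
    webserver_default_route: /rerank
```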
Once deployed, call the model using the embeddings endpoint:
curl -X POST https://model-YOUR_MODEL_ID.api.baseten.co/environments/production/sync/v1/embeddings \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "1", "input": "Your text here"}'
See Call your model for full inference options.
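Responses follow the OpenAI embeddings format, with each vector under data[*].embedding. A common next step is comparing two embeddings with cosine similarity; here is a standard-library sketch (the vectors below are stand-ins for real API output):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Stand-in vectors; real ones come from response["data"][i]["embedding"]
v1 = [0.1, 0.2, 0.3]
v2 = [0.1, 0.2, 0.3]
score = cosine_similarity(v1, v2)
```

Identical vectors score 1.0; orthogonal vectors score 0.0.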