When a Baseten Training job completes, Baseten automatically saves your checkpoints to Baseten storage. You can deploy any of them to an inference engine without downloading or re-uploading anything. Engine-Builder-LLM, BEI, and BIS-LLM all support this workflow.
For deploying weights from external cloud storage (GCS, S3, Azure), see Deploy from cloud storage.

Checkpoint reference

The repo and revision fields in checkpoint_repository specify which training project and checkpoint to deploy.
  • repo: Your Baseten Training project name.
  • revision: Which job and checkpoint to target. The following formats are supported:
revision value              Deploys
<job_id>/<checkpoint_name>  A specific checkpoint from a specific job (for example, abc123/checkpoint-100)
<job_id>                    The latest checkpoint from a specific job
latest or omitted           The latest checkpoint from the latest job
To look up checkpoint names for a job, run:
truss train get_checkpoint_urls --job-id=YOUR_TRAINING_JOB_ID
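The three revision formats look like this inside checkpoint_repository (the job ID abc123 and checkpoint name checkpoint-100 are illustrative placeholders):

```yaml
# A specific checkpoint from a specific job
revision: abc123/checkpoint-100

# The latest checkpoint from job abc123
revision: abc123

# The latest checkpoint from the latest job
# (equivalent to omitting the revision field entirely)
revision: latest
```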

LLM deployment

Use Engine-Builder-LLM or BIS-LLM to deploy a fine-tuned language model. Set base_model to decoder:
config.yaml
model_name: My Fine-Tuned LLM
resources:
  accelerator: H100
  use_gpu: true
trt_llm:
  build:
    base_model: decoder
    checkpoint_repository:
      source: BASETEN_TRAINING
      repo: YOUR_TRAINING_PROJECT_NAME
      revision: YOUR_TRAINING_JOB_ID/checkpoint-100
Once deployed, call the model using the OpenAI-compatible chat completions endpoint:
curl -X POST https://model-YOUR_MODEL_ID.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "1", "messages": [{"role": "user", "content": "Hello"}]}'
See Call your model for full inference options including streaming and the OpenAI SDK.
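Because the endpoint is OpenAI-compatible, any OpenAI-style client can call it. As a minimal standard-library sketch, the curl request above maps to the following URL and JSON body (the model ID is a placeholder; the helper function is illustrative, not part of any SDK):

```python
import json


def chat_request(model_id: str, prompt: str) -> tuple[str, bytes]:
    """Build the URL and JSON body for Baseten's OpenAI-compatible
    chat completions endpoint, mirroring the curl example above."""
    url = (
        f"https://model-{model_id}.api.baseten.co"
        "/environments/production/sync/v1/chat/completions"
    )
    body = json.dumps({
        "model": "1",  # same placeholder value as in the curl example
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return url, body


url, body = chat_request("YOUR_MODEL_ID", "Hello")
```

Send `body` as the POST payload with the `Authorization: Api-Key ...` header, exactly as in the curl example.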

Embeddings deployment

Use BEI to deploy a fine-tuned embedding or reranker model. Use encoder_bert for BERT-based models (sentence-transformers, rerankers, classifiers) or encoder for causal embedding models:
config.yaml
model_name: My Fine-Tuned Embeddings
resources:
  accelerator: A10G
  use_gpu: true
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: BASETEN_TRAINING
      repo: YOUR_TRAINING_PROJECT_NAME
      revision: YOUR_TRAINING_JOB_ID/checkpoint-100
    max_num_tokens: 16384
  runtime:
    webserver_default_route: /v1/embeddings
Encoder models have specific requirements:
  • No tensor parallelism: Omit tensor_parallel_count or set it to 1.
  • Fast tokenizer required: Your checkpoint must include a tokenizer.json file. Models using only the legacy vocab.txt format aren’t supported.
  • Embedding model files: For sentence-transformer models, include modules.json and 1_Pooling/config.json in your checkpoint.
The webserver_default_route field sets the inference endpoint path:
  • /v1/embeddings: For embedding models.
  • /rerank: For rerankers.
  • /predict: For classifiers.
  • /predict_tokens: For token-level prediction.
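For example, a reranker deployment uses the same build block as above with only the default route swapped (project, job, and checkpoint names are placeholders):

```yaml
trt_llm:
  build:
    base_model: encoder_bert
    checkpoint_repository:
      source: BASETEN_TRAINING
      repo: YOUR_TRAINING_PROJECT_NAME
      revision: YOUR_TRAINING_JOB_ID/checkpoint-100
  runtime:
    webserver_default_route: /rerank
```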
Once deployed, call the model using the embeddings endpoint:
curl -X POST https://model-YOUR_MODEL_ID.api.baseten.co/environments/production/sync/v1/embeddings \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "1", "input": "Your text here"}'
See Call your model for full inference options.
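Responses follow the OpenAI embeddings format, with each vector under data[*].embedding. A common next step is comparing two embeddings with cosine similarity; here is a standard-library sketch (the vectors below are stand-ins for real API output):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Stand-in vectors; real ones come from response["data"][i]["embedding"]
v1 = [0.1, 0.2, 0.3]
v2 = [0.1, 0.2, 0.3]
score = cosine_similarity(v1, v2)
```

Identical vectors score 1.0; orthogonal vectors score 0.0.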