Building a text-to-speech model with Kokoro
In this example, we go through a Truss that serves Kokoro, a frontier text-to-speech model.
Set up imports
We import necessary libraries and enable Hugging Face file transfers. We also download the NLTK tokenizer data.
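A minimal sketch of what this setup step might look like. The environment variable HF_HUB_ENABLE_HF_TRANSFER is the switch huggingface_hub reads to enable accelerated transfers; the NLTK resource name "punkt" and the helper name download_tokenizer_data are assumptions, since the page does not say which tokenizer data is fetched.

```python
import os

# Enable accelerated Hugging Face downloads. This must be set before
# huggingface_hub is imported, because the flag is read at import time.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"


def download_tokenizer_data(resource: str = "punkt") -> bool:
    """Fetch NLTK tokenizer data (hypothetical helper).

    The resource name "punkt" is an assumption; newer NLTK releases
    may expect a different resource name.
    """
    import nltk  # imported lazily so the sketch loads without NLTK installed

    return nltk.download(resource)
```

In a Truss, a call such as download_tokenizer_data() would typically run once at module import or at the start of load.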
Downloading model weights
We need to prepare model weights by doing the following:
- Create a directory for the model data
- Download the Kokoro model from Hugging Face into the created model data directory
- Add the model data directory to the system path
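The three steps above could be sketched as follows. The function name prepare_model_dir, the directory name "data", and the repository id "hexgrad/Kokoro-82M" are assumptions for illustration; snapshot_download is the huggingface_hub function for fetching a whole repository, and the download parameter exists only so the sketch can be exercised without network access.

```python
import sys
from pathlib import Path


def prepare_model_dir(data_dir="data", repo_id="hexgrad/Kokoro-82M", download=None):
    """Create the model data directory, download weights, extend sys.path.

    All names and defaults here are illustrative assumptions.
    """
    # 1. Create a directory for the model data.
    path = Path(data_dir).resolve()
    path.mkdir(parents=True, exist_ok=True)

    # 2. Download the model repository into that directory.
    if download is None:
        from huggingface_hub import snapshot_download  # lazy import

        download = snapshot_download
    download(repo_id=repo_id, local_dir=str(path))

    # 3. Put the directory on the import path, presumably so that Python
    #    modules shipped alongside the weights can be imported.
    if str(path) not in sys.path:
        sys.path.insert(0, str(path))
    return path
```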
Define the Model class and load function
In the load function of the Truss, we download and set up the model. This load function handles setting up the device, loading the model weights, and loading the default voice. We also define the available voices.
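A skeleton of such a Model class, as a sketch only. The voice names are illustrative, and load_weights and load_voice are hypothetical hooks standing in for the actual Kokoro loading code, which this page does not show; in a real Truss those calls would be written directly inside load.

```python
# Illustrative voice names; the real list depends on the Kokoro release used.
AVAILABLE_VOICES = ["af", "af_bella", "af_sarah"]


def pick_device() -> str:
    """Return "cuda" when a GPU is visible to torch, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


class Model:
    def __init__(self, load_weights=None, load_voice=None, **kwargs):
        # Truss passes its own keyword arguments (config, secrets, ...);
        # the two hooks are hypothetical and exist only for this sketch.
        self._load_weights = load_weights
        self._load_voice = load_voice
        self.device = None
        self.model = None
        self.default_voice_name = AVAILABLE_VOICES[0]
        self.default_voice = None

    def load(self):
        # Set up the device, then load the weights and the default voice.
        self.device = pick_device()
        self.model = self._load_weights(self.device)
        self.default_voice = self._load_voice(self.default_voice_name, self.device)
```

Keeping all of the slow setup in load means it runs once when the model server starts, not on every request.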
Define the predict function
The predict function contains the actual inference logic. The steps here are:
- Process input text and handle voice selection
- Chunk text for long inputs
- Generate audio
- Convert resulting audio to base64 and return it
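The four steps above can be sketched with the standard library alone. Everything model-specific is an assumption: synthesize is a hypothetical callable that returns float samples in [-1, 1] for one chunk, the input keys "text" and "voice" and the output key "base64" are guesses at the payload shape, the 24 kHz sample rate is assumed, and a regex sentence splitter stands in for the NLTK tokenizer.

```python
import base64
import io
import re
import struct
import wave

VOICES = ["af", "af_bella"]  # illustrative voice names
DEFAULT_VOICE = VOICES[0]


def chunk_text(text, max_chars=200):
    """Group sentences into chunks of at most max_chars where possible.

    A single sentence longer than max_chars is kept whole.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def samples_to_wav_base64(samples, sample_rate=24000):
    """Encode float samples as 16-bit mono WAV, then as a base64 string."""
    pcm = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return base64.b64encode(buf.getvalue()).decode("utf-8")


def predict(model_input, synthesize):
    """Process input, chunk the text, generate audio, return base64 WAV."""
    text = model_input["text"]
    voice = model_input.get("voice", DEFAULT_VOICE)
    if voice not in VOICES:
        voice = DEFAULT_VOICE  # fall back when the voice is unknown
    samples = []
    for chunk in chunk_text(text):
        samples.extend(synthesize(chunk, voice))
    return {"base64": samples_to_wav_base64(samples)}
```

Chunking keeps each synthesis call short, and concatenating the samples before encoding yields a single WAV file for the whole input.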
Setting up the config.yaml
Running Kokoro requires a handful of Python libraries, including torch, transformers, and others.
Configuring resources for Kokoro
Note that we need a T4 GPU to run this model.
System Packages
Running Kokoro requires espeak-ng to synthesize speech output.
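Taken together, the three configuration sections above might look roughly like the fragment below. This is a sketch, not the example's actual file: requirements, resources, and system_packages are standard Truss config.yaml keys, but only the packages named on this page are listed, and no version pins are shown because the page gives none.

```yaml
requirements:
  - torch
  - transformers
  # plus the other Python libraries the model needs (not listed on this page)
resources:
  accelerator: T4
  use_gpu: true
system_packages:
  - espeak-ng
```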
Deploy the model
Deploy the model as you would any other Truss by running truss push from the directory that contains config.yaml.
Run an inference
Use a Python script to call the deployed model and parse its response. In this example, the script sends text input to the model, decodes the base64 audio in the response, and saves it as a WAV file, output.wav.
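A client along these lines could look like the sketch below, which uses only the standard library. The request key "text", the response key "base64", and the function names are assumptions; the endpoint URL and API key must come from your own deployment, and the Api-Key authorization header follows the usual Baseten convention.

```python
import base64
import json
import urllib.request


def call_model(url, api_key, text):
    """POST the text to the deployed model and return the parsed JSON body."""
    request = urllib.request.Request(
        url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={
            "Authorization": f"Api-Key {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


def save_audio(response_body, path="output.wav", key="base64"):
    """Decode the base64 audio in the response and write it to a WAV file."""
    audio = base64.b64decode(response_body[key])
    with open(path, "wb") as f:
        f.write(audio)
    return len(audio)  # number of bytes written
```

A typical use would be save_audio(call_model(url, api_key, "Hello from Kokoro.")), with url and api_key taken from your deployment.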