I've fallen into the habit of staring at the 'models' page of Hugging Face like a hawk scanning the landscape for potential prey.

One day, I came across one of the lightest TTS models I've seen, called Kokoro-82M. I was astounded by the fidelity of the voices and immediately had to clone the project and get it running locally.

To my surprise, it ran smoothly and the voices were just as good as advertised. I was working on my other project, Compass, at the time, which aims to be a fully local AI assistant, and thought Kokoro would be perfect for it. However, around the same time, I had just received my Home Assistant Voice and found that the voices available for it were sub-par for 2025.

So, I decided that it would be more worthwhile to develop a TTS server that could work with Home Assistant and be used by anyone who knows how to spin up a Docker container.

The existing TTS solution for HASS (Home Assistant) is Piper, and fortunately it uses an open protocol called the Wyoming Protocol.

So the plan was simple:

  1. Create a dummy "hello world" project that sends a pre-recorded sound over the Wyoming Protocol to HASS when receiving a TTS request
  2. Implement the Kokoro generator to generate speech from the TTS request
  3. Ensure it works well (this one's the hardest!)

Hello world

There's a Python library for Wyoming which, unfortunately, is rather lacking in documentation. The GitHub page lays out the overall structure of the protocol, but doesn't go into much detail about the library itself.

So, I decided to clone the Piper project and slowly build up a separate project containing just the bare bones of what was necessary, since the Piper project has fancy multi-threading and is split neatly into separate files with scripts to run different parts.

After a few nights of tinkering, I finally got it working. Importantly, I had to make sure I warmed everything up before the first request - otherwise it would simply fail, which I guess is due to some timeout issue.
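
For a flavour of what the library looks like, here's a minimal sketch of such a handler (names and details are mine rather than lifted from Piper; a real server also has to answer Describe events with an Info block so HASS can discover it, which I've left out for brevity):

    import asyncio
    import wave
    from functools import partial

    from wyoming.audio import AudioChunk, AudioStart, AudioStop
    from wyoming.event import Event
    from wyoming.server import AsyncEventHandler, AsyncServer
    from wyoming.tts import Synthesize


    class HelloWorldHandler(AsyncEventHandler):
        """Replies to every TTS request with a pre-recorded sound."""

        async def handle_event(self, event: Event) -> bool:
            if not Synthesize.is_type(event.type):
                return True  # ignore everything that isn't a TTS request

            with wave.open("hello.wav", "rb") as wav:
                rate, width, channels = wav.getframerate(), wav.getsampwidth(), wav.getnchannels()

                # Announce the audio format, stream the samples, then signal the end
                await self.write_event(AudioStart(rate=rate, width=width, channels=channels).event())
                await self.write_event(AudioChunk(audio=wav.readframes(wav.getnframes()), rate=rate, width=width, channels=channels).event())
                await self.write_event(AudioStop().event())

            return True


    async def main() -> None:
        server = AsyncServer.from_uri("tcp://0.0.0.0:10200")
        await server.run(partial(HelloWorldHandler))


    asyncio.run(main())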

Kokoro

I implemented Kokoro using the official Kokoro library, which was very straightforward.

I ran it on a local machine and it worked - however, it was rather slow 🤔
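
For reference, the official library boils down to very little code - something like this, following the examples in its README (with one of its bundled voices):

    import soundfile as sf
    from kokoro import KPipeline

    # 'a' selects American English; 'af_heart' is one of the bundled voices
    pipeline = KPipeline(lang_code="a")

    # The pipeline is a generator, yielding one audio segment per chunk of text
    for i, (graphemes, phonemes, audio) in enumerate(pipeline("Hello from Kokoro!", voice="af_heart")):
        sf.write(f"segment_{i}.wav", audio, 24000)  # Kokoro outputs 24 kHz audio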

Sentence splitting

Yep - it's not enough to just pass the entire text to the TTS model. It needs to be split into sentences first.

Using regex, we can split the text into sentences based on punctuation marks:

    import re

    # Split on the whitespace that follows sentence-ending punctuation
    pattern = r'(?<=[.!?])\s+'
    sentences = re.split(pattern, text)

The pattern works in two steps:

  1. It looks for any of these punctuation marks:
  • Period (.)
  • Exclamation mark (!)
  • Question mark (?)
  2. After finding one of these marks, it matches the whitespace that follows and splits there - the lookbehind keeps the punctuation attached to its sentence.

So for example, if you had this text:

"It was a dark and stormy night. Do you remember the old days? The rain fell in torrents."

The regex would find the punctuation marks and split the text into the following sentences:

["It was a dark and stormy night.", "Do you remember the old days?", "The rain fell in torrents."]

With this additional step, the TTS model can generate each sentence individually and hence start sending audio chunks over after having processed only the first sentence. Much better!
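
Put together, the synthesis path ends up looking roughly like this (a sketch continuing the handler from earlier; `pipeline` is the KPipeline instance from the previous snippet, and the float-to-16-bit-PCM conversion is my assumption about how the samples go onto the wire):

    import re

    import numpy as np
    from wyoming.audio import AudioChunk, AudioStart, AudioStop
    from wyoming.server import AsyncEventHandler

    RATE, WIDTH, CHANNELS = 24000, 2, 1  # 16-bit mono at Kokoro's 24 kHz


    class KokoroHandler(AsyncEventHandler):
        async def synthesize(self, text: str) -> None:
            await self.write_event(AudioStart(rate=RATE, width=WIDTH, channels=CHANNELS).event())

            # Stream audio as soon as each sentence is ready, instead of
            # waiting for the whole text to be generated
            for sentence in re.split(r'(?<=[.!?])\s+', text):
                for _, _, audio in pipeline(sentence, voice="af_heart"):
                    pcm = (np.asarray(audio) * 32767).astype("<i2").tobytes()  # float -> 16-bit PCM
                    await self.write_event(AudioChunk(audio=pcm, rate=RATE, width=WIDTH, channels=CHANNELS).event())

            await self.write_event(AudioStop().event())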

Docker

I then decided to package it up into a Docker container and release it on Docker Hub; however, I found that the image took up 10 GB of space. This was due to the Kokoro library depending on PyTorch and other heavy libraries!

So, back to the drawing board.

Kokoro-ONNX

I found there was an alternative implementation of Kokoro called kokoro-onnx, which uses ONNX Runtime to run the model. This made for a much lighter image that worked great, while still being fast.
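
Swapping the backend was only a few lines of change; the kokoro-onnx API looks roughly like this (model and voice file names as in that library's examples):

    import soundfile as sf
    from kokoro_onnx import Kokoro

    # The ONNX model and voice embeddings are downloaded separately
    kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")

    samples, sample_rate = kokoro.create(
        "Hello from the lighter backend!", voice="af_sarah", speed=1.0, lang="en-us"
    )
    sf.write("hello.wav", samples, sample_rate)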

Finally, I was able to release the Docker image, taking up only 1.29 GB of space! Much better :)

I now have a fully local TTS solution that works well with Home Assistant and is much higher quality than the default Piper voices. Feel free to check it out on GitHub and give me a star if you like it!