Year of the Voice – Chapter 2: Let’s talk

This year is Home Assistant’s Year of the Voice. It’s our goal for 2023 to let users control Home Assistant in their own language. Today we’re presenting Chapter 2, our second milestone in building towards this goal.

In Chapter 1, we focused on intents – what the user wants to do. Today, the Home Assistant community has translated common smart home commands and responses into 45 languages, closing in on the 62 languages that Home Assistant supports.

For Chapter 2, we’ve expanded beyond text to include audio; specifically, turning audio (speech) into text, and text back into speech. With this functionality, Home Assistant’s Assist feature is now able to provide a full voice interface for users to interact with.

A voice assistant also needs hardware, so today we’re launching ESPHome support for Assist and, to top it off, we’re launching the World’s Most Private Voice Assistant. Keep reading to see what that entails.

To watch the video presentation of this blog post, including live demos, check the recording of our live stream.

Composing Voice Assistants

The new Assist Pipeline integration allows you to configure all components that make up a voice assistant in a single place.

For voice commands, pipelines start with audio. A speech-to-text system determines the words the user speaks, which are then forwarded to a conversation agent. The intent is extracted from the text by the agent and executed by Home Assistant. At this point, “turn on the light” would cause your light to turn on 💡. The last part of the pipeline is text-to-speech, where the agent’s response is spoken back to you. This may be a simple confirmation (“Turned on light”) or the answer to a question, such as “Which lights are on?”
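The flow described above can be sketched as three stages chained together. This is only an illustration: the `Pipeline` class, the function names, and the stub behaviour are hypothetical stand-ins, not Home Assistant's actual API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pipeline:
    """Hypothetical container for the three stages of a voice pipeline."""

    speech_to_text: Callable[[bytes], str]
    conversation_agent: Callable[[str], str]
    text_to_speech: Callable[[str], bytes]

    def run(self, audio: bytes) -> bytes:
        text = self.speech_to_text(audio)         # audio -> words
        response = self.conversation_agent(text)  # words -> intent -> reply
        return self.text_to_speech(response)      # reply -> audio


# Stub stages so the sketch runs without any real speech services.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio is already text


def fake_agent(text: str) -> str:
    if text == "turn on the light":
        return "Turned on light"
    return "Sorry, I didn't understand that"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")  # pretend these bytes are speech audio


assistant = Pipeline(fake_stt, fake_agent, fake_tts)
reply_audio = assistant.run(b"turn on the light")
print(reply_audio.decode("utf-8"))
```

Because each stage is just a callable, swapping in a different speech-to-text or text-to-speech service only means passing a different function, which is the mix-and-match idea behind pipelines.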


Screenshot of the new Assist configuration in Home Assistant.

With the new Voice Assistant settings page, users can create multiple assistants, mixing and matching voice services. Want a U.S. English assistant that responds with a British accent? No problem. What about a second assistant that listens for Dutch, German, or French voice commands? Or maybe you want to throw ChatGPT in the mix. Create as many assistants as you want, and use them from the Assist dialog as well as voice assistant hardware for Home Assistant.

Interacting with many different services means that many different things can go wrong. To help users figure out what went wrong, we’ve built extensive debug tooling for voice assistants into Home Assistant. You can always inspect the last 10 interactions per voice assistant.
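One simple way to keep only the most recent interactions is a bounded queue. The sketch below assumes this approach purely for illustration; the `DebugLog` class and its fields are hypothetical, and the post does not describe how Home Assistant actually stores debug data.

```python
from collections import deque

MAX_RUNS = 10  # the post mentions the last 10 interactions per assistant


class DebugLog:
    """Hypothetical per-assistant log that drops the oldest entry when full."""

    def __init__(self, max_runs: int = MAX_RUNS) -> None:
        self._runs = deque(maxlen=max_runs)

    def record(self, stage: str, detail: str) -> None:
        self._runs.append({"stage": stage, "detail": detail})

    def recent(self) -> list:
        return list(self._runs)  # oldest first, newest last


log = DebugLog()
for i in range(12):
    log.record("speech-to-text", f"interaction {i}")

print(len(log.recent()))
print(log.recent()[0]["detail"])
```

After twelve recorded interactions, only the ten newest remain, so the two oldest entries have been discarded.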


Screenshot of the new Assist debug tool.

Voice Assistant powered by Home Assistant Cloud

The Home Assistant Cloud subscription, besides an end-to-end encrypted remote connection, includes state-of-the-art speech-to-text and text-to-speech services. This allows your voice assistant to speak 130+ languages (including dialects like Peruvian Spanish) and is extremely fast to respond.

As a subscriber, you can directly start using voice in Home Assistant. You will not need any extra hardware or software to get started.

In addition to high quality speech-to-text and text-to-speech for your voice assistants, you will also be supporting the development of Home Assistant itself.

Join Home Assistant Cloud today

The fully local voice assistant

With Home Assistant you can be guaranteed two things: there will be options and one of those options will be local. With our voice assistant that’s no different.

Piper: our new model for high quality local text-to-speech

To make quality text-to-speech possible locally, we had to create our own text-to-speech system, optimized for running on a Raspberry Pi 4. It’s called Piper.

Piper logo

Piper uses modern machine learning algorithms for realistic-sounding speech but can still generate audio quickly. On a Raspberry Pi 4, Piper can generate 2 seconds of audio with only 1 second of processing time. More powerful CPUs, such as the Intel Core i5, can generate 17 seconds of audio in the same amount of time.

For audio samples, see the Piper website.
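The speed figures quoted above for Piper can be turned into a rough wait-time estimate. This is a back-of-the-envelope sketch: the 3-second reply length is an assumed value, the helper names are hypothetical, and real timings will vary with the voice and the text.

```python
def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Processing time needed per second of generated audio (lower is faster)."""
    return processing_seconds / audio_seconds


def estimated_wait(audio_seconds: float, rtf: float) -> float:
    """Approximate time to synthesize a reply of the given length."""
    return audio_seconds * rtf


# Figures from the post: 2 s of audio per 1 s on a Raspberry Pi 4,
# 17 s of audio per 1 s on an Intel Core i5.
pi4_rtf = real_time_factor(audio_seconds=2.0, processing_seconds=1.0)
i5_rtf = real_time_factor(audio_seconds=17.0, processing_seconds=1.0)

reply_length = 3.0  # assumed length of a spoken reply, in seconds
print(f"Raspberry Pi 4: {estimated_wait(reply_length, pi4_rtf):.2f} s")
print(f"Intel Core i5:  {estimated_wait(reply_length, i5_rtf):.2f} s")
```

Under these assumptions, a 3-second reply would take about 1.5 seconds to generate on the Raspberry Pi 4 and well under a quarter of a second on the Core i5.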

An add-on with Piper is available now for Home Assistant with over 40 voices across 18 languages, including: Catalan, Danish, German, English, Spanish, Finnish, French, Greek, Italian, Kazakh, Nepali, Dutch, Norwegian, Polish, Brazilian Portuguese, Ukrainian, Vietnamese, and Chinese. Voices for Piper are trained from open audio datasets, many of which come from free audiobooks read by volunteers. If you’re interested in contributing your voice, let us know!

You can also run Piper as a standalone Docker container.

Local speech-to-text with OpenAI Whisper

Whisper is an open source speech-to-text model created by OpenAI that runs locally. Since its release in 2022, Whisper has been improved by the open source community to run on less powerful hardware by projects such as whisper.cpp and faster-whisper. In less than a year of progress, Whisper is now capable of providing speech-to-text for dozens of languages on small servers and single-board computers!

An add-on using faster-whisper is available now for Home Assistant. On a Raspberry Pi 4, voice commands can take around 7 seconds to process with about 200 MB of RAM used. An Intel Core i5 CPU or better is capable of sub-second response times and can run larger (and more accurate) versions of Whisper.

You can also run Whisper as a standalone Docker container.

Wyoming: the voice assistant glue

Voice assistants share many common functions, such as speech-to-text, intent recognition, and text-to-speech. We created the Wyoming protocol to provide a small set of standard messages for talking to voice assistant services, including the ability to stream audio.

Wyoming allows developers to focus on the core of a voice service without having to commit to a specific networking stack like HTTP or MQTT. This protocol is compatible with the upcoming version 3.0 of Rhasspy, so both projects can share voice services.

With Wyoming, we’re trying to kickstart a more interoperable open voice ecosystem that makes sharing components across projects and platforms easy. Developers and scientists wishing to experiment with new voice technologies need only implement a small set of messages to integrate with other voice assistant projects.
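To illustrate what a small set of messages that does not depend on a particular networking stack can look like, here is a minimal sketch. It is not the Wyoming specification: the framing (a JSON header line followed by an optional binary payload), the message type names, and the helper functions are all assumptions made for this example. Consult the Wyoming documentation for the real format.

```python
import io
import json
from typing import BinaryIO, Optional, Tuple


def write_message(stream: BinaryIO, msg_type: str, data: dict,
                  payload: bytes = b"") -> None:
    """Write one message: a JSON header line, then the raw payload bytes."""
    header = {"type": msg_type, "data": data, "payload_length": len(payload)}
    stream.write(json.dumps(header).encode("utf-8") + b"\n")
    stream.write(payload)


def read_message(stream: BinaryIO) -> Optional[Tuple[str, dict, bytes]]:
    """Read one message, or return None when the stream is exhausted."""
    line = stream.readline()
    if not line:
        return None
    header = json.loads(line)
    payload = stream.read(header["payload_length"])
    return header["type"], header["data"], payload


# Any byte stream works: a socket file, a pipe, or an in-memory buffer.
buffer = io.BytesIO()
write_message(buffer, "audio-chunk", {"rate": 16000}, b"\x00\x01\x02\x03")
write_message(buffer, "transcript", {"text": "turn on the light"})
buffer.seek(0)

while (message := read_message(buffer)) is not None:
    msg_type, data, payload = message
    print(msg_type, data, len(payload))
```

The helpers only need an object with `write`, `readline`, and `read`, so the same messages could travel over TCP, standard input/output, or a test buffer without changing the service code.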

The Whisper and Piper add-ons mentioned above are integrated into Home Assistant via the new Wyoming integration. Wyoming services can also be run on other machines and still integrate into Home Assistant.

ESPHome powered voice assistants

ESPHome is our software for microcontrollers. Instead of programming, users define how their sensors are connected in a YAML file. ESPHome will read this file and generate and install software on your microcontroller to make this data accessible in Home Assistant.

Today we’re launching support for building voice assistants using ESPHome. Connect a microphone to your ESPHome device, and you can control your smart home with your voice. Include a speaker and the smart home will speak back.

We’ve been focusing on the M5STACK ATOM Echo for testing and development. For $13 it comes with a microphone and a speaker in a nice little box. We’ve created a tutorial to turn this device into a voice remote directly from your browser!

Tutorial: create a $13 voice remote for Home Assistant.

ESPHome Voice Assistant documentation.

World’s Most Private Voice Assistant

If you were designing the world’s most private voice assistant, what features would it have? To start, it should only listen when you’re ready to talk, rather than all the time. And when it responds, you should be the only one to hear it. This sounds strangely familiar…🤔

A phone! No, not the featureless rectangle you have in your pocket; an analog phone. These great creatures once ruled the Earth with twisty cords and unique looks to match your style. Analog phones have a familiar interface that’s hard to beat: pick up the phone to listen/speak and put it down when done.

With Home Assistant’s new Voice-over-IP integration, you can now use an “old school” phone to control your smart home!

By configuring off-hook autodial, your phone will automatically call Home Assistant when you pick it up. Speak your voice command or question, and listen for the response. The conversation will continue as long as you please: speak more commands/questions, or simply hang up. Assign a unique voice assistant/pipeline to each VoIP adapter, enabling dedicated phones for specific languages.

We’ve focused our initial efforts on supporting the Grandstream HT801 Voice-over-IP box. It works with any phone with an RJ11 connector, and connects directly to Home Assistant. There is no need for an extra server.

Tutorial: create your own World’s Most Private Voice Assistant

Give your voice assistant personality using the OpenAI integration.

Some links on this page are affiliate links and purchases using these links support the Home Assistant project.