Voice Recognition Devices

The (narrow) Artificial Intelligence (AI) in our smart home is now quite advanced. It is not covered in any real detail as part of this project, but you can read more about it here.

Our smart home AI has been implemented and exposed in a generic way, such that many types of user interface are possible. It accepts a request in JSON format and also responds with JSON. So far we have been using and testing mostly with text-based interfaces such as web browser, XMPP, SMS, Twitter DM, etc. It is also possible to 'wrap' the text-based interface with spoken voice input (using speech recognition) and text-to-speech, to provide a spoken response.
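As a rough illustration of this JSON-in, JSON-out contract, the Python sketch below builds a request and reads a response. The field names (`text`, `source`, `user`) and the sample reply are assumptions made purely for this example; they are not the actual schema used by our smart home.

```python
import json


def build_request(text, source="web", user=None):
    """Serialise a typed or transcribed request as JSON (hypothetical schema)."""
    payload = {"text": text, "source": source}
    if user is not None:
        # Only include the user when the interface knows who is asking.
        payload["user"] = user
    return json.dumps(payload)


def read_response(raw):
    """Extract the English reply from a JSON response (hypothetical schema)."""
    data = json.loads(raw)
    return data.get("text", "")


if __name__ == "__main__":
    print(build_request("Is the garage door open?", source="sms"))
    # A made-up response, standing in for whatever the smart home returns.
    print(read_response('{"text": "The garage door is closed."}'))
```

Because every interface only has to produce and consume this kind of envelope, a voice front end can reuse the same path as the existing text-based ones.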

We currently use voice control within our smart home via web apps and browser-based interfaces. We also have numerous devices in our home that support voice control, including an Amazon Echo (with remote control), an Echo Dot, numerous iOS devices with Siri, Android devices with Google Assistant, a PlayStation 4 with voice control and a Samsung HDTV that supports voice control via its 'smart remote'.

Our focus has now shifted to installing bespoke devices in our smart home, to better enable voice-based interaction with our own AI. With this in mind we have three types of voice interface we want to enable:

iOS app
An iOS app with a push-to-talk 'button'. This would allow an authenticated and personalised experience that would work within and outside of the home. The iOS device is assumed to be authenticated and unique for this application, allowing a fully personalised user experience.

voice control remote
A 'remote control' device with a push-to-talk 'button', much like the Amazon Echo remote. By default, this would allow anyone with physical access to interact with our smart home as an unknown or guest user. Each device would also have a unique identifier, so that it could be associated with a single person and thus provide a personalised user experience.
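One way to picture the identifier-to-person association is a simple lookup that falls back to a guest user. This is only a sketch: the identifier format, the `DEVICE_OWNERS` registry and the `resolve_user` helper are all hypothetical names invented for the example.

```python
GUEST = "guest"

# Hypothetical registry: remote control identifier -> household member.
DEVICE_OWNERS = {
    "remote-kitchen-01": "alice",
}


def resolve_user(device_id, owners=DEVICE_OWNERS):
    """Return the person linked to a device, or the guest user if it is unknown."""
    return owners.get(device_id, GUEST)


if __name__ == "__main__":
    print(resolve_user("remote-kitchen-01"))
    print(resolve_user("remote-unregistered"))
```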

ReSpeaker far field microphone array
A device with a far field microphone array (example pictured is the ReSpeaker) and a wake word chosen by us to activate the speech recognition. This would also allow anyone with physical access to interact with our smart home as an unknown or guest user.

We are not trying to duplicate a device like the Amazon Echo. The Amazon Echo has broad artificial intelligence that allows it to answer general queries and interact with some smart home devices in your home (in a very limited fashion), using voice. This project aims to deliver a number of voice interfaces to our narrow artificial intelligence, to query and control everything we have connected to our smart home. Instead of being a 'jack of all trades', it is a master of just one, our smart home.

All of the above interfaces must be able to send a JSON request to our smart home and receive the JSON response. They could use local or cloud-based 3rd party services to convert speech to text and text to speech, as these are areas where we don't have the skills and resources. Our focus is purely on the 'AI' behind the JSON request (in English) and the JSON response (also in English).

This project is purely about finding the best open-source and 3rd party services and hardware available to enable us to provide the best smart home user experience, in line with our smart home mission. We are interested in talking to any person, service, project or company that can assist us with this project.

Design Objectives

Best User Experience

In keeping with our smart home mission, the interfaces being developed for this project will enable the best possible smart home user experience. This is why we have specified three voice interfaces here, rather than just one.

Our approach also provides a much better user experience because our smart home has all the information, context and state to make the most intelligent response. Because it also has all of our smart home 'models', it knows about all of the individual elements and objects, types of objects, the relationships between these objects, etc. This allows it to answer much more abstract and also more complex questions.

Performance

A key aspect of a great user experience is good performance. This means a fast response time and also accurate speech recognition. It also means text to speech that is clear and easily understood.

Privacy

The main driver behind developing our own smart home is to improve our privacy. This project is no exception and only the spoken words will leave our home, to be converted to text by a secure 3rd party service. Ideally, all text-to-speech would be done within our home. This approach is radically different to that taken by most smart home voice control services currently on the market.

Technology Dependencies

A key objective of our approach is to remove any dependency on one or more suppliers and to avoid being limited by their approach to voice control. Having to specify a set of 'intents' in order to interact with our smart home is very limiting. Being able to pass on the raw spoken request means that our smart home can always respond to the best of its abilities.

Architecture

It is with the above design objectives in mind that we started building our own smart home, and this has dictated the architecture we are proposing for this project:

Basically, our smart home resides physically within the confines of our home, with all the 'intelligence' and decision making occurring locally and all sensor data and device data staying within the confines of our home. The only data leaving our home is the spoken request, which is passed to a 3rd party service to convert to text. This text is then passed to our smart home's Natural Language Processing (NLP) engine and AI, to decide what response is made. The response is English text and this is then sent to a 3rd party text-to-speech service, to be spoken out by the device, remote or app.

In an ideal world, the two network services would also be located within our home.

iOS App

The functionality of the iOS app is quite simple and involves repeating these steps:

  • Display a large push-to-talk button on screen.
  • On pushing the button, the app captures the spoken input and on release of the button sends it to a 3rd party service to be converted to text.
  • Display the English text returned by the speech-to-text service underneath the button.
  • Send a JSON request to the smart home server, containing the English text.
  • Receive the JSON response from the smart home server, which contains an English text response.
  • Use a 3rd party service to convert the text to speech and output this as audio.
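The steps above can be sketched as a single function that is handed the three services as plain callables. This is written in Python rather than as real iOS code, and everything in it is an assumption for illustration: the `handle_utterance` name, the callable signatures and the `{"text": ...}` JSON shape are hypothetical, and the stub services simply echo text so the sketch runs without any network access.

```python
import json
from typing import Callable


def handle_utterance(
    audio: bytes,
    speech_to_text: Callable[[bytes], str],
    send_to_smart_home: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
    show_text: Callable[[str], None] = print,
) -> bytes:
    """Run one push-to-talk cycle and return the audio to play back."""
    # Convert the captured audio to English text (3rd party service).
    text = speech_to_text(audio)
    # Show the recognised text, as the app would underneath the button.
    show_text(text)
    # Send the JSON request and read the JSON response (hypothetical schema).
    raw_response = send_to_smart_home(json.dumps({"text": text}))
    reply = json.loads(raw_response)["text"]
    # Convert the English reply to audio (3rd party service).
    return text_to_speech(reply)


if __name__ == "__main__":
    # Stub services standing in for the real speech and smart home endpoints.
    def fake_stt(audio: bytes) -> str:
        return audio.decode("utf-8")

    def fake_home(request: str) -> str:
        heard = json.loads(request)["text"]
        return json.dumps({"text": "You said: " + heard})

    def fake_tts(text: str) -> bytes:
        return text.encode("utf-8")

    print(handle_utterance(b"turn on the hall light", fake_stt, fake_home, fake_tts))
```

Passing the services in as parameters keeps the flow independent of any one supplier, so a speech-to-text or text-to-speech provider could be swapped, or moved inside the home, without changing the rest of the cycle.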

Remote Control

We have researched online to see if anyone has hacked an Amazon Echo remote to enable it to work with another service, but have not found anything yet. We are not aware of any other similar devices.

Far Field Device

There are numerous open-source projects and 3rd party developments in this space and we are currently investigating the best option, to see which one best meets our needs.
