As soon as we started adding slave processors to our Home Control System (HCS), we realised that we would need a service to monitor their health and availability, both to maintain service levels and to confirm that they remain connected to our smart home network. This applies both to complex slave devices based around a Raspberry Pi and to more basic ones based around smaller Arduino processors.
Our Home Control System (HCS) runs a service that handles watchdog events from all the slave processors. It also runs regular checks and can handle any warnings or errors reported back as events. It maintains a database of all slave devices and information about each one, such as its IP address, name, etc. This is used to ensure events received are valid. The security and encryption aspects of the watchdog service are not covered here to keep things simple.
We have configured each slave to send a regular 'heartbeat' event of the form:
<device ID> is a unique identifier for each slave.
<sequence> is a pseudo-random number that can be predicted from the previous number received. This enables the watchdog service to spot missing heartbeats and to check that they arrive in the correct (secure) sequence. Out-of-sequence messages and the subsequent synchronisation process are all reported and logged.
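To illustrate how a predictable sequence lets the receiver spot missing heartbeats, here is a minimal Java sketch. The generator is an assumption: we use a simple linear congruential generator with hypothetical constants, which is not cryptographically secure and is not necessarily what our real service uses. The class name SequenceTracker and the look-ahead limit are also purely illustrative.

```java
/** Sketch only: tracks the predictable heartbeat sequence for one device. */
final class SequenceTracker {
    // Hypothetical generator constants (a textbook LCG), assumed for illustration.
    private static final long MULTIPLIER = 1103515245L;
    private static final long INCREMENT = 12345L;
    private static final long MODULUS = 1L << 31;

    private long lastSeen;

    SequenceTracker(long seed) {
        this.lastSeen = Math.floorMod(seed, MODULUS);
    }

    /** Predicts the value that should follow the given sequence value. */
    static long next(long previous) {
        return Math.floorMod(MULTIPLIER * previous + INCREMENT, MODULUS);
    }

    /**
     * Checks a received sequence value against the expected one.
     * Returns the number of heartbeats that were skipped (0 means in sequence),
     * or -1 if the value is not found within maxLookahead skipped steps,
     * in which case the tracker state is left unchanged.
     */
    int accept(long received, int maxLookahead) {
        long candidate = lastSeen;
        for (int missed = 0; missed <= maxLookahead; missed++) {
            candidate = next(candidate);
            if (candidate == received) {
                lastSeen = received;
                return missed;
            }
        }
        return -1; // out of sequence: the caller would log this and resynchronise
    }
}
```

A return value greater than zero would be logged as missed heartbeats, while -1 would trigger whatever resynchronisation process is in place.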
The period of time between heartbeat events is configurable for each slave (e.g. 60 seconds). If the watchdog service doesn't see a heartbeat event within the expected window, it reports the device as missing and generates an alert. The exact period used depends on what service(s) the slave processor provides and how critical it is to the overall operation of our smart home. Our HCS is designed to handle millions of events each day, so heartbeats present no significant additional load in the wider scheme of things.
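The timeout check can be sketched in Java as follows. This is a simplified illustration rather than our actual service: the class name HeartbeatMonitor, the grace period and the device names used with it are all assumptions. The current time is passed in explicitly so the logic is easy to test.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: flags devices whose heartbeat is overdue. */
final class HeartbeatMonitor {
    private final Map<String, Duration> expectedPeriod = new HashMap<>();
    private final Map<String, Instant> lastHeartbeat = new HashMap<>();
    private final Duration grace; // assumed tolerance for network jitter

    HeartbeatMonitor(Duration grace) {
        this.grace = grace;
    }

    /** Registers a device with its own configurable heartbeat period. */
    void register(String deviceId, Duration period, Instant now) {
        expectedPeriod.put(deviceId, period);
        lastHeartbeat.put(deviceId, now);
    }

    /** Records a heartbeat; events from unregistered devices are ignored. */
    void recordHeartbeat(String deviceId, Instant now) {
        if (expectedPeriod.containsKey(deviceId)) {
            lastHeartbeat.put(deviceId, now);
        }
    }

    /** Returns the devices whose last heartbeat is older than period + grace. */
    List<String> findMissing(Instant now) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, Duration> entry : expectedPeriod.entrySet()) {
            Instant deadline = lastHeartbeat.get(entry.getKey())
                    .plus(entry.getValue())
                    .plus(grace);
            if (now.isAfter(deadline)) {
                missing.add(entry.getKey());
            }
        }
        return missing;
    }
}
```

A regular scheduled check would call findMissing with the current time and raise an alert for each device returned.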
Each slave can report errors back to the Home Control System (HCS) via an event of the form:
Watchdog,<device ID>,Error,<error code/message>
A well-defined and consistent set of error codes is used throughout our Home Control System (HCS).
Each slave can report warnings back to the Home Control System (HCS) via an event of the form:
Watchdog,<device ID>,Warning,<warning code/message>
This is typically used to report non-critical conditions that are still worth knowing about. For example, this could be an unexpected value from a sensor that has been ignored, or a slow response. This is particularly useful for debugging and monitoring reliability.
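Parsing the error and warning events shown above might look like the following Java sketch. It assumes the fields are comma-separated exactly as written, that the message itself may contain commas, and that unknown device IDs are rejected using the set of known devices. The class and method names are hypothetical, and heartbeat events are deliberately not handled here.

```java
import java.util.Optional;
import java.util.Set;

/** Sketch only: a parsed "Watchdog,<device ID>,Error|Warning,<message>" event. */
final class WatchdogEvent {
    enum Severity { ERROR, WARNING }

    final String deviceId;
    final Severity severity;
    final String detail;

    private WatchdogEvent(String deviceId, Severity severity, String detail) {
        this.deviceId = deviceId;
        this.severity = severity;
        this.detail = detail;
    }

    /** Returns the parsed event, or empty if the line is malformed or the device is unknown. */
    static Optional<WatchdogEvent> parse(String line, Set<String> knownDevices) {
        if (line == null) {
            return Optional.empty();
        }
        // Limit of 4 keeps any commas inside the message intact.
        String[] parts = line.trim().split(",", 4);
        if (parts.length != 4 || !parts[0].equals("Watchdog")) {
            return Optional.empty();
        }
        if (!knownDevices.contains(parts[1])) {
            return Optional.empty(); // not in the device database
        }
        Severity severity;
        switch (parts[2]) {
            case "Error":
                severity = Severity.ERROR;
                break;
            case "Warning":
                severity = Severity.WARNING;
                break;
            default:
                return Optional.empty();
        }
        return Optional.of(new WatchdogEvent(parts[1], severity, parts[3]));
    }
}
```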
Because our Home Control System (HCS) is a hybrid technology solution, we use several languages to implement our watchdog service on the many slave processors in our smart home. The majority use Java, though, and we have written a single static Java class that is common to both our HCS and the slaves (to simplify code maintenance).
The watchdog service gets its configuration from a JSON file. This makes it very easy to configure and very flexible and extensible.
The main watchdog service is itself written as a Java class. Slave processors use their own Java class to generate heartbeat events and report errors and warnings.
3rd party devices like our Vera Lite Z-Wave gateway also send heartbeat events to the watchdog service. The heartbeat counter is defined in the 'Startup Lua' file and a scheduled scene sends the heartbeat and updates the sequence. This has proved very useful in monitoring the reliability of our Vera Lite. It has also shown that the Lua engine is restarted when significant changes are made to scenes, etc.
The key capability enabled by the watchdog service is timely notification when a slave processor has failed or become disconnected from our smart home network. We employ this approach for every slave processor deployed in our smart home, and the same approach could also be used with 3rd party hardware and services.
Our Home Control System (HCS) watchdog service also keeps track of the number of slave processors that have failed and/or are unreachable. If this number crosses a defined threshold, then it assumes something more serious has happened and a different set of actions and alerts result.
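The escalation logic can be sketched in Java along these lines. The class name, the alert levels and the choice of "at or above the threshold" as the trigger are assumptions made for illustration, not a description of our exact implementation.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch only: escalates when too many devices are failed or unreachable. */
final class FailureEscalation {
    enum Level { NORMAL, DEVICE_ALERT, SYSTEM_ALERT }

    private final Set<String> failedDevices = new HashSet<>();
    private final int threshold; // assumed: escalate at or above this count

    FailureEscalation(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        this.threshold = threshold;
    }

    /** Marks a device as failed or unreachable and returns the resulting level. */
    Level markFailed(String deviceId) {
        failedDevices.add(deviceId); // a set, so repeated reports count once
        return currentLevel();
    }

    /** Marks a device as healthy again and returns the resulting level. */
    Level markRecovered(String deviceId) {
        failedDevices.remove(deviceId);
        return currentLevel();
    }

    Level currentLevel() {
        if (failedDevices.isEmpty()) {
            return Level.NORMAL;
        }
        return failedDevices.size() >= threshold ? Level.SYSTEM_ALERT : Level.DEVICE_ALERT;
    }
}
```

Using a set rather than a counter means a device that is reported missing several times is only counted once towards the threshold.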