Devices with multiple simultaneously available voice agents should provide transparent, predictable behaviors and experiences to customers.
The attention system on a device is an important factor in building and maintaining customer trust in your device. Just as with single-agent products, multi-agent products should clearly communicate the current attention state to customers. Customers should be able to easily understand the state of the device, or of any active agent on it, and recognize when that state changes. This section describes recommendations for attention system behaviors in multi-agent interactions.
All coexisting agents should convey at least the three core attention states:

Listening: The agent has been invoked, either by voice or touch, and is recording a customer utterance.

Speaking: The agent is playing a voice reply, or otherwise delivering a response to the customer. (Optional for non-voice responses or for devices replying using visuals on a screen.)

Thinking: The agent or device is processing a request or waiting for a reply from the voice service. (This may not apply when, for example, local agents have no perceived latency between Listening and Speaking.)
Visual and sound cues for the three core attention states should be clear and easy for customers to understand for all active agents, even if some cues are unique to a particular agent or device.
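As an illustration only, the attention states and per-agent cues described above could be modeled as a small state tracker. This is a minimal sketch, not part of any specification; the `AttentionState` and `AgentAttention` names, and the cue strings, are hypothetical.

```python
from enum import Enum, auto

class AttentionState(Enum):
    """The three core attention states, plus an idle default."""
    IDLE = auto()       # not one of the three core states; included for completeness
    LISTENING = auto()  # invoked by voice or touch, recording a customer utterance
    THINKING = auto()   # processing a request or awaiting the voice service
    SPEAKING = auto()   # delivering a voice (or other) response

class AgentAttention:
    """Tracks one agent's attention state and the cues it surfaces."""

    def __init__(self, name, cues):
        self.name = name
        self.state = AttentionState.IDLE
        # cues maps each state to a (visual cue, sound cue) pair; cues may be
        # unique per agent as long as they remain clear to the customer.
        self.cues = cues

    def transition(self, new_state):
        """Enter a new attention state and return the cues to present."""
        self.state = new_state
        return self.cues.get(new_state, ("no visual cue", "no sound cue"))
```

A device might instantiate one `AgentAttention` per coexisting agent, e.g. `agent.transition(AttentionState.LISTENING)` returning that agent's listening cues, so that each state change is always paired with a customer-visible cue.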
Securing a device with multiple simultaneous voice agents requires a multifaceted approach at each step of the development process and beyond. Device and agent makers should evaluate potential threat scenarios by performing threat modeling for all features and use cases of their device. The following list presents general security guidelines.