To advance clear discussion of multi-agent product design, it is important to use a consistent, shared vocabulary. The list included here is not meant to be comprehensive, but to promote a widely adopted set of standard terms.
Any voice agent currently capable of responding to a customer invocation.
The digital “person” that the customer interacts with through conversation (turn taking). An agent has its own brand (voice, personality), method of invocation (custom wake word, Action button), and one or more unique capabilities.
The process of determining which voice agent will participate in an interaction.
The action of clearly ascribing a response to the Agent responsible for providing it. (See also Content Attribution.)
When customers ask something of an agent that cannot directly fulfill their request, the agent can summon a second agent to assist. No data or context is passed between agents during a transfer; the user repeats their request directly to the second agent, but does not have to invoke the second agent using its wake word.
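A minimal sketch of how a device might model such a transfer, using hypothetical interface and function names rather than any particular SDK. The key point it illustrates is that the summoned agent receives no request payload; it simply starts listening so the customer can repeat the request.

```typescript
// Hypothetical interfaces for illustration only; not part of any specific SDK.
interface Agent {
  id: string;
  canFulfill(request: string): boolean;
  startListening(): void;          // opens this agent's microphone
  respond(utterance: string): void;
}

function handleRequest(request: string, current: Agent, other: Agent): void {
  if (current.canFulfill(request)) {
    current.respond(request);
    return;
  }
  // Transfer: summon the second agent, but pass no data or context.
  console.log(`${current.id}: "${other.id} can help with that."`); // optional landmark
  other.startListening(); // the customer repeats the request to the second agent
}
```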
A method of Agent Arbitration whereby a service or mechanism selects which agent will participate in an interaction based on an assessment of all relevant interaction factors.
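An illustrative sketch of such an arbitration mechanism, assuming hypothetical factor and agent shapes. It shows one plausible ordering of interaction factors (detected wake word, an in-progress multi-turn interaction, then intent domain); real arbitration policies will weigh factors differently.

```typescript
// Illustrative only: score candidate agents against interaction factors.
interface InteractionFactors {
  wakeWordHeard?: string;        // which wake word, if any, was detected
  intentDomain?: string;         // e.g. "music", "navigation"
  activeMultiTurnAgent?: string; // agent already holding the interaction
}

interface CandidateAgent {
  id: string;
  wakeWords: string[];
  domains: string[];
}

function arbitrate(
  factors: InteractionFactors,
  agents: CandidateAgent[]
): CandidateAgent | undefined {
  // 1. A detected wake word is the strongest signal.
  const wakeWord = factors.wakeWordHeard;
  if (wakeWord) {
    const byWakeWord = agents.find(a => a.wakeWords.includes(wakeWord));
    if (byWakeWord) return byWakeWord;
  }
  // 2. An agent already in a multi-turn interaction keeps the floor.
  if (factors.activeMultiTurnAgent) {
    const active = agents.find(a => a.id === factors.activeMultiTurnAgent);
    if (active) return active;
  }
  // 3. Otherwise, fall back to an agent that supports the intent's domain.
  const domain = factors.intentDomain;
  return domain ? agents.find(a => a.domains.includes(domain)) : undefined;
}
```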
The stages of interaction with a customer that a voice agent can enter. At a minimum these comprise the Listening, Thinking, and Speaking states, and they can also include states such as Do Not Disturb or Notifications Pending. (See also Attention System.)
The combination of all visual and audible cues presented to a customer to communicate a voice agent’s attention state. The attention system is typically displayed through animated patterns of lights or colors, along with synchronized sound cues. (See also Attention States.)
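A sketch of how the attention states and attention system might be modeled together: a state enumeration plus a mapping from each state to synchronized light and sound cues. The state names follow the definitions above; the cue identifiers are placeholders.

```typescript
// Illustrative attention-state model and cue mapping (cue names are assumptions).
enum AttentionState {
  Idle = "IDLE",
  Listening = "LISTENING",
  Thinking = "THINKING",
  Speaking = "SPEAKING",
  DoNotDisturb = "DO_NOT_DISTURB",
  NotificationsPending = "NOTIFICATIONS_PENDING",
}

interface AttentionCue {
  lightPattern: string; // e.g. an animation id for an LED ring
  soundCue?: string;    // e.g. an earcon played on entering the state
}

// The attention system renders each state as synchronized visual and audible cues.
const attentionSystem: Record<AttentionState, AttentionCue> = {
  [AttentionState.Idle]: { lightPattern: "off" },
  [AttentionState.Listening]: { lightPattern: "solid-blue", soundCue: "wake.wav" },
  [AttentionState.Thinking]: { lightPattern: "pulse-blue" },
  [AttentionState.Speaking]: { lightPattern: "sweep-blue" },
  [AttentionState.DoNotDisturb]: { lightPattern: "dim-purple" },
  [AttentionState.NotificationsPending]: { lightPattern: "blink-yellow", soundCue: "chime.wav" },
};
```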
Server-resident infrastructure supporting an agent’s ASR, NLU, NLG, TTS, and other NN- and ML-based interactions. In a multi-agent scenario, multiple agents may use a single Cloud AI.
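A small configuration sketch of the multi-agent case, where several agents on one device are backed by the same Cloud AI. The endpoint URL and agent names are placeholders, not real services.

```typescript
// Illustrative only: two agents on the same device sharing one Cloud AI.
interface CloudAiConfig {
  endpoint: string;                               // server-resident speech/language services
  services: Array<"asr" | "nlu" | "nlg" | "tts">;
}

const sharedCloudAi: CloudAiConfig = {
  endpoint: "https://cloud-ai.example.com",       // placeholder endpoint
  services: ["asr", "nlu", "nlg", "tts"],
};

const agentsOnDevice = [
  { id: "agent-a", cloudAi: sharedCloudAi },
  { id: "agent-b", cloudAi: sharedCloudAi },
];
```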
Attribution informs the customer of the source of the information or content they are getting from an agent. From a customer perspective, this enhances the clarity and credibility of the information. For brands, it provides recognition for the services they provide. Attribution can be given either verbally (“According to…”) or visually. (See also Agent Attribution.)
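A one-function sketch of the verbal form of content attribution described above; the source name is a placeholder.

```typescript
// Illustrative only: prepend a verbal attribution so the customer knows the
// source of the content (source name is a placeholder).
function attributeContent(source: string, answer: string): string {
  return `According to ${source}, ${answer}`;
}

// e.g. attributeContent("Example Weather Service", "it will rain tomorrow.")
//      -> "According to Example Weather Service, it will rain tomorrow."
```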
The specific action a user wishes to perform: the command derived from the range of natural-language utterances users may speak to convey their intention. The capability needed to respond to a specific intent may determine which agent in a multi-agent scenario responds to the customer.
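A sketch of that relationship: several natural-language utterances resolve to one intent, and the intent's required capability determines which agent responds. Intent, capability, and agent names are illustrative assumptions.

```typescript
// Illustrative only: many utterances map to one intent; the intent's required
// capability determines which agent in a multi-agent scenario responds.
interface Intent {
  name: string;               // e.g. "PlayMusic"
  requiredCapability: string; // e.g. "music-playback"
}

const utteranceToIntent: Record<string, Intent> = {
  "play some jazz": { name: "PlayMusic", requiredCapability: "music-playback" },
  "put on jazz music": { name: "PlayMusic", requiredCapability: "music-playback" },
  "i want to hear jazz": { name: "PlayMusic", requiredCapability: "music-playback" },
};

interface RoutableAgent {
  id: string;
  capabilities: string[];
}

// Pick an agent whose capabilities cover the intent.
function routeIntent(intent: Intent, agents: RoutableAgent[]): RoutableAgent | undefined {
  return agents.find(a => a.capabilities.includes(intent.requiredCapability));
}
```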
The method whereby an agent or capability is initiated. This could be a spoken wake word or button press from the customer, or a contextual trigger such as a timer, geofence, or other circumstantial event. Each voice agent will generally have its own unique wake word or other invocation method.
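A sketch modeling the invocation methods listed above as a simple tagged union; the field names and trigger values are assumptions made for illustration.

```typescript
// Illustrative: the different ways an agent or capability can be initiated.
type Invocation =
  | { kind: "wake-word"; phrase: string }                       // spoken invocation
  | { kind: "button"; button: "tap" | "hold" }                  // push-to-talk
  | { kind: "contextual"; trigger: "timer" | "geofence" | "other" };

function describeInvocation(inv: Invocation): string {
  switch (inv.kind) {
    case "wake-word":
      return `Invoked by wake word "${inv.phrase}"`;
    case "button":
      return `Invoked by ${inv.button}-to-talk button`;
    case "contextual":
      return `Invoked by contextual trigger: ${inv.trigger}`;
  }
}
```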
The practice of prepending a response with a sound or phrase to orient the customer to where they are in the experience. The landmark can be spoken by either the sending or the receiving agent (e.g., “Alexa can help with that” or “It’s Alexa...”). Landmarking attributes the response to the agent handling the request and clarifies for the customer which agent is responding.
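A minimal sketch of a spoken landmark being prepended to a response, covering both the sending-agent and receiving-agent phrasings from the definition above; the function name is an assumption.

```typescript
// Illustrative only: prepend a short landmark so the customer knows which
// agent is handling the request.
function landmark(agentName: string, response: string, spokenBySendingAgent: boolean): string {
  return spokenBySendingAgent
    ? `${agentName} can help with that. ${response}` // spoken by the sending agent
    : `It's ${agentName}. ${response}`;              // spoken by the receiving agent
}

// e.g. landmark("Alexa", "Your package arrives tomorrow.", false)
//      -> "It's Alexa. Your package arrives tomorrow."
```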
A person may use an agent across multiple stationary locations or on the go.
A product designed to support multiple voice agents.
When two or more wake words are able to invoke voice agents on the same device at all times.
An interaction between a customer and an agent that includes more than one utterance or response. It is often used by an agent or domain to ask for additional information from the customer, or to continue an experience. It is characterized by not requiring the customer to invoke the agent beyond the initial start of the interaction.
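A sketch of a multi-turn session object that captures the defining property above: the agent is invoked once at the start and can then elicit further information without the customer repeating the wake word. Class and method names are assumptions.

```typescript
// Illustrative multi-turn interaction: invocation happens once, then the agent
// can keep the exchange open to ask follow-up questions.
class MultiTurnSession {
  private open = false;

  constructor(private readonly agentId: string) {}

  start(initialUtterance: string): void {
    this.open = true; // the only point at which the customer invokes the agent
    console.log(`${this.agentId} handling: "${initialUtterance}"`);
  }

  // The agent asks for additional information and re-opens the microphone;
  // no new wake word is needed for the customer's reply.
  elicit(prompt: string): void {
    if (!this.open) throw new Error("Session not started");
    console.log(`${this.agentId}: ${prompt}`);
  }

  end(): void {
    this.open = false;
  }
}
```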
The characteristics, or personality, of an agent, including its name, wake word, voice, accent, and visual appearance. Each agent has its own persona, and a single agent may also offer a range of identifiable or selectable personas.
An umbrella term that covers invocation of an agent by means of a physical or on-screen affordance such as a button. Push-to-talk includes both tap-to-talk and hold-to-talk implementations. Different agents may be invoked by a single, “overloaded” push-to-talk affordance.
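A sketch of how a single, overloaded push-to-talk affordance might be handled, distinguishing tap-to-talk from hold-to-talk by press duration and routing to whichever agent is configured for the button. The threshold value and names are assumptions.

```typescript
// Illustrative handling of one "overloaded" push-to-talk button.
type ButtonEvent = { pressedMs: number };

const HOLD_THRESHOLD_MS = 500; // assumed threshold for illustration

function classifyPress(event: ButtonEvent): "tap-to-talk" | "hold-to-talk" {
  return event.pressedMs >= HOLD_THRESHOLD_MS ? "hold-to-talk" : "tap-to-talk";
}

// The same physical affordance is routed to whichever agent the customer
// has configured for push-to-talk on this device.
function invokeViaPushToTalk(event: ButtonEvent, configuredAgentId: string): void {
  const mode = classifyPress(event);
  console.log(`Invoking ${configuredAgentId} via ${mode}`);
}
```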
A word or phrase in a purposeful, human-initiated utterance that can be detected to allow the associated agent to start acting on the portion of the utterance that follows the wake word. Otherwise known as “named invocation.” A multi-agent device will recognize different wake words for different agents or personas.
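A sketch of wake-word routing on a multi-agent device: each recognized wake word maps to the agent it invokes, and that agent acts on the rest of the utterance. The wake words and agent identifiers are placeholders.

```typescript
// Illustrative only: map detected wake words to agents on a multi-agent device.
const wakeWordToAgent: Record<string, string> = {
  "alexa": "agent-a",
  "hey assistant": "agent-b", // placeholder names, not real products
};

function onUtterance(utterance: string): { agentId: string; payload: string } | undefined {
  const lower = utterance.toLowerCase();
  for (const [wakeWord, agentId] of Object.entries(wakeWordToAgent)) {
    if (lower.startsWith(wakeWord)) {
      // The agent acts on the portion of the utterance that follows the wake word.
      return { agentId, payload: utterance.slice(wakeWord.length).trim() };
    }
  }
  return undefined; // no wake word detected; no agent is invoked
}
```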