Get Started with Alexa Voice Service Device SDK
The Alexa Voice Service (AVS) Device SDK provides you with a set of C++ libraries to build an Alexa Built-in product. With these libraries, your device has direct access to cloud-based Alexa capabilities to receive voice and visual responses instantly. You can create a wide range of devices—a smartwatch, speaker, headphones, smart TV, set-top box, soundbar, AV receiver, home theater hub, or gaming console—the choice is yours.
The SDK is modular and abstract. It provides separate components to handle necessary Alexa functionality, including processing audio, maintaining persistent connections, and managing Alexa interactions. Each component exposes Alexa APIs so you can customize your device integrations as needed.
The SDK also includes two sample applications to use as the basis for your device, and to test interactions before integration:
- AVS Device SDK Console Sample Application – Intended primarily for learning and for basic, headless device integrations.
- AVS Device SDK IPC Server Sample Application – Intended for smart screen device makers implementing the full set of available Alexa multimodal features.
The following diagram illustrates components of the SDK and how data flows between them.
The green boxes are official components of the SDK. They include the following items:
- Audio Input Processor (AIP)
- Shared Data Stream (SDS)
- Alexa Communication Library (ACL)
- Alexa Directive Sequencer Library (ADSL)
- Activity Focus Manager Library (AFML)
- Capability Agent
The white and blue boxes aren't official components and depend on external libraries. These include the following items:
- Audio Signal Processor (ASP)
- Wake Word Engine (WWE)
- Media Player
For general information about Alexa and client interaction, see the Interaction Model.
The following list shows the interaction sequence with the SDK (the process might vary if you've added or removed any components):
- You ask a question, "Alexa, what’s the weather?"
- The microphone captures the audio and writes it to the SDS.
- The WWE is always monitoring the SDS. When the WWE detects the wake word Alexa, it sends the audio to the AIP.
- The AIP sends a SpeechRecognizer event to AVS using the ACL.
- AVS processes the event and sends the appropriate directive back down through the ACL. The SDS then picks up the directive and sends it to the ADSL.
- The ADSL examines the header of the payload and determines what Capability Agent it must call.
- When the Capability Agent activates, it requests focus from the AFML.
- The Media Player plays the audio response. For this example, Alexa responds with "The weather is nine degrees and cloudy with a chance of rain."
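The wake-word gating in the sequence above can be sketched in a few lines. This is purely a conceptual illustration, not SDK code: audio is modeled as a plain string and `detectWakeWord` is a hypothetical helper.

```cpp
#include <string>
#include <cstddef>

// Illustrative wake-word gating: the engine scans the shared stream and
// only the audio that follows the wake word is handed to the AIP.
// Hypothetical helper: returns the index just past the wake word, or npos.
std::size_t detectWakeWord(const std::string& stream, const std::string& wakeWord) {
    std::size_t pos = stream.find(wakeWord);
    return pos == std::string::npos ? std::string::npos : pos + wakeWord.size();
}

// Only audio after the wake word is streamed to AVS.
std::string audioForAip(const std::string& stream) {
    std::size_t start = detectWakeWord(stream, "Alexa");
    return start == std::string::npos ? "" : stream.substr(start);
}
```

In the real SDK the WWE reports sample indices into the SDS rather than substrings, but the gating idea is the same.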
Here are some details about each individual component in the sequence.
Audio Signal Processor (ASP)
The ASP isn't actually a component of the AVS Device SDK. It's software on a System on a Chip (SoC) or firmware on a dedicated Digital Signal Processor (DSP) whose job is to clean up the audio and produce a single audio stream, even if your device uses a multi-microphone array. Techniques used to clean the audio include Acoustic Echo Cancellation (AEC), noise suppression, beamforming, Voice Activity Detection (VAD), Dynamic Range Compression (DRC), and equalization.
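To make one of these techniques concrete, here is a toy energy-based Voice Activity Detection check: a frame counts as speech when its average absolute amplitude exceeds a threshold. Real ASPs run far more sophisticated DSP; the function and threshold here are illustrative only.

```cpp
#include <vector>
#include <cstdlib>

// Toy VAD: flag a frame as speech when its mean absolute amplitude
// exceeds a threshold. Samples are 16-bit-style integer amplitudes.
bool isSpeechFrame(const std::vector<int>& samples, int threshold) {
    if (samples.empty()) return false;
    long sum = 0;
    for (int s : samples) sum += std::abs(s);
    return sum / static_cast<long>(samples.size()) > threshold;
}
```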
Shared Data Stream (SDS)
The SDS is a single-producer, multi-consumer audio input buffer that transports data between a single writer and one or more readers. This ring buffer moves data through the different components of the SDK without duplicating it, which minimizes the memory footprint because the buffer continuously overwrites itself. The SDS operates on product-specific, user-specified memory segments, allowing for interprocess communication. Keep in mind that the writer and readers might be in different threads or processes.
SDS handles the following key tasks:
- Receives audio from the ASP and then passes it to the WWE.
- Passes the audio from the WWE to the ACL. The ACL then passes the audio to AVS for processing.
- Receives data attachments back from the ACL and passes them to the appropriate Capability Agent.
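The single-writer, multi-reader ring-buffer idea behind the SDS can be sketched as follows. This is a minimal, single-threaded illustration with made-up names; the real SDS additionally handles cross-thread and cross-process access.

```cpp
#include <vector>
#include <cstddef>

// Minimal single-writer, multi-reader ring buffer in the spirit of the SDS:
// one write cursor, an independent read cursor per reader, and old data
// overwritten as the writer wraps. Not thread-safe.
class RingBuffer {
public:
    explicit RingBuffer(std::size_t capacity) : buf_(capacity), writePos_(0) {}

    void write(short sample) { buf_[writePos_++ % buf_.size()] = sample; }

    // Each reader owns its cursor, so readers consume at their own pace.
    bool read(std::size_t& readerPos, short& out) const {
        if (readerPos >= writePos_) return false;    // nothing new to read
        if (writePos_ - readerPos > buf_.size())     // reader fell behind
            readerPos = writePos_ - buf_.size();     // skip overwritten data
        out = buf_[readerPos++ % buf_.size()];
        return true;
    }

private:
    std::vector<short> buf_;
    std::size_t writePos_;
};
```

Because there is one buffer and per-reader cursors, the WWE and the ACL can each read the same audio without the SDK copying it.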
Wake Word Engine (WWE)
The WWE is software that constantly monitors the SDS, waiting for a preconfigured wake word. When the WWE detects the correct wake word, it notifies the AIP to begin reading the audio. When using the AVS Device SDK, the wake word is always "Alexa".
The WWE consists of the following two binary interfaces:
- Interface 1 – Handles general wake word detection.
- Interface 2 – Handles specific wake word models.
Audio Input Processor (AIP)
Responsibilities of the AIP include reading audio from the SDS and then sending it to AVS for processing. The AIP also includes the logic to switch between different audio input sources. The AIP triggers on the following inputs:
- External audio – Captured with on-device microphones, remote microphones, and other audio input sources.
- Tap-to-Talk – Captured with designated Tap-to-Talk inputs.
- Speech directive – Sent from AVS to continue an interaction, for example, in multi-turn dialog.
When triggered, the AIP continues to stream audio until it receives a Stop directive or times out. AVS can only receive one audio input source at any given time.
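The AIP's trigger-and-stop behavior amounts to a small state machine. The sketch below captures the two rules stated above, namely that any trigger opens the stream and that only one source can stream at a time; the class and method names are illustrative, not SDK API.

```cpp
#include <string>

// Sketch of the AIP's streaming gate: a wake word, Tap-to-Talk, or Speech
// directive opens the stream; a Stop directive or a timeout closes it.
class AudioInputGate {
public:
    // Returns false if another source is already streaming.
    bool trigger(const std::string& source) {
        if (streaming_) return false;   // one audio input source at a time
        source_ = source;
        streaming_ = true;
        return true;
    }

    void onStopDirective() { streaming_ = false; }
    void onTimeout() { streaming_ = false; }

    bool isStreaming() const { return streaming_; }
    const std::string& activeSource() const { return source_; }

private:
    std::string source_;
    bool streaming_ = false;
};
```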
Alexa Communications Library (ACL)
The ACL manages the network connection between the SDK and AVS. The ACL performs the following key functions:
- Establishes and maintains long-lived persistent connections with AVS. ACL adheres to the messaging specification detailed in Managing an HTTP/2 Connection with AVS.
- Provides message sending and receiving capabilities, including support for JSON-formatted text and binary audio content. For more details, see Structuring an HTTP/2 Request to AVS.
- Forwards incoming directives to the ADSL.
- Handles disconnects and reconnections. If the device disconnects, the ACL automatically attempts to reconnect for you.
- Manages secure connections.
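Automatic reconnection logic like the ACL's typically uses exponential backoff, so repeated failures don't hammer the service. The schedule below is an assumption for illustration; the actual ACL retry timing is an implementation detail.

```cpp
#include <algorithm>

// Illustrative reconnect backoff: the wait doubles with each failed
// attempt, capped at a maximum. Base and cap values are made up.
int backoffMillis(int attempt, int baseMs = 500, int capMs = 32000) {
    long wait = baseMs;
    for (int i = 0; i < attempt; ++i) wait = std::min<long>(wait * 2, capMs);
    return static_cast<int>(wait);
}
```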
Alexa Directive Sequencer Library (ADSL)
The ADSL performs the following key functions:
- Accepts directives from the ACL.
- Manages the lifecycle of each directive, including queuing, reordering, or canceling directives as necessary.
- Forwards directives to the appropriate Capability Agents by examining the directive header and reading the namespace of the interface.
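The namespace-based routing in the last step can be sketched as a simple dispatch table. All types and names here are illustrative stand-ins, not the SDK's actual classes.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// A directive header carries an interface namespace and a name.
struct Directive {
    std::string ns;       // e.g. "SpeechSynthesizer"
    std::string name;     // e.g. "Speak"
    std::string payload;
};

// Sketch of the sequencer: it hands each directive to the Capability
// Agent registered for the directive's namespace.
class DirectiveSequencer {
public:
    using Handler = std::function<void(const Directive&)>;

    void registerAgent(const std::string& ns, Handler h) {
        agents_[ns] = std::move(h);
    }

    bool dispatch(const Directive& d) {
        auto it = agents_.find(d.ns);
        if (it == agents_.end()) return false;  // no agent for this interface
        it->second(d);
        return true;
    }

private:
    std::map<std::string, Handler> agents_;
};
```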
Capability Agents
A Capability Agent is what performs the desired action on a device. Capability Agents map directly to interfaces supported by AVS. For example, if you ask Alexa to play a song, a Capability Agent is what loads the song into your media player and plays it. A Capability Agent performs the following two tasks:
- Receives the appropriate directive from the ADSL.
- Reads the payload and performs the requested action on the device.
Activity Focus Manager Library (AFML)
The AFML makes sure the SDK handles directives in the correct order. It determines which capability has control over the input and output of the device at any time. For example, if you're playing music and an alarm goes off on your device, the alarm takes focus over the music. The music pauses and the alarm rings.
Focus uses a concept called channels to govern the prioritization of audiovisual inputs and outputs.
Channels exist in the foreground or background. At any given time, only one channel can inherit the foreground state and take focus. If more than one channel is active, a device must respect the following priority order: Dialog > Alerts > Content. When a channel in the foreground becomes inactive, the next active channel in the priority order moves into the foreground.
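The Dialog > Alerts > Content ordering can be modeled as a tiny priority-based focus manager. This is a conceptual sketch with invented names, reduced to the three channels named above; the SDK's focus manager supports more channels and asynchronous focus callbacks.

```cpp
#include <set>
#include <string>

// Toy focus manager: the active channel with the highest priority
// (lowest rank value) holds the foreground.
class FocusManager {
public:
    void acquire(const std::string& channel) { active_.insert(rank(channel)); }
    void release(const std::string& channel) { active_.erase(rank(channel)); }

    std::string foreground() const {
        if (active_.empty()) return "None";
        switch (*active_.begin()) {     // std::set keeps ranks ordered
            case 0:  return "Dialog";
            case 1:  return "Alerts";
            default: return "Content";
        }
    }

private:
    static int rank(const std::string& channel) {
        if (channel == "Dialog") return 0;
        if (channel == "Alerts") return 1;
        return 2;  // Content
    }
    std::set<int> active_;
};
```

This reproduces the alarm example: while music holds the Content channel, an alarm acquiring the Alerts channel takes the foreground, and releasing it returns focus to the music.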
Focus management isn't specific to Capability Agents or Directive Handlers. Agents that aren't related to Alexa also use it. By going through the AFML, all agents share a consistent view of focus across a device.
Presentation Orchestrator
The Presentation Orchestrator assists in managing the lifecycle of visual presentations on multimodal devices. The following list shows the lifecycle management activities:
- Taking requests for windows
- Dismissing or backgrounding presentations
- Managing timeouts
- State tracking and reporting
- Handling back or exit navigation commands
Media Player
The media player isn't actually a component of the AVS Device SDK. The SDK comes with wrappers for GStreamer and the Android Media Player. If you want to use a different media player, you must build a wrapper for it with the MediaPlayer interface. For more details about custom media players, see Media Player.
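The wrapper pattern looks roughly like the sketch below. This only illustrates the adapter idea; the SDK's actual MediaPlayer interface has different methods and types, so consult the SDK headers rather than this example.

```cpp
#include <string>

// Conceptual shape of a media player wrapper (not the SDK's interface):
// the abstract class is what the rest of the system calls, and a concrete
// wrapper adapts those calls to your platform's player.
class MediaPlayerWrapper {
public:
    virtual ~MediaPlayerWrapper() = default;
    virtual void setSource(const std::string& url) = 0;
    virtual bool play() = 0;
    virtual bool stop() = 0;
};

// A stand-in "platform player" used to show the adaptation.
class FakePlayer : public MediaPlayerWrapper {
public:
    void setSource(const std::string& url) override { url_ = url; }
    bool play() override { playing_ = !url_.empty(); return playing_; }
    bool stop() override { playing_ = false; return true; }
    bool isPlaying() const { return playing_; }

private:
    std::string url_;
    bool playing_ = false;
};
```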
- Review the AVS Terms and Agreements.
- The sound files – known as earcons – associated with the sample project are for prototyping purposes. For implementation and design guidance for commercial products, see Designing for AVS and AVS UX Guidelines.
All Alexa products must meet the AVS Security Requirements. In addition, when building the AVS Device SDK, you are required to adhere to the following security principles:
- Protect configuration parameters, such as those found in the AlexaClientSDKConfig.json file, from tampering and inspection.
- Protect executable files and processes from tampering and inspection.
- Protect persistent states of the SDK from tampering and inspection.
- Your C++ implementation of AVS Device SDK interfaces must not retain locks, crash, stop responding, or throw exceptions.
- Use exploit mitigation flags and memory randomization techniques when you compile your source code to prevent exploitation of buffer overflows and memory corruption.
For smart screen devices, refer to Additional security requirements for smart screen devices.