Apple’s online Machine Learning Journal has published a paper on the methodologies the HomePod uses to implement Siri functionality in far-field settings.
The entry explains in detail how the HomePod listens for the “Hey Siri” trigger phrase and how it keeps detection working in conditions that would normally make listening difficult. Those include when the smart speaker is playing loud music, when the person talking to the HomePod is far away from the device, or when other sources in the room, like a TV, are playing audio as well:
The typical audio environment for HomePod has many challenges — echo, reverberation, and noise. Unlike Siri on iPhone, which operates close to the user’s mouth, Siri on HomePod must work well in a far-field setting. Users want to invoke Siri from many locations, like the couch or the kitchen, without regard to where HomePod sits. A complete online system, which addresses all of the environmental issues that HomePod can experience, requires a tight integration of various multichannel signal processing technologies. Accordingly, the Audio Software Engineering and Siri Speech teams built a system that integrates both supervised deep learning models and unsupervised online learning algorithms and that leverages multiple microphone signals.
Specifically, the entry discusses two things: mask-based multichannel filtering, which uses deep learning to remove echo and background noise; and unsupervised learning to separate simultaneous sound sources, combined with trigger-phrase-based stream selection to eliminate interfering speech:
Far-field speech recognition becomes more challenging when another active talker, like a person or a TV, is present in the same room with the target talker. In this scenario, voice trigger detection, speech decoding, and endpointing can be substantially degraded if the voice command isn’t separated from the interfering speech components. Traditionally, researchers tackle speech source separation using either unsupervised methods, like independent component analysis and clustering, or deep learning [5, 6]. These techniques can improve automatic speech recognition in conferencing applications or on batches of synthetic speech mixtures where each speech signal is extracted and transcribed [6, 7]. Unfortunately, the usability of these batch techniques in far-field voice command-driven interfaces is very limited. Furthermore, the effect of source separation on voice trigger detection, such as that used with “Hey Siri”, has never been investigated previously. Finally, it’s crucial to separate far-field mixtures of competing signals online to avoid latencies and to select and decode only the target stream containing the voice command.
In short, Siri on HomePod implements a multichannel echo cancellation (MCEC) algorithm, which uses a set of linear adaptive filters to model the multiple acoustic paths between the loudspeakers and the microphones and cancel the acoustic coupling.
Due to the close proximity of the loudspeakers to the microphones on HomePod, the playback signal can be significantly louder than a user’s voice command at the microphone positions, especially when the user moves away from the device. In fact, the echo signals may be 30-40 dB louder than the far-field speech signals, resulting in the trigger phrase being undetectable on the microphones during loud music playback.
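For readers who want a feel for what a "linear adaptive filter" does here, the following is a minimal single-channel sketch, not Apple's implementation. It assumes a normalized least-mean-squares (NLMS) update, which the journal entry does not name, and it uses a made-up four-tap echo path, white-noise "music", and a quiet sine wave standing in for far-field speech. The function names `nlms_echo_cancel` and `power_db` are hypothetical.

```python
import math
import random


def nlms_echo_cancel(reference, mic, taps=8, step=0.5, eps=1e-8):
    """Subtract the echo of `reference` from `mic` with one NLMS adaptive filter."""
    weights = [0.0] * taps   # running estimate of the loudspeaker-to-mic path
    history = [0.0] * taps   # most recent reference samples, newest first
    residual = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate  # what remains after removing the estimated echo
        norm = sum(h * h for h in history) + eps
        # NLMS update: nudge the weights in the direction that shrinks the error
        weights = [w + step * e * h / norm for w, h in zip(weights, history)]
        residual.append(e)
    return residual, weights


def power_db(signal):
    """Average signal power in decibels."""
    return 10.0 * math.log10(sum(s * s for s in signal) / len(signal))


rng = random.Random(0)
n_samples = 4000

# Assumed test signals: white noise as playback, a faint sine as the voice.
playback = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
speech = [0.01 * math.sin(2.0 * math.pi * 0.01 * n) for n in range(n_samples)]

# Hypothetical echo path between the loudspeaker and one microphone.
echo_path = [0.9, 0.4, -0.2, 0.1]
echo = [
    sum(echo_path[k] * playback[n - k] for k in range(len(echo_path)) if n - k >= 0)
    for n in range(n_samples)
]
mic = [e + s for e, s in zip(echo, speech)]

residual, weights = nlms_echo_cancel(playback, mic)

# Compare power before and after cancellation once the filter has settled.
reduction_db = power_db(mic[-1000:]) - power_db(residual[-1000:])
print(f"Echo reduced by about {reduction_db:.1f} dB")
```

In this toy setup the playback is tens of decibels louder than the "speech" at the microphone, roughly the situation the excerpt describes. HomePod's system, per the article, handles multiple loudspeaker and microphone channels and adds mask-based filtering on top, none of which this sketch attempts.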
The paper concludes with examples of filtered and unfiltered audio from those HomePod tests. Regardless of whether you’re interested in the details of noise reduction technology, the sample audio clips are worth a listen. It’s impressive to hear barely audible commands emerge from background noises like a dishwasher and music playback.
Although the journal entry is very technical in nature, it’s a fascinating glimpse into the technology underpinning the HomePod.