How Apple Uses Deep Learning to Fix Siri’s Robotic Voice


Earlier this year, Apple introduced the “Apple Machine Learning Journal,” a blog detailing Apple’s work on machine learning, AI, and other related topics. The blog is written entirely by Apple’s engineers, and gives them a way to share their progress and interact with other researchers and engineers.

Now, in a new round of posts on that journal, Apple has shown how its AI technology has improved the voice of its Siri digital assistant. One post describes the deep learning-based speech synthesis system behind Siri and Apple Maps, which gives their voices "naturalness, personality and expressivity."

For iOS 11, the engineers at Apple worked with a new female voice actor to record 20 hours of US English speech, which was sliced into between 1 and 2 million audio segments and used to train a deep learning system. The team noted in its post that test subjects greatly preferred the new voice over the iOS 9 version from 2015.

Like its biggest competitors, Apple uses machine learning, a rapidly evolving technology, to make its computing devices better at understanding what we humans want and better at responding in a form we can understand. A big part of machine learning these days is neural networks, which are trained on real-world data (thousands of labeled photos, for example) to build an innate understanding of what a cat looks like.
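To make the idea in the paragraph above concrete: a model "learns" from labeled examples rather than from hand-written rules. Here is a toy sketch using a minimal perceptron on made-up 2D feature vectors; the data and labels are invented for illustration and bear no resemblance to Apple's actual systems.

```python
# Toy perceptron: learns a decision rule from labeled examples,
# adjusting its weights only when it makes a mistake.

def train(examples, epochs=20, lr=0.1):
    """Learn weights from (features, label) pairs; labels are +1 / -1."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), label in examples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else -1
            if pred != label:  # update only on errors
                w[0] += lr * label * x1
                w[1] += lr * label * x2
                b += lr * label
    return w, b

def predict(model, x):
    w, b = model
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1

# Made-up labeled data: "cat-like" (+1) vs "not cat-like" (-1) features
data = [((2.0, 1.5), 1), ((1.8, 2.2), 1),
        ((-1.0, -0.5), -1), ((-1.5, -1.2), -1)]
model = train(data)
print(predict(model, (2.1, 1.9)))    # 1
print(predict(model, (-1.2, -0.9)))  # -1
```

The same learn-from-examples principle, scaled up enormously, is what lets a speech synthesizer trained on hours of recorded audio produce natural-sounding output.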

The results are very noticeable. Siri's navigation instructions, responses to trivia questions and "request completed" notifications sound a lot less robotic than they did two years ago. You can hear the difference for yourself at the end of Apple's post.

Apple's Siri team also published two other posts today, titled "Improving Neural Network Acoustic Models by Cross-bandwidth and Cross-lingual Initialization" and "Inverse Text Normalization as a Labeling Problem." One details how Siri uses machine learning to display things like dates, times, addresses and currency amounts in a nicely formatted way; the other covers techniques Apple uses to make introducing a new language as smooth as possible.


  • Bill___A

Siri is not very useful and I haven’t seen any improvements from my perspective in years. It fails to answer even the most basic questions with a modicum of intelligence. I’d like to see them progress a lot faster and a lot better.

  • [anonymous]

    I feel like the trick to making Siri more user-friendly would be by making the users responsible for some of the “fixing” on an individual basis. I know intense customization isn’t the most Appley thing, but this would use a similar mechanic to the pre-existing Text Replacement feature. A “Siri Teaching Center” in Settings. Within this section, there would be two areas: “Voice Teaching” and “Command Teaching”.

    Voice Teaching would be useful for people with thick accents that Siri can’t always understand. A user may ask Siri to set an alarm before bed, but Siri consistently mishears the command. To fix this, they could go into Voice Teaching, type out “Set an alarm for 8AM.”, then be asked to read the typed text three times. Now when Siri hears something comparable to those audio recordings, she’ll replace the transcribed text with the assigned text command. This would not only help with accents, but also let people without strong accents speak quicker and more naturally when making frequent requests to Siri.

    While Voice Teaching would help Siri hear what you said, Command Teaching would help Siri understand what you want. You might speak very clearly and have Siri transcribe your speech perfectly, but she still will try searching the web or saying she doesn’t know what you mean because your syntax wasn’t proper. You’d use this feature by going into the Command Teaching area in Settings, creating a new command, typing out your custom command, and then specifying the command’s meaning in a menu beneath, with options like “Launch [application]”, “Call [Contact]”, “Open [URL]”, or “Get directions to [Address]”. This can be used for practical commands that are misinterpreted or funny ones. For example, “Launch the app whose name I can’t pronounce” *Launches Edmodo*, or “Let’s go to Hell” *Gets directions to workplace*.
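The "Command Teaching" idea in the comment above amounts to a user-defined lookup table consulted before Siri's generic fallback. Here is a minimal sketch of that mechanic; every name in it is invented for illustration, and it is of course not how Siri actually works.

```python
# Hypothetical "Command Teaching" table: custom phrases map to concrete
# intents, checked before falling back to a generic web search.

def normalize(text):
    """Lowercase and strip punctuation so matching tolerates small variations."""
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).strip()

# User-created commands: normalized phrase -> (action, argument)
custom_commands = {
    normalize("Launch the app whose name I can't pronounce"): ("launch", "Edmodo"),
    normalize("Let's go to Hell"): ("directions", "workplace"),
}

def interpret(transcription):
    """Return an (action, argument) intent, falling back to web search."""
    key = normalize(transcription)
    return custom_commands.get(key, ("web_search", transcription))

print(interpret("Let's go to hell!"))   # ('directions', 'workplace')
print(interpret("What's the weather"))  # ('web_search', "What's the weather")
```

A real system would need fuzzy matching rather than exact lookup, but the table-before-fallback structure is the core of what the commenter is proposing.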

  • jabohn

    Your first item can kind of already be done. I can’t remember if this is new with iOS 11 or iOS 10, but on the Siri screen if it misunderstands what you said, you can tap edit to clarify your request. It should then learn from that correction.

    Your second item reminds me of “Speakable Items” in macOS and I wonder if something similar is in the pipeline, connected to the macro app Apple bought this year.