Last year, some friends discovered that I didn’t use the digital assistant on my phone very often. This seemed counterintuitive, given that I am a linguist interested specifically in speech, including speech recognition. So, as an exercise, I reactivated my “Hey Siri” feature and said “Hey Siri” a handful of times into my microphone when prompted. I have since embarked on a journey of talking to a rectangular piece of plastic and glass, with varying degrees of success. But something I noticed was that the voice activation feature on my phone (“Hey Siri”) didn’t work for the people around me. It seemed that, from only a few examples, the software had specialized to my voice and my voice alone.

So, I set out to do a highly uncontrolled observational study: who could activate my Siri? I first asked people to try to voice-activate my phone with no instruction, then asked them to try to imitate my voice. Of the ten or twenty people I asked, two (whose voices were similar to mine) were able to do it without any coaching. One other was able to voice-activate my phone by pitching her voice lower and imitating mine.

Specializing a digital assistant to a specific user is an interesting task in language technology. In much of natural language processing, the goal is to be as speaker-independent as possible: it is, of course, desirable that a speech recognition system work on voices that are not in its training set. Wake-word enrollment goes in the opposite direction: in setting up “Hey Siri,” I was training it to respond to my voice specifically. With only a few data points, it appears that the algorithm vastly narrows the range of voices it accepts. However, it must still permit inputs that are sufficiently close to those examples, since a handful of recordings cannot capture the full range of ways a single user will say even such a simple phrase. The side effect is that voices sufficiently similar to the user’s own are also accepted.
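To make the idea concrete, here is a minimal sketch of how such a system could work, assuming a generic speaker-verification setup rather than Apple’s actual implementation: each enrollment utterance is turned into a fixed-length voice embedding, the embeddings are averaged into a speaker profile, and a new utterance triggers activation only if it lands close enough to that profile. The embedding function, similarity measure, and threshold below are all illustrative assumptions.

```python
import numpy as np

def embed(utterance_features: np.ndarray) -> np.ndarray:
    """Toy voice embedding: average per-frame acoustic features (e.g., MFCCs)
    into one fixed-length vector. A real system would use a trained
    speaker-embedding model instead."""
    return utterance_features.mean(axis=0)

def enroll(enrollment_utterances: list[np.ndarray]) -> np.ndarray:
    """Build a speaker profile from the handful of enrollment examples
    recorded during setup."""
    return np.mean([embed(u) for u in enrollment_utterances], axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def accepts(profile: np.ndarray, new_utterance: np.ndarray,
            threshold: float = 0.85) -> bool:
    """Fire the wake word only if the new utterance's embedding is close
    enough to the enrolled profile."""
    return cosine_similarity(profile, embed(new_utterance)) >= threshold
```

The threshold in a sketch like this is what produces the behavior described above: it has to be loose enough to tolerate the day-to-day variation in one speaker’s voice, which means some acoustically similar speakers will inevitably slip through as well.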

As a linguist, my next question is: What makes my voice similar to the people who also activated my phone, and what makes it different? My scientific curiosity has been piqued, and I am looking forward to the next phase in my digital assistant journey: talking to Siri exclusively in silly voices to see how far a departure from my voice is tolerated by the algorithm.

Author: 

Emily Grabowski

I am a PhD student in Linguistics. My research interests include understanding how our speech production and speech perception systems constrain linguistic variation, especially as it applies to the larynx. I am also interested in integrating theoretical representations of language with speech. I approach this using a broad variety of tools/methodologies, including theoretical work, experiments, and modeling. Current projects include developing a computational tool to expedite analysis of pitch and an online perception experiment on the relationship between pitch and perceived duration.