As part of my Marie Curie PhD programme, I spent two months at Oxford Wave Research, where I learned about forensic speech technology from world-leading experts.
My internship centred on researching neural architectures for a spoofing countermeasure. But what is spoofing? It is a way to fool automatic speaker verification systems, which are supposed to determine whether it is you or somebody else speaking. Like your fingerprint or DNA, your voice can also be used as a biometric identifier to prove who you are when needed. And as with any biometric, identity theft is a significant threat.
Imagine that you secure your mobile phone so that it can only be unlocked with your voice. Now suppose that somebody records your voice while you are talking to a friend at a cafe and plays the recording back to unlock your phone. This is known as a replay attack in the jargon of spoof detection, and it counts as “physical access” to your voice.
A common defence against this type of attack is to use a special passphrase that you would not say in ordinary conversation. This makes it highly unlikely that somebody could record it while you are not paying attention. However, given enough recordings of you, a very convincing text-to-speech synthesiser could be built to synthesise your passphrase. The intruder then gains “logical access” to your voice, because the synthesiser has learned the rules for generating anything you might say. How do you convince your phone that your speech is the real one (also known as bona fide) and not the synthesised one (also known as spoofed)?
The ASVspoof challenge
The ASVspoof challenge provides a way to train, validate and benchmark any spoofing countermeasure, with a wide set of examples from the latest voice conversion and text-to-speech technologies (logical access) as well as replay attacks (physical access).
The challenge’s evaluation paper showed some really interesting findings about the state of spoofing countermeasures. From the generation perspective, we are usually concerned with psychoacoustic performance: does the speech sound natural to humans? Thus, it is not surprising that it is possible to exploit weaknesses of these systems in the low-frequency regime, which is less important for naturalness. This is emphasised by the use of the constant-Q transform and linear frequency cepstral coefficients.
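To make linear frequency cepstral coefficients a bit more concrete, here is a minimal sketch of how they could be computed for a single frame of audio. They follow the same recipe as the more familiar mel-frequency coefficients, except that the triangular filters are spaced evenly in Hz. The sketch assumes NumPy, and the function names, the 20 filters and the Hamming window are my own illustrative choices, not the challenge's reference implementation.

```python
import numpy as np


def linear_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centres are evenly spaced in Hz."""
    n_bins = n_fft // 2 + 1
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    # n_filters triangles need n_filters + 2 edge frequencies.
    edges = np.linspace(0.0, sample_rate / 2.0, n_filters + 2)
    bank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        low, centre, high = edges[i], edges[i + 1], edges[i + 2]
        rising = (bin_freqs - low) / (centre - low)
        falling = (high - bin_freqs) / (high - centre)
        # The triangle is the smaller of the two slopes, floored at zero.
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def lfcc(frame, sample_rate, n_filters=20, n_coeffs=20):
    """Illustrative LFCCs for one frame (a 1-D array of samples)."""
    n_fft = len(frame)
    windowed = frame * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(windowed)) ** 2
    energies = linear_filterbank(n_filters, n_fft, sample_rate) @ power
    # Small constant avoids log(0) for silent bands.
    log_energies = np.log(energies + 1e-10)
    # Unnormalised DCT-II written out by hand, to avoid extra dependencies.
    k = np.arange(n_coeffs)[:, None]
    n = np.arange(n_filters)[None, :]
    basis = np.cos(np.pi * k * (n + 0.5) / n_filters)
    return basis @ log_energies
```

Calling `lfcc(frame, 16000)` on a 512-sample frame returns a vector of 20 coefficients. A real front end would also add framing over the whole utterance and usually delta features.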
Also, single neural networks (that is, not fused with other systems) had a really hard time outperforming the baseline Gaussian mixture models in the ASVspoof challenge. This raises certain doubts about neural networks being the best for all applications and reminds us of the need for explainable and understandable neural networks.
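For readers who have not met this kind of baseline, the idea can be sketched in a few lines: fit one Gaussian mixture model to bona fide speech and another to spoofed speech, then score an utterance by the average log-likelihood ratio of its feature frames. The code below is only a toy illustration under my own assumptions. It uses diagonal covariances, and the two tiny hand-written models `BONAFIDE_GMM` and `SPOOF_GMM` are hypothetical stand-ins for models trained on real features.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, gmm):
    """Log-likelihood of x under a GMM given as (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)
    # Log-sum-exp keeps the sum numerically stable.
    return top + math.log(sum(math.exp(t - top) for t in terms))


def llr_score(frames, bonafide_gmm, spoof_gmm):
    """Average log-likelihood ratio; positive values lean towards bona fide."""
    total = sum(
        gmm_log_likelihood(f, bonafide_gmm) - gmm_log_likelihood(f, spoof_gmm)
        for f in frames
    )
    return total / len(frames)


# Hypothetical two-component models over 2-D features, for illustration only.
BONAFIDE_GMM = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])]
SPOOF_GMM = [(0.5, [3.0, 3.0], [1.0, 1.0]), (0.5, [4.0, 4.0], [1.0, 1.0])]
```

With these toy models, frames close to the origin get a positive score and frames close to (3, 3) get a negative one. Where to put the decision threshold depends on the application.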
Technical skills I learned
I set myself a lot of personal challenges during the internship. I hadn’t had much exposure to PyCharm before, so I used it to get a better idea of what is missing from my Emacs routines (a lot of things). I also had the opportunity to work with some voice conversion tools.
Oxford in general
I feel like I give a dining guide for the city I’ve been to in every single one of my blog posts, but I do think there are some places in Oxford which are worth trying. For Asian food, Opium Den is one of the nicest restaurants in Oxford, but make sure to bring your friends if you usually eat alone, because the sharer menus are the real deal there. The Rickety Press is also a very nice restaurant; some Oxford students are actually convinced that every single item on its menu is delicious, without exception. However, avoid Green Bamboo: I had a very bad eating experience there.
The Oxford Wave Research team having dinner.
My time at the company
Oxford Wave Research was an intellectually stimulating and fast-paced place to learn more about speech forensics. I received a lot of feedback on my coding style from Oscar Forth; it is really rare that I get code review from other people, so I was glad to get some during my stay. I was also able to taste one of the most special Turkish delights, and most importantly I had very lengthy and insightful discussions about intermittent fasting, the ketogenic diet and calorific theory with Dr Anil Alexander, which I tremendously enjoyed. I would also like to thank Dr Finnian Kelly for giving me directions when I was lost and for sharing his expertise in score fusion and voice conversion. And last but not least, thank you Nikki, Sam and Zofia for all the meaningful interactions we had, and for your enthusiasm and sustained work morale in this wonderful company.