It was one of my mini PhD goals to train a full-fledged Tacotron 2 model, but I lacked the project goal and resources to do it. So my winter “vacation” came, meaning I could finally fiddle with the Mellotron model.
For those who are unfamiliar with Mellotron, fear not, this is a relatively recent model of NVIDIA, it is simply put a Tacotron 2 variant conditioning speech information on pitch and rhythm information. This means, that in theory it should be able to generate singing without actual singing data. And in fact, it does surprisingly well knowing that it doesn’t use singing data, though it cannot extrapolate pitch well. (the authors say in their paper in case the pitch is out of domain for a singer/speaker, the pitch defaults to the max/min)
So I decided to have a look at their examples, and trained a Mellotron on the crunchy Mozilla Common Voice Dutch corpus. After talking with some industry experts, and looked around on the internet, a week of training with 4 GPUs are usually recommended, but the loss curve graciously converged in a day for the Dutch TTS (where S stand for sing now). So I listened to it and here are few insights that I want to share:
- The Tacotron 2 usually works with at least of 3-6 hours of clean audio data from a single speaker
- Although I cannot draw general conclusions on the quality of the Common Voice dataset, as an “open contribution” dataset, it certainly has some obvious limitations compared to read speech by voice talents. The acoustic conditions are not well controlled, and although there are more than 400 speakers, about 3 hour of data is from one single speaker according to the client ids.
- (Conclusion) Tacotron 2 will not typically works with the Common Voice Datasets
Related to Mellotron, let me do some praising first, because even though I am unsatisfied with some aspects of it (which has more to do with me as a person), I really like how easy it was to use, and seemed to work out of the box. The Nicki Minaj example from their results page, I found a bit misleading, because the whole style transfer seemed to struggle if there is background music.
My idea was to translate and reproduce a Dutch folk song with the style transfer setup of the Mellotron.
What makes Mellotron difficult to use in general is that the phonetic content imposes some pitch and duration in itself, so when you try to translate a song, you end up getting a weird alignment. I think this is typical thing happening also in non-AI song translation.
The limiting factor really is the lyrics at this point. I’m probably gonna get back to this model eventually, it was a nice lesson in speech synthesis.