There are two main AI approaches to synthesising a singing (or speaking) voice:
- Text to speech (TTS)
- Voice conversion or style transfer (VC)
The current state of the art in TTS is Tacotron 2, which needs about 6 hours of data to learn a new voice. The recordings have to be good quality, but not necessarily studio quality: 22,050 Hz sampling rate, mono, 16-bit depth. This is probably not feasible for us right now.
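To make the recording spec above concrete, here is a minimal sketch (using only Python's standard-library `wave` module; the function name is my own) that checks whether a WAV file matches the 22,050 Hz / mono / 16-bit format:

```python
import wave

# Target format from the note above: 22,050 Hz, mono, 16-bit PCM.
TARGET_RATE = 22050
TARGET_CHANNELS = 1
TARGET_SAMPLE_WIDTH = 2  # sample width in bytes; 2 bytes = 16-bit


def meets_tts_spec(path):
    """Return True if the WAV file at `path` matches the target format."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == TARGET_RATE
                and wav.getnchannels() == TARGET_CHANNELS
                and wav.getsampwidth() == TARGET_SAMPLE_WIDTH)
```

A batch of candidate recordings could be filtered with this before any resampling or mixing-down is attempted.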
Voice conversion usually needs 5–20 minutes of training data. However, this means that the source sentences need to be sung by someone (perhaps an AI), singing the same content as the target sentences. If an AI sings the source, in theory we still have the freedom of choosing what to sing (i.e. the lyrics). Note, however, that because the source quality is lower, the target quality will likely be lower as well.
```mermaid
graph LR
  SourceVoice --> SourcePitch
  SourceVoice --> SourceSpectrum
  SourceVoice --> SourceBAP
  SourcePitch -->|VC| TargetPitch
  SourceSpectrum -->|VC| TargetSpectrum
  SourceBAP -->|VC| TargetBAP
  TargetPitch --> TargetVoice
  TargetSpectrum --> TargetVoice
  TargetBAP --> TargetVoice
```
My proposals are the following:
(1) Mellotron. We can keep Willie's style (pitch and rhythm) but use a pre-trained AI singing voice, via Mellotron's rhythm transfer.
```mermaid
graph LR
  Text -->|Tacotron2| Speech
  Willie -->|Mellotron| WilliePitch
  Willie -->|Mellotron| WillieRhythm
  Speech -->|Mellotron| AISING
  WilliePitch -->|Mellotron| AISING
  WillieRhythm -->|Mellotron| AISING
```
(2) Voice conversion. If we want to be more ambitious, we can use the Mellotron sentences as source sentences and the Willie sentences as target sentences during training. Then we can make Mellotron sing something else, i.e. copy the rhythm and the pitch of an existing singing performance.
There are several frameworks for performing voice conversion, ranging from Gaussian Mixture Models (more traditional machine learning) to Generative Adversarial Networks (deep learning).
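To illustrate the GMM family without any ML dependencies, here is a sketch of the degenerate one-component case: fit a single joint Gaussian over paired (source, target) features from the parallel sentences, then convert with the MMSE estimate (the conditional mean). Scalar features and function names are my own simplifications; real GMM-based VC uses multiple mixture components over spectral feature vectors.

```python
def fit_joint_gaussian(xs, ys):
    """Fit a joint Gaussian over aligned (source, target) feature pairs.
    This is the one-component special case of GMM-based conversion."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cxx = sum((x - mx) ** 2 for x in xs) / n          # source variance
    cxy = sum((x - mx) * (y - my)
              for x, y in zip(xs, ys)) / n            # cross-covariance
    return mx, my, cxx, cxy


def convert(x, params):
    """MMSE estimate of the target feature given a source feature:
    E[y | x] = mu_y + (cov_xy / cov_xx) * (x - mu_x)."""
    mx, my, cxx, cxy = params
    return my + (cxy / cxx) * (x - mx)
```

With more components, each frame is converted by a posterior-weighted sum of such per-component linear mappings; a GAN replaces this closed-form mapping with a learned generator.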