Summary of what is feasible and what is not

Date:

There are two main AI approaches to synthesising singing (or speech in general):

  • Text to speech (TTS)
  • Voice conversion or style transfer (VC)

The current state of the art in TTS is Tacotron-2, which needs about 6 hours of data to learn a new voice. The recordings have to be good quality, though not necessarily studio quality: 22,050 Hz sampling frequency, mono, 16-bit depth. This is probably not feasible now.
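As a quick sanity check on the data format, here is a minimal sketch (assuming Python with the librosa and soundfile packages; the file names are hypothetical) that converts a recording to the 22,050 Hz / mono / 16-bit format described above:

```python
# Minimal sketch: normalise a recording to the format Tacotron-2 expects.
import librosa
import soundfile as sf

TARGET_SR = 22050  # 22.05 kHz sampling frequency

# librosa resamples and downmixes to mono in one step
audio, sr = librosa.load("willie_take_01.wav", sr=TARGET_SR, mono=True)

# write back as 16-bit PCM
sf.write("willie_take_01_22k.wav", audio, TARGET_SR, subtype="PCM_16")
```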

Voice conversion usually needs 5–20 minutes of training data. However, this means that some source sentences need to be sung by someone (perhaps an AI), and they have to sing the same content as the target sentences. If an AI sings them, in theory we still have the freedom of “choosing what to sing” (which lyrics). Note, however, that because the source quality is lower, the target quality will likely be lower as well.
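Because parallel source/target recordings of the same sentence still differ in timing, VC training data is typically aligned frame-by-frame first, e.g. with dynamic time warping (DTW) on MFCC features. A minimal sketch, assuming librosa (file names hypothetical):

```python
import librosa
import numpy as np

def aligned_feature_pairs(src_path, tgt_path, sr=22050, n_mfcc=24):
    """Align two renditions of the same sentence with DTW on MFCCs."""
    src, _ = librosa.load(src_path, sr=sr, mono=True)
    tgt, _ = librosa.load(tgt_path, sr=sr, mono=True)
    src_mfcc = librosa.feature.mfcc(y=src, sr=sr, n_mfcc=n_mfcc)
    tgt_mfcc = librosa.feature.mfcc(y=tgt, sr=sr, n_mfcc=n_mfcc)
    # wp is the warping path: pairs of (src_frame, tgt_frame) indices
    _, wp = librosa.sequence.dtw(X=src_mfcc, Y=tgt_mfcc)
    wp = np.asarray(wp[::-1])  # path is returned end-to-start
    return src_mfcc[:, wp[:, 0]].T, tgt_mfcc[:, wp[:, 1]].T

# X and Y are now time-aligned (frames, n_mfcc) matrices for VC training
X, Y = aligned_feature_pairs("ai_rendition.wav", "willie_rendition.wav")
```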

```mermaid
graph LR
SourceVoice-->SourcePitch
SourceVoice-->SourceSpectrum
SourceVoice-->SourceBAP
SourcePitch-->|VC|TargetPitch
SourceSpectrum-->|VC|TargetSpectrum
SourceBAP-->|VC|TargetBAP
TargetPitch-->TargetVoice
TargetSpectrum-->TargetVoice
TargetBAP-->TargetVoice
```
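The decomposition in the diagram is what a WORLD-style vocoder provides; a minimal sketch of the analysis/synthesis round trip, assuming the pyworld package (file names hypothetical):

```python
import numpy as np
import pyworld
import soundfile as sf

x, fs = sf.read("source_voice.wav")  # assumed to be a mono recording
x = np.ascontiguousarray(x, dtype=np.float64)

f0, t = pyworld.harvest(x, fs)         # SourcePitch: fundamental frequency
sp = pyworld.cheaptrick(x, f0, t, fs)  # SourceSpectrum: spectral envelope
ap = pyworld.d4c(x, f0, t, fs)         # SourceBAP: (band) aperiodicity

# a VC model would map f0/sp/ap towards the target voice here;
# resynthesising unchanged features just reconstructs the source voice
y = pyworld.synthesize(f0, sp, ap, fs)
sf.write("reconstructed.wav", y, fs)
```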

My proposals are the following:

(1) Mellotron. We can keep Willie’s style (pitch and rhythm) but use a pre-trained AI singing voice, via Mellotron’s rhythm and pitch transfer.

```mermaid
graph LR
Text-->|Tacotron2|Speech
Willie-->|Mellotron|WilliePitch
Willie-->|Mellotron|WillieRhythm
Speech-->|Mellotron|AISING
WilliePitch-->|Mellotron|AISING
WillieRhythm-->|Mellotron|AISING
```
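In code, this pipeline reduces to extracting Willie’s pitch contour and feeding it, together with the new text, to the pre-trained model. The sketch below uses librosa’s pyin f0 tracker for the pitch step; mellotron_infer is a hypothetical placeholder, not NVIDIA’s actual Mellotron API:

```python
import librosa
import numpy as np

# Extract the reference pitch contour (WilliePitch in the diagram).
ref, sr = librosa.load("willie_reference.wav", sr=22050, mono=True)
f0, voiced, _ = librosa.pyin(ref,
                             fmin=librosa.note_to_hz("C2"),
                             fmax=librosa.note_to_hz("C6"))
f0 = np.where(voiced, f0, 0.0)  # zero out unvoiced frames

# Hypothetical call standing in for the Mellotron inference code;
# rhythm comes from the attention alignment of the reference recording.
# audio = mellotron_infer(text=new_lyrics, pitch_contour=f0, rhythm=ref)
```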

(2) Voice conversion. If we want to be more ambitious, we can use the Mellotron sentences as source sentences and Willie’s sentences as target sentences during training. Then we can make Mellotron sing something else, i.e. copy the rhythm and pitch of an existing singing performance.

There are several frameworks for performing voice conversion, e.g. Gaussian mixture models (more traditional machine learning) or generative adversarial networks (deep learning).
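For the GMM route, the classical recipe is a joint-density GMM (Stylianou/Toda style): fit a mixture on stacked, time-aligned [source | target] frames, then convert new source frames via the conditional expectation E[y | x]. A minimal sketch, assuming scikit-learn/scipy and the aligned matrices X, Y from the DTW step above:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_jdgmm(X, Y, n_components=8):
    """Fit a joint-density GMM on stacked [source | target] frames."""
    Z = np.hstack([X, Y])  # shape (frames, 2 * dim)
    return GaussianMixture(n_components=n_components,
                           covariance_type="full").fit(Z)

def convert(gmm, X):
    """Map source frames to target frames via E[y | x]."""
    d = X.shape[1]
    # responsibilities p(k | x) under each component's source marginal
    post = np.column_stack([
        gmm.weights_[k] * multivariate_normal.pdf(
            X, gmm.means_[k, :d], gmm.covariances_[k, :d, :d])
        for k in range(gmm.n_components)])
    post /= post.sum(axis=1, keepdims=True)
    out = np.zeros_like(X)
    for k in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[k, :d], gmm.means_[k, d:]
        S_xx = gmm.covariances_[k, :d, :d]
        S_yx = gmm.covariances_[k, d:, :d]
        # E[y | x, k] = mu_y + S_yx S_xx^{-1} (x - mu_x)
        cond = mu_y + (X - mu_x) @ np.linalg.solve(S_xx, S_yx.T)
        out += post[:, k:k + 1] * cond
    return out

gmm = fit_jdgmm(X, Y)
Y_pred = convert(gmm, X)  # converted features, frame-by-frame
```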