Fiddling with papers: MOSNet

4 minute read


One of the recurring themes in my PhD is the topic of naturalness of signals, in particular the naturalness of speech. It is challenging to define what we mean by naturalness – in fact this is the problem itself. Though, we could find a definition for it in “natural language”, it wouldn’t give us the mathematical formula for it.

In the machine learning literature this problem is observed in different ways. There is the phenomenon that minimising a trivial loss function, such as mean squared error won’t neccesarily give us natural samples of speech. This is due to the fact that these models learn “mean examples” - a training set of red and green birds is going to give us some blue birds. Generative adversarial networks try to alleviate this problem, by teaching neural networks what samples are natural or not, which can be used as a loss function then.

Last autumn, during Interspeech I briefly saw this paper called MOSNet, which is a simple neural network trying to solve this problem. It is using a CNN-LSTM feature extractor to obtain frame-level mean opinion scores, and then a global averaging operation to obtain utterance-level mean opinion scores. These scores are both used in the final loss function with equal weighting. This is an interesting idea, as the problem with naturalness is that it can be a global phenomena (like consistently bad speaker quality) or local phenomena (i.e. when the intonation is suspicously wrong at a few places).

The overall paper seems promising, and there is a GitHub repository associated with it here. The generalisation results on other voice conversion datasets show promise, and I managed to succesfully run it on some audio samples. There is a pre-trained model in the repo, so it’s relatively easy to set up a pipeline with it. Follow these steps:

  1. Make a directory wav inside the data folder and copy your wav files to it. The model was trained with 16000 kHz audio, so I would probably downsample with

sox input.wav -r 16000 output.wav

first. You can check the sampling rate with soxi.

  1. You can run then the python, which does a magnitude spectrogram calculation for you and places it in a disk-contigous format, so it will be a bit faster if you are trying to something larger scale on it.

  2. You need to rewrite a little bit, here is what I’ve done to just calculate the scores for all processed h5 files:

import os
import time 
import numpy as np
from tqdm import tqdm
import scipy.stats
import pandas as pd
import matplotlib
# Force matplotlib to not use any Xwindows backend.
import matplotlib.pyplot as plt
from glob import glob
import tensorflow as tf
from tensorflow import keras
import model
import utils   
import random

# 0 = all messages are logged (default behavior)
# 1 = INFO messages are not printed
# 2 = INFO and WARNING messages are not printed
# 3 = INFO, WARNING, and ERROR messages are not printed
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # or any {'0', '1', '2'}

# set memory growth
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
# set dir
DATA_DIR = './data'
BIN_DIR = os.path.join(DATA_DIR, 'bin')
PRE_TRAINED_DIR = './pre_trained'
OUTPUT_DIR = './output'

NUM_TRAIN = 13580

if not os.path.exists(OUTPUT_DIR):
print('{} for training; {} for valid; {} for testing'.format(NUM_TRAIN, NUM_VALID, NUM_TEST))    

# init model
MOSNet = model.CNN_BLSTM()
model =

# load pre-trained weights
model.load_weights(os.path.join(PRE_TRAINED_DIR, 'cnn_blstm.h5'))   # Load the best model   


test_list = glob(os.path.join(BIN_DIR,"*.h5"))

for i in tqdm(range(len(test_list))):
    _feat =[i])
    _mag = _feat['mag_sgram']    
    [Average_score, Frame_score]=model.predict(_mag, verbose=0, batch_size=1)

The full paper can be read ([here].