Oral cancer speech corpus

Synthesising, understanding and recognising pathological speech remain major open problems. This dataset is aimed at people who want to investigate new ways of solving these problems but lack the data to do so.

This site’s purpose is to inform interested people about changes and recent developments in the oral cancer speech dataset, and to provide links to the publications that use it.

| Name of link | URL | For whom? |
| --- | --- | --- |
| Official Zenodo page | link | also contains Kaldi features for ease of reproduction |
| Full corpus | link | for people who are only interested in the data and want to process it in their own way |
| Detector and analysis partition | link | for people who want to improve on the baselines quickly |
| Detector and analysis code | link | for people who want to reproduce the GMM/LASSO experiments or just understand the methods |
| Detector and analysis paper (arXiv) | link | for people interested in our conclusions regarding oral cancer speech |
| Metadata for the dataset | link | metadata about the speakers in the dataset |
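For readers who want to inspect the Kaldi features without a Kaldi installation, the following is a minimal sketch of a reader for Kaldi's text matrix format. It assumes the features have first been converted to text (for example with Kaldi's `copy-feats ark:feats.ark ark,t:feats.txt`); the file names and the helper `parse_text_ark` are hypothetical and not part of the dataset or its code release.

```python
from typing import Dict, Iterable, List


def parse_text_ark(lines: Iterable[str]) -> Dict[str, List[List[float]]]:
    """Parse Kaldi text-format matrices into {utterance_id: rows of floats}.

    Assumed layout (Kaldi text archive):
        utt_id  [
          1.0 2.0 3.0
          4.0 5.0 6.0 ]
    """
    features: Dict[str, List[List[float]]] = {}
    current_id = None
    rows: List[List[float]] = []

    for raw in lines:
        line = raw.strip()
        if not line:
            continue

        if current_id is None:
            # Header line: utterance id followed by the opening bracket.
            tokens = line.split()
            if len(tokens) < 2 or tokens[1] != "[":
                raise ValueError(f"expected 'utt_id [' but got: {line!r}")
            current_id = tokens[0]
            rows = []
            values = tokens[2:]
        else:
            values = line.split()

        # The closing bracket sits at the end of the last row.
        closes = bool(values) and values[-1] == "]"
        if closes:
            values = values[:-1]
        if values:
            rows.append([float(v) for v in values])
        if closes:
            features[current_id] = rows
            current_id = None

    if current_id is not None:
        raise ValueError(f"matrix for {current_id!r} is not terminated by ']'")
    return features


if __name__ == "__main__":
    # Tiny in-memory example instead of a real file (hypothetical values).
    sample = [
        "utt1  [",
        "  1.0 2.0 3.0",
        "  4.0 5.0 6.0 ]",
        "utt2  [",
        "  0.5 -1.5 2.0 ]",
    ]
    for utt_id, matrix in parse_text_ark(sample).items():
        print(utt_id, len(matrix), "frames,", len(matrix[0]), "dims")
```

With a real file, the same function can be called as `parse_text_ark(open("feats.txt"))`. For binary archives or large feature sets, a dedicated Kaldi I/O library is likely a better choice than this plain-Python sketch.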

If you use this dataset, please cite the arXiv version (until the official IS 2020 citation is available):

    title={Detecting and analysing spontaneous oral cancer speech in the wild},
    author={Bence Mark Halpern and Rob van Son and Michiel van den Brekel and Odette Scharenborg},


The YouTube data is available under fair use. The dataset is available under the Creative Commons 4.0 license. We encourage research and educational use of the dataset, but commercial use is not allowed.


2020-09-19: Uploaded the metadata (protocol) file for the dataset

2020-09-12: Uploaded a Google Drive copy of the dataset for easier access.


This project has received funding from the European Union’s Horizon 2020 research and innovation programme under Marie Skłodowska-Curie grant agreement No 766287. The Department of Head and Neck Oncology and Surgery of the Netherlands Cancer Institute receives a research grant from Atos Medical (Hörby, Sweden), which contributes to the existing infrastructure for quality of life research.