Transcribing 3 hours of pathological speech


It has been a month since my last blog post, but here we are again. Recently, I started a collaboration in Delft so that we can discover even more aspects of pathological speech. For that, I had to transcribe three hours of oral cancer speech, and I’ve learned many lessons that I think are interesting for other people and might be widely applicable.

1. Conversational language is pathological in terms of the grammar

While I had expected spoken language to be less polished and less grammatical than its written counterpart, transcribing each word made me realise that grammar and spoken language are sometimes two disjoint sets. In English, you can get by surprisingly well by repeating “It’s like …” and then appending your favourite (and relevant) words. This is not a problem in general, but when transcribing pathological speech, grammar is sometimes the last resort you have. A prime example is the plural versus singular form of the same word, where the grammatical context helps more than the actual acoustic cues in the audio.

2. When you transcribe, you don’t listen to the recordings only once

Computationally, there is no such thing as listening to an audio file multiple times, but when transcribing these recordings you often have to replay them a couple of times. The funniest part is the adaptation process: after listening to a couple of recordings from the same person, you really do get better, sort of like a “human” transfer learning. Some improvements come simply from hearing similar content in another recording.

3. What do you do with paralinguistic content?

For pathological speech, coughs, hmms and gasps should be considered very important, but I haven’t included them in the initial transcriptions. I don’t want to dwell on this issue: it is clear that, depending on the exact goal we want to achieve, they should play a greater or smaller role. Some work that I saw at Interspeech this September made me realise that this information might not be important for certain ASR applications, but is very important for convincing TTS applications.
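One way to keep both options open is to annotate the events inline and filter them per application. The snippet below is only a minimal sketch of that idea: the square-bracket tag format and the function names are my own assumptions, not an established convention, and the example sentence is made up.

```python
import re

# Hypothetical convention: paralinguistic events are written as lower-case
# words in square brackets, e.g. "[cough]". No standard format is implied.
TAG = re.compile(r"\[[a-z]+\]")


def strip_paralinguistic(transcript: str) -> str:
    """Remove bracketed event tags and collapse the leftover whitespace."""
    without_tags = TAG.sub(" ", transcript)
    return " ".join(without_tags.split())


def list_events(transcript: str) -> list[str]:
    """Return the event names in order of appearance, without brackets."""
    return [tag[1:-1] for tag in TAG.findall(transcript)]


# Made-up example sentence, not taken from the actual recordings.
annotated = "it was [cough] hard to [hmm] swallow [gasp]"
asr_text = strip_paralinguistic(annotated)  # plain words only, e.g. for ASR
events = list_events(annotated)             # kept separately, e.g. for TTS
```

With this kind of annotation, the same transcription can serve an ASR experiment (tags stripped) and a TTS experiment (tags kept) without being redone.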

4. Issues with wild data in general

When a speaker lists the names of their doctors or relatives, one cannot really be sure how to write them. This issue is less application-dependent, but it is a really annoying feature of languages whose spelling is not phonetic, since mapping the acoustics to the actual graphemes is not straightforward. In my mother tongue, Hungarian, this is largely only a problem with very traditionally written names: for example, we would usually write “Kovacs”, but “Kovats”, and maybe sometimes “Kovach”, are accepted spellings, and similarly with “Grabovszky” or “Grabovszki”. In English, “Wasovschki” could be basically anything, from a mishearing such as “was how ski” or “razor ski” to simply “Wasovsky”.
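If the exact spelling cannot be recovered, one pragmatic option is to at least keep it consistent across recordings. Here is a small sketch of that, assuming a hand-made table of spelling variants; the table, the choice of canonical forms and the function name are hypothetical and only reuse the examples above.

```python
# Hypothetical variant table built from the examples in the text. Which
# spelling counts as canonical is an arbitrary choice made for illustration.
NAME_VARIANTS = {
    "Kovacs": {"Kovacs", "Kovats", "Kovach"},
    "Grabovszky": {"Grabovszky", "Grabovszki"},
}

# Invert the table into a case-insensitive lookup: variant -> canonical form.
CANONICAL = {
    variant.lower(): canonical
    for canonical, variants in NAME_VARIANTS.items()
    for variant in variants
}


def normalise_names(text: str) -> str:
    """Replace known spelling variants with their canonical form.

    Words are split on whitespace only, so a name with punctuation attached
    (e.g. "Kovats,") is left untouched in this simple sketch.
    """
    return " ".join(CANONICAL.get(word.lower(), word) for word in text.split())


normalised = normalise_names("doctor Kovats and doctor Grabovszki")
```

This does not tell you which spelling is right, it only makes sure the same person is written the same way everywhere.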

Again, because of the wild nature of the data, you get things like a speaker pronouncing the name of a brand in two different ways, which is a completely natural thing to do in a conversation. Imagine asking your English teacher how to spell “mcdonalds” and the answer being “mcdonalds”, but with the correct pronunciation. These kinds of problems can be alleviated with phonemic transcriptions, but you really have to decide for yourself: is it the intent or the acoustics that counts? I would say that for pathological speech the intent is more important.

5. Some ASRs make you think really hard

Some baseline ASR results make you think really hard about what was actually said, because they often provide beautiful, equally valid transcriptions. Long-term contextual cues still seem to be a big weakness, though, as do very specialised terminology and grammar. While I cannot blame the baseline ASR for not getting “hemiglossectomy” and “glossectomy” right, I have to say that specialised vocabulary issues were by far the most common. Other times, however, seeing the baseline transcription leads to a lot of pondering over what was actually said.
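To find the spots worth pondering over, it helps to list only the words where the manual transcription and the baseline disagree. Below is a minimal sketch using difflib from the Python standard library. The function name is my own, and the two sentences are invented for illustration; they are not real recordings or real ASR output.

```python
from difflib import SequenceMatcher


def disagreements(manual: str, baseline: str) -> list[tuple[str, str]]:
    """Return (manual, baseline) word spans where the transcriptions differ."""
    ref, hyp = manual.split(), baseline.split()
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    diffs = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # "replace", "delete" or "insert"
            diffs.append((" ".join(ref[i1:i2]), " ".join(hyp[j1:j2])))
    return diffs


# Invented example: a specialised term split into more common-looking pieces.
manual = "he had a hemiglossectomy last year"
baseline = "he had a hemi gloss ectomy last year"
diffs = disagreements(manual, baseline)
```

An empty string on either side of a pair means that words were inserted or deleted rather than substituted.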