Text-to-speech with R

Computers have started talking to us! They do this with so-called Text-to-Speech (TTS) systems. With neural nets, deep learning and lots of training data, these systems have gotten a whole lot better in recent years. In some cases, they are so good that you can’t distinguish between a human and a machine voice.

In one of our recent codecentric.AI videos, we compared different Text-to-Speech systems (the video is in German, but the text snippets and voice recordings we show in it are a mix of German and English). In this video, we held a small contest between Polly, Alexa, Siri and Co. to find out which one pronounces different tongue twisters best.

Here, I want to find out what’s possible with R and Text-to-Speech packages.

PS: In a second post I also tried the googleLanguageR package - with much better results!

How does TTS work?

The main challenge for good TTS systems is the complexity of human language: we intone words differently depending on where they are in a sentence, what we want to convey with that sentence, what mood we are in, and so on. AI-based TTS systems can take phonemes and intonation into account.

There are different ways to produce speech artificially. A very important method is Unit Selection synthesis. With this method, text is first normalized and divided into smaller entities that represent sentences, words, syllables, phonemes, etc. The structure (e.g. the pronunciation) of these entities is then learned in context. We call this part Natural Language Processing (NLP). Usually, the learned segments are stored in a database (either as human voice recordings or synthetically generated) that can be searched to find suitable speech parts (Unit Selection). This search is often done with decision trees, neural nets or Hidden Markov Models.
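The normalization and splitting step described above can be sketched in a few lines of base R. This is only a toy illustration, not how any particular TTS engine does it: `normalize_text()` and `split_units()` are hypothetical helper names, the replacement rules are a tiny stand-in for the large lookup tables a real system would use, and the sketch stops at the word level instead of going down to syllables or phonemes.

```r
# Toy sketch of text normalization and unit splitting (base R only).
# normalize_text() and split_units() are hypothetical helpers for illustration.

normalize_text <- function(text) {
  text <- tolower(text)
  # spell out a symbol; real systems also expand numbers, dates, abbreviations
  text <- gsub("&", " and ", text, fixed = TRUE)
  # collapse runs of whitespace into a single space
  text <- gsub("\\s+", " ", text)
  trimws(text)
}

split_units <- function(text) {
  # sentences: split at whitespace that follows ., ! or ?
  sentences <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
  # words: drop punctuation, then split each sentence on whitespace
  words <- lapply(sentences, function(s) {
    s <- gsub("[[:punct:]]", "", s)
    unlist(strsplit(s, "\\s+"))
  })
  list(sentences = sentences, words = words)
}

units <- split_units(normalize_text("Computers talk & listen.  They do this with TTS!"))
units$sentences
units$words
```

In a Unit Selection system, each of these units would then be looked up in the database of stored speech segments.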

If the speech is generated synthetically by a computer, this is called formant synthesis. It offers more flexibility because the collection of words isn’t limited to what has been pre-recorded by a human. Even imaginary or new words can easily be produced, and the voices can be readily exchanged. Until recently, such synthetic voices did not sound anything like a recorded human voice; you could definitely hear that they were “fake”. Most TTS systems today still suffer from this, but that is changing: there are already a few artificial TTS systems that do sound very human.

What TTS systems are there?

We already find TTS systems in many digital devices, like computers, smartphones, etc. Most of the “big players” offer TTS-as-a-service, but there are also many “smaller” and free programs for TTS. Many can be downloaded as software, used from a web browser or accessed via an API. Here is an incomplete list:

Microsoft/Windows: includes Narrator and Microsoft Speech API
Mac: VoiceOver
Linux: different software can be installed, e.g. eSpeak
IBM Watson
Google Cloud
Microsoft Azure
Amazon Alexa
Siri on iPhone
Polly on Amazon AWS
Microsoft Cortana
FreeTTS
iSpeech
Natural Readers
Balabolka
Panopreter
text2speech.org
text-to-speech-translator.paralink.com/

Text-to-Speech in R

The only package for TTS I found was Rtts. It doesn’t seem very comprehensive, but it does the job of converting text to speech. The only API that works right now is ITRI (http://tts.itri.org.tw), and it only supports English and Chinese.

Let’s try it out!

library(Rtts)

## Loading required package: RCurl

## Loading required package: bitops

Here, I’ll be using a quote from Douglas Adams’ The Hitchhiker’s Guide to the Galaxy:

content <- "A common mistake that people make when trying to design something completely foolproof is to underestimate the ingenuity of complete fools."

The main TTS function is tts_ITRI() and I’m going to loop over the different voice options.

# voice options to loop over
speakers <- c("Bruce", "Theresa", "Angela", "MCHEN_Bruce", "MCHEN_Joddess", "ENG_Bob", "ENG_Alice", "ENG_Tracy")
# write one MP3 file per voice
lapply(speakers, function(x) tts_ITRI(content, speaker = x,
         destfile = paste0("audio_tts_", x, ".mp3")))

I uploaded the results to Soundcloud for you to hear:

audio-tts-bruce
audio-tts-theresa
audio-tts-angela
audio-tts-mchen-bruce
audio-tts-mchen-joddess
audio-tts-eng-bob
audio-tts-eng-alice
audio-tts-eng-tracy

As you can hear, it sounds quite wonky. There are many better alternatives out there, but most of them aren’t free and/or can’t be used (as easily) from R. Noam Ross tried IBM Watson’s TTS API in this post, which would be a very good solution. Or you could access the Google Cloud API from within R.

The most convenient solution for me was to use eSpeak from the command line. The output sounds relatively good, and eSpeak is free and offers many languages and voices with lots of parameters to tweak. This is how you would produce audio from text with eSpeak:

English US

espeak -v english-us -s 150 -w '/Users/shiringlander/Documents/Github/audio_tts_espeak_en_us.wav' "A common mistake that people make when trying to design something completely foolproof is to underestimate the ingenuity of complete fools."

Just for fun: English Scottish

espeak -v en-scottish -s 150 -w '/Users/shiringlander/Documents/Github/audio_tts_espeak_en-scottish.wav' "A common mistake that people make when trying to design something completely foolproof is to underestimate the ingenuity of complete fools."

Even funnier: German

espeak -v german -s 150 -w '/Users/shiringlander/Documents/Github/audio_tts_espeak_german.wav' "A common mistake that people make when trying to design something completely foolproof is to underestimate the ingenuity of complete fools."
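If you would rather stay inside R, the same commands could be issued with `system2()`. The following is a minimal sketch under a few assumptions: `espeak_args()` and `espeak_tts()` are hypothetical helpers (not part of any package), the flags `-v`, `-s` and `-w` are the ones used in the commands above, and the files are written to a temporary directory instead of my local path. If eSpeak is not installed, the sketch only prints a message.

```r
# Sketch: call the eSpeak command-line tool from R via system2().
# espeak_args() and espeak_tts() are hypothetical helpers for illustration.

espeak_args <- function(text, voice, destfile, speed = 150) {
  c("-v", voice,                 # voice / language
    "-s", as.character(speed),   # speed in words per minute
    "-w", shQuote(destfile),     # write output to this WAV file
    shQuote(text))               # quoted so the shell passes it as one argument
}

espeak_tts <- function(text, voice, destfile, speed = 150) {
  # skip gracefully if the espeak binary is not on the PATH
  if (!nzchar(Sys.which("espeak"))) {
    message("espeak not found on PATH; skipping voice ", voice)
    return(invisible(NULL))
  }
  invisible(system2("espeak", args = espeak_args(text, voice, destfile, speed)))
}

content <- "A common mistake that people make when trying to design something completely foolproof is to underestimate the ingenuity of complete fools."

# the three voices used in the commands above
voices <- c("english-us", "en-scottish", "german")
for (v in voices) {
  out <- file.path(tempdir(), paste0("audio_tts_espeak_", v, ".wav"))
  espeak_tts(content, voice = v, destfile = out)
}
```

Keeping the argument construction in its own function makes it easy to inspect the exact command before anything is executed.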

The playlist contains all audio files I generated in this post.

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.5
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] Rtts_0.3.3      RCurl_1.95-4.10 bitops_1.0-6   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17    bookdown_0.7    digest_0.6.15   rprojroot_1.3-2
##  [5] backports_1.1.2 magrittr_1.5    evaluate_0.10.1 blogdown_0.6   
##  [9] stringi_1.2.3   rmarkdown_1.10  tools_3.5.0     stringr_1.3.1  
## [13] xfun_0.2        yaml_2.1.19     compiler_3.5.0  htmltools_0.3.6
## [17] knitr_1.20

Dr. Shirin Elsinghorst