Overview
This application can use Amazon Polly and Azure Speech for the actual audio generation, and it can also take studio audio files and use them for a natural voice. It uses “character styles” to indicate whether a character or the “narrator” is speaking. Paragraph styles can be used for special cases, for example adding pauses before and after the text of a “Chapter.” Nuances such as raising the voice, speaking faster, or emphasis are achieved by applying non-style modifiers such as bold or italic.
Special effects can be added using SoX syntax through highlighting. For example, yellow highlight is set to add an echo effect to the current speaker’s voice.
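As a minimal sketch of how a highlight color could map to a SoX effect, the snippet below builds a SoX command line from a lookup table. The table name, the function name, and the specific echo values are assumptions for illustration only; the values follow the gain-in, gain-out, delay, decay order of the SoX echo effect, not A.B. Genesis's actual settings.

```python
# Hypothetical highlight-to-effect table (not the application's real config).
# SoX echo arguments: gain-in, gain-out, delay in milliseconds, decay.
HIGHLIGHT_EFFECTS = {
    "yellow": ["echo", "0.8", "0.88", "60", "0.4"],
}


def sox_command(highlight, source, target):
    """Build the SoX argument list that applies the effect tied to a highlight."""
    effect = HIGHLIGHT_EFFECTS[highlight]
    return ["sox", source, target, *effect]


command = sox_command("yellow", "speaker.wav", "speaker_echo.wav")
```

If SoX is installed, the resulting list could be passed to subprocess.run(command, check=True) to produce the processed file.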
Polly and Azure use standard XML markers that need to be embedded inside a <speak>WHAT?</speak> document. Each call can use only one “voice,” so most of the complexity of the work lies in building a series of statements that can be read by ONE voice. Typically, character switches are the most likely to cause a voice change, while character modifiers can be added and embedded.
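The one-voice-per-call constraint can be illustrated with a short sketch that merges consecutive runs of text sharing a voice into a single <speak> document. This is an assumption about how such grouping might be done, not the application's actual code; the function name is hypothetical, and "Joanna" is used here only as a stand-in narrator voice.

```python
from itertools import groupby
from xml.sax.saxutils import escape


def build_requests(runs):
    """Merge consecutive (voice, text) runs with the same voice.

    Returns one (ssml, voice) pair per request, mirroring the
    one-voice-per-call constraint.
    """
    requests = []
    for voice, group in groupby(runs, key=lambda run: run[0]):
        # Escape raw text so characters such as & or < stay valid XML.
        text = " ".join(escape(chunk) for _, chunk in group)
        requests.append((f"<speak>{text}</speak>", voice))
    return requests


runs = [
    ("Joanna", "The door opened."),
    ("Joanna", "Nobody moved."),
    ("Matthew", "I know… But, just for the principle…"),
    ("Joanna", "He sat down."),
]
requests = build_requests(runs)
```

Here the two opening narrator runs collapse into one request, while the switch to Matthew and back forces two more.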
Speech Generation
Voice tags range from simple to complex, perform one task or many, and can take two different types of value. For example, volume can be set as an enumeration (loud, moderate…) or in decibels (+6dB). Although Polly and Azure use SSML to create the audio resulting from the instructions, A.B. Genesis uses more human, writing-oriented terminology to allow the construction of rich audio. Here is an example from my logs (format: ‘(’ <text to send>, <voice to use> ‘)’):
(<prosody pitch="-20%"><prosody rate="100%"><amazon:effect vocal-tract-length="+10%">I know… But, just for the principle… </amazon:effect></prosody></prosody>, Matthew)
Raw SSML like this is hard to use directly. Therefore, a mapping is required to ease the projection of human notions onto the Amazon Polly language.
[ssml]
emphasis=emphasis,level,strong,false
break=break,strength,medium,true
delay=break,time,1s,true
lang=lang,xml:lang,fr-FR,false
pitch=prosody,pitch,+20%,false
rate=prosody,rate,-10%,false
volume=prosody,volume,50dB,false
loudness=prosody,volume,loud,false
size=amazon:effect,vocal-tract-length,101%,false
phonation=amazon:effect,phonation,soft,false
The format is: A.B. Genesis verb = Polly tag, attribute name, default value (used to create test statements), and a flag that is true when the element is placed before or after the text (as with break) and false when it encloses the text.
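The mapping format can be sketched as a small parser plus a function that turns a verb into markup. This is only an illustration under assumptions: the function names are hypothetical, and placing a non-enclosing element before the text (rather than after) is an arbitrary choice made for the example.

```python
from xml.sax.saxutils import escape, quoteattr

# A subset of the [ssml] section shown above.
SSML_SECTION = """\
[ssml]
emphasis=emphasis,level,strong,false
delay=break,time,1s,true
pitch=prosody,pitch,+20%,false
rate=prosody,rate,-10%,false
"""


def parse_mapping(section):
    """Parse 'verb=tag,attribute,default,flag' lines into a dictionary."""
    mapping = {}
    for line in section.splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue  # skip blank lines and the section header
        verb, spec = line.split("=", 1)
        tag, attribute, default, flag = spec.split(",")
        mapping[verb] = (tag, attribute, default, flag == "true")
    return mapping


def apply_verb(mapping, verb, inner, value=None):
    """Wrap already-escaped markup with the tag mapped to a verb."""
    tag, attribute, default, standalone = mapping[verb]
    attr = f"{attribute}={quoteattr(value or default)}"
    if standalone:
        # Assumption: the standalone element goes before the text.
        return f"<{tag} {attr}/>{inner}"
    return f"<{tag} {attr}>{inner}</{tag}>"


mapping = parse_mapping(SSML_SECTION)
inner = apply_verb(mapping, "rate", escape("I know…"), "100%")
nested = apply_verb(mapping, "pitch", inner, "-20%")
```

Because apply_verb accepts markup that is already escaped, verbs can be nested, which reproduces the layered prosody elements seen in the log excerpt.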
Polly also makes it possible to retrieve the available voices for various languages and locales. Currently en-US is used by default.
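A minimal sketch of retrieving voices for one locale is shown below. It assumes a boto3 Polly client is passed in and relies on its describe_voices call with the LanguageCode and NextToken parameters; the helper name and the choice to return only the voice id and gender are assumptions for illustration.

```python
def list_voices(polly_client, language_code="en-US"):
    """Return (voice id, gender) pairs for one locale, following pagination."""
    voices = []
    kwargs = {"LanguageCode": language_code}
    while True:
        response = polly_client.describe_voices(**kwargs)
        voices.extend((voice["Id"], voice["Gender"]) for voice in response["Voices"])
        token = response.get("NextToken")
        if not token:
            return voices
        # More results are available; request the next page.
        kwargs["NextToken"] = token
```

With AWS credentials configured, this could be called as list_voices(boto3.client("polly")). Passing the client in keeps the helper easy to exercise with a stand-in object.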
What is a Neural Voice?
For all intents and purposes, a voice is an atomic element that can also achieve some modulation. Here are the main ones:
| | Azure | Polly |
| Rate | Yes | Yes |
| Pitch | Yes | Yes |
| Volume | Yes (in ratio) | Yes (in dB) |
| Styles | Yes (a few voices) | No |
| WPM | Yes (many voices) | No |
| Pauses and breaks | Yes | Yes |
| Locale | Yes | Yes |
| Gender | Yes | Yes |
Styles
Styles are fairly new and allow selecting a variant of a voice (sad, cheerful, newscaster…). They are available only on a few voices. To compensate, A.B. Genesis implements alternatives for styles that a voice does not support.
Example: the style “whispering” can be somewhat mimicked by lowering the volume by 20% and slightly accelerating the pace. It’s not perfect, but it does the job. This suggests that voices with styles should be reserved for main characters or colorful ones.
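The fallback idea can be sketched as follows. This is an assumption about one possible shape, not the application's implementation: the table and function names are hypothetical, the +10% rate is an arbitrary stand-in for "slightly accelerating," and a percentage volume is assumed to be accepted by the target engine (Polly, for instance, expresses volume in dB). The native branch uses Azure's mstts:express-as element.

```python
from xml.sax.saxutils import escape

# Hypothetical fallback table: style name -> prosody attributes.
STYLE_FALLBACKS = {
    "whispering": {"volume": "-20%", "rate": "+10%"},
}


def render_style(style, text, native_styles):
    """Use the voice's native style when available, else a prosody fallback."""
    body = escape(text)
    if style in native_styles:
        return f'<mstts:express-as style="{style}">{body}</mstts:express-as>'
    fallback = STYLE_FALLBACKS.get(style)
    if fallback is None:
        return body  # no known approximation; leave the text unstyled
    attrs = " ".join(f'{name}="{value}"' for name, value in fallback.items())
    return f"<prosody {attrs}>{body}</prosody>"


approximated = render_style("whispering", "Stay quiet.", native_styles=set())
```

A voice that lists “whispering” among its styles would take the first branch, so the approximation is only used when nothing better is available.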
Managing Voices
Voices are managed through the “Voices” tab.

Voices in gray are new voices that do not yet have a local sample file; sample files speed up finding the “right voice” for a character. Voices in green support styles. Hovering the mouse over a voice shows the list of all the styles it supports. Right-clicking on a voice offers to create a sample or to create a character from that voice.