Text to speech with Amazon Polly

Amazon Polly is an AWS service that converts text into lifelike speech. Polly currently supports a wide variety of languages, with both female and male voices. It can be integrated into applications such as newsreaders, games, eLearning platforms, and accessibility applications for visually impaired people.

Some of its benefits are as follows.

Natural sounding voices
Store and re-distribute speech
Real-time streaming
Customize and control speech output
Low cost

Amazon Polly components

Input Text

The input text can be provided as plain text or in Speech Synthesis Markup Language (SSML) format. With SSML, speech can be controlled with respect to pronunciation, pitch, speech rate, and so on.

Available voices

Amazon Polly offers a variety of voices across different languages, including both female and male voices. A voice must be specified along with the input text in order to produce the audio stream.

Output format

Polly provides synthesized speech in multiple formats. Web and mobile applications can request speech in MP3 or Ogg Vorbis format, while IoT devices and telephony solutions can request it in PCM format.
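The three components above map onto a single synthesis request. The following is a minimal sketch in Python, assuming the boto3 SDK's Polly client and its synthesize_speech call. The helper names build_request and synthesize_to_file, the default voice Joanna, and the output path are illustrative choices, not something Polly prescribes.

```python
# Audio formats mentioned in this article, written as Polly's API identifiers.
AUDIO_FORMATS = {"mp3", "ogg_vorbis", "pcm"}


def build_request(text, voice_id="Joanna", output_format="mp3", ssml=False):
    """Assemble keyword arguments for the Polly client's synthesize_speech call.

    build_request is a hypothetical helper; the dictionary keys are the
    parameter names that synthesize_speech accepts.
    """
    if output_format not in AUDIO_FORMATS:
        raise ValueError(f"unsupported output format: {output_format}")
    return {
        "Text": text,                            # input text
        "TextType": "ssml" if ssml else "text",  # plain text or SSML
        "VoiceId": voice_id,                     # voice
        "OutputFormat": output_format,           # output format
    }


def synthesize_to_file(client, path, **request):
    """Send the request and write the returned audio stream to path."""
    response = client.synthesize_speech(**request)
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())
    return path
```

With AWS credentials configured, a call might look like synthesize_to_file(boto3.client("polly"), "hello.mp3", **build_request("Hello from Polly")).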

Voices in Amazon Polly

Polly provides a variety of voices across multiple languages. Polly also provides a special type of voice called a bilingual voice. A bilingual voice is fluent in more than one language, so it can speak words and phrases in both languages. There is only one bilingual voice, Aditi, which speaks both English and Hindi. Different voices speak at different speeds. We can also change the speed deliberately with the SSML <prosody> tag.

There are two speed options.

Preset modes: x-slow, slow, medium, fast, and x-fast.

n% of speech rate: Any percentage between 20% and 200% can be used.
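As a small illustration of the two speed options, here is a sketch of a Python helper that wraps text in a <prosody> tag. The name prosody_rate is made up for this example, and the 20 to 200 check simply mirrors the range stated above.

```python
from xml.sax.saxutils import escape

PRESET_RATES = {"x-slow", "slow", "medium", "fast", "x-fast"}


def prosody_rate(text, rate):
    """Wrap text in <speak><prosody rate="..."> using a preset or a percentage."""
    if isinstance(rate, str):
        # Preset mode: the value must be one of the named speeds.
        if rate not in PRESET_RATES:
            raise ValueError(f"unknown preset rate: {rate}")
        value = rate
    else:
        # Percentage mode: a number that becomes, for example, "50%".
        if not 20 <= rate <= 200:
            raise ValueError("percentage rate must be between 20 and 200")
        value = f"{rate}%"
    # escape() protects characters such as & and < inside the spoken text.
    return f'<speak><prosody rate="{value}">{escape(text)}</prosody></speak>'


print(prosody_rate("Slow and steady.", "slow"))
print(prosody_rate("Half speed.", 50))
```

The returned string can be sent to Polly as SSML input.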

SSML

SSML provides additional control over the text that is converted into speech. It offers the following options.

Long pause with text

Change the pitch and speech rate

Emphasizing specific words and phrases

Phonetic pronunciation

Breathing sounds

Whispering

Newscaster speaking style

All SSML-enhanced text must be enclosed in a <speak> tag.

I. Long pause with text
The <break> tag can be used to add a long pause. A pause can be specified with one of two attributes, strength or time. The strength attribute accepts none, x-weak, weak, medium, strong, and x-strong. The default strength is medium.

<break strength="medium"/>

The time attribute can be given in seconds or milliseconds.

<break time="3s"/> or <break time="100ms"/>
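To show how a pause fits into a full SSML document, here is a minimal Python sketch. The helper break_tag is hypothetical. Accepting only one of the two attributes at a time, and only whole-number times, are simplifications of this helper rather than rules taken from Polly.

```python
import re

BREAK_STRENGTHS = ("none", "x-weak", "weak", "medium", "strong", "x-strong")


def break_tag(strength=None, time=None):
    """Return a self-closing <break/> tag using either strength or time."""
    if strength is not None and time is not None:
        raise ValueError("give either strength or time, not both")
    if time is not None:
        # This helper accepts whole numbers only, e.g. "3s" or "100ms".
        if not re.fullmatch(r"\d+(ms|s)", time):
            raise ValueError(f"invalid break time: {time}")
        return f'<break time="{time}"/>'
    strength = strength or "medium"  # medium is the default strength
    if strength not in BREAK_STRENGTHS:
        raise ValueError(f"unknown break strength: {strength}")
    return f'<break strength="{strength}"/>'


ssml = f"<speak>Mary had a little lamb {break_tag(time='3s')} whose fleece was white as snow.</speak>"
print(ssml)
```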

II. Emphasizing words.

Emphasis changes the rate and volume of the speech. With strong emphasis, text is spoken louder and slower. Emphasis is controlled by the level attribute, which has three levels: strong, moderate, and reduced. <emphasis level="strong">text</emphasis>
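The emphasis tag wraps only the words that should stand out. The sketch below uses a hypothetical helper named emphasize; defaulting to the moderate level is a choice made for this example.

```python
from xml.sax.saxutils import escape

EMPHASIS_LEVELS = ("strong", "moderate", "reduced")


def emphasize(text, level="moderate"):
    """Wrap text in an <emphasis> tag with one of the three levels."""
    if level not in EMPHASIS_LEVELS:
        raise ValueError(f"unknown emphasis level: {level}")
    return f'<emphasis level="{level}">{escape(text)}</emphasis>'


# Only the word "really" is emphasized; the rest is spoken normally.
ssml = f"<speak>I {emphasize('really', 'strong')} like that bag.</speak>"
print(ssml)
```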

III. Breathing sounds
Breathing sounds make the speech more natural and lifelike.
The <amazon:breath> and <amazon:auto-breaths> tags provide breathing effects for speech. There are three breathing options.

Manual mode: Manually set the location, volume and length of the breath sound within the text.

Automated mode: Amazon Polly adds the breath within the text.

Mixed mode: Mix of the above two modes
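The three modes can be sketched as SSML strings built in Python. The helper names manual_breath and auto_breaths are made up for this example. The volume and duration values shown are assumptions based on my reading of the Polly SSML documentation, and the mixed example assumes that a manual breath may be nested inside the automatic tag.

```python
from xml.sax.saxutils import escape


def manual_breath(volume="default", duration="default"):
    """Manual mode: one breath placed exactly where this tag appears."""
    return f'<amazon:breath duration="{duration}" volume="{volume}"/>'


def auto_breaths(inner_ssml):
    """Automated mode: Polly decides where to breathe inside the wrapped text."""
    return f"<amazon:auto-breaths>{inner_ssml}</amazon:auto-breaths>"


manual = f"<speak>Sometimes you want {manual_breath('x-loud', 'medium')} a single breath.</speak>"
automated = f"<speak>{auto_breaths(escape('Polly adds breaths to this longer passage.'))}</speak>"
# Mixed mode: a manual breath inside a passage that also gets automatic breaths.
mixed = f"<speak>{auto_breaths('Let Polly breathe here, ' + manual_breath() + ' and add one more.')}</speak>"

for example in (manual, automated, mixed):
    print(example)
```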

IV. Whispering

With the <amazon:effect name="whispered"> tag, text is whispered rather than spoken normally. The whispering speed can be controlled with a <prosody> tag nested inside the whisper tag.

<speak>
Normal speech is like this
<amazon:effect name="whispered">
<prosody rate="-10%">But the whispering sound is like this</prosody>
</amazon:effect>
</speak>

V. Newscaster style
The newscaster style is available only for the Matthew and Joanna voices, which are available only in American English (en-US) with the Neural engine. The newscaster tag can be added as follows.

<amazon:domain name="news">text</amazon:domain>

VI. Conversational speaking style
The conversational speaking style is also available only for the Matthew and Joanna voices. This option gives the speech a lifelike conversational effect.
<amazon:domain name="conversational">text</amazon:domain>
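Because both speaking styles depend on the Neural engine, a request that uses them has to select that engine. The following is a minimal sketch, assuming boto3's synthesize_speech parameters; styled_request is a hypothetical helper that only builds the arguments and does not call AWS.

```python
from xml.sax.saxutils import escape

SPEAKING_STYLES = ("news", "conversational")


def styled_request(text, style, voice_id="Joanna", output_format="mp3"):
    """Build synthesize_speech arguments for a newscaster or conversational style."""
    if style not in SPEAKING_STYLES:
        raise ValueError(f"unknown speaking style: {style}")
    ssml = (
        f'<speak><amazon:domain name="{style}">'
        f"{escape(text)}</amazon:domain></speak>"
    )
    return {
        "Engine": "neural",   # the styles described above need the Neural engine
        "TextType": "ssml",   # the domain tag is SSML, not plain text
        "Text": ssml,
        "VoiceId": voice_id,  # Matthew or Joanna, per the text above
        "OutputFormat": output_format,
    }


print(styled_request("Here are today's headlines.", "news")["Text"])
```

The resulting dictionary could be passed to a Polly client as client.synthesize_speech(**styled_request(...)).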

Lexicons
Lexicons enable you to customize the pronunciation of words. A lexicon is stored in a particular region and can be used only within that region. For instance, when the text contains W3C but we want it read as World Wide Web Consortium, we can apply a lexicon.

<lexeme>
<grapheme>W3C</grapheme>
<alias>World Wide Web Consortium</alias>
</lexeme>

Here, alias is the text that we want read in place of W3C. Whenever Polly comes across W3C, it is read as World Wide Web Consortium. There is a limit of five lexicons per speech request.
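A lexeme on its own is not a complete lexicon file. Polly expects a document in the W3C Pronunciation Lexicon Specification (PLS) format. The sketch below builds such a document in Python and uploads it with boto3's put_lexicon call. The helper names build_lexicon and upload_lexicon and the lexicon name used in the note afterwards are assumptions made for illustration.

```python
from xml.sax.saxutils import escape

PLS_NAMESPACE = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(aliases, lang="en-US"):
    """Return a PLS document with one <lexeme> per grapheme/alias pair."""
    lexemes = "".join(
        "  <lexeme>\n"
        f"    <grapheme>{escape(grapheme)}</grapheme>\n"
        f"    <alias>{escape(alias)}</alias>\n"
        "  </lexeme>\n"
        for grapheme, alias in aliases.items()
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<lexicon version="1.0" xmlns="{PLS_NAMESPACE}" '
        f'alphabet="ipa" xml:lang="{lang}">\n'
        f"{lexemes}"
        "</lexicon>\n"
    )


def upload_lexicon(client, name, aliases):
    """Store the lexicon in the client's region under the given name."""
    client.put_lexicon(Name=name, Content=build_lexicon(aliases))


print(build_lexicon({"W3C": "World Wide Web Consortium"}))
```

After an upload such as upload_lexicon(boto3.client("polly"), "w3c", {"W3C": "World Wide Web Consortium"}), the lexicon can be applied by passing LexiconNames=["w3c"] to synthesize_speech.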

Let’s log in to the AWS Management Console and experience lifelike speech with Polly. To do so, log in to the console and search for the Amazon Polly service.

In the Polly console, input can be given as plain text or SSML. You can try out the different options explained in this article using SSML. When you click the Listen to speech button, you will hear the speech converted from the given text. The console also offers an option to download the speech as an MP3 file.

Amazon Polly comes with SDK and CLI options.

AWS SDKs - The SDKs can be used when integrating Polly into existing applications.
AWS CLI - The CLI can be used to access Polly without writing any code.
