Date: 11 - 8 - 21
Imagine this: It’s Thanksgiving, and you’ve been longing to treat your loved ones to a delicious, home-baked sweet potato pie. You don’t quite remember the recipe, so you grab your smartphone and head over to your favorite cooking blog. As soon as you land on the recipe post, you notice a small audio player at the top of the page. Curious, you click “play,” then a smooth, natural-sounding voice begins reading the recipe to you out loud. At that moment, you realize you can follow the directions by “listening” to the recipe instead of having to glance over at the screen every second. This is all made possible using a simple but powerful technology called “Text-to-Speech.”
Text-to-Speech has been around for decades, but website owners and bloggers have only recently started realizing its immense power and value. This has a lot to do with the fact that people are listening now more than ever, and publishers need a way to effectively meet the demand and expectation of growing listenership. But how does Text-to-Speech work in the first place? What is it, even? We’re happy you asked because this article explains that and much more.
Text-to-Speech, abbreviated as TTS, is a technology that converts digital text to human-like speech. It can take text on a computer or other digital device and read it aloud as natural-sounding audio with a simple click of a button or touch of a finger.
Text-to-Speech has gained immense popularity because of how simple and accessible it is. It works on most (if not all) modern devices, including smartphones, tablets, laptops, and desktops, and can read all kinds of text, from Word and Pages documents to online web pages.
What’s more, it can be a convenience feature or an accessibility tool for children and adults struggling with low vision, blindness, or issues related to focusing on, learning, and reading text printed on a screen.
Let’s say you have a block of text that you want your computer or mobile device to speak aloud. How does it turn the words into ones you can actually hear? Believe it or not, only three stages are involved: converting the text into words, transforming the words to phonemes, and then turning the phonemes to sound.
Here’s a detailed breakdown of what goes on in each phase:
The initial stage of TTS is generally called preprocessing or normalization. It involves preparing the text so the computer can understand it and make fewer mistakes when reading the words aloud.
A special algorithm scans the text and converts numbers, dates, abbreviations, acronyms, punctuation, and special characters into words. Before it can do that, however, it has to resolve ambiguities. For instance, it has to determine whether “1923” means “nineteen twenty-three,” “one thousand nine hundred and twenty-three,” or “one, nine, two, three.”
While this is often an easy feat for humans, computers have to use statistical probability techniques or neural networks to arrive at the most likely interpretation. So, if the word “year” appears in the same sentence as “1923,” it might be reasonable to interpret it as a date and pronounce it as “nineteen twenty-three.”
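To make the “1923” example concrete, here is a minimal Python sketch of context-based number normalization. It is only an illustration: real TTS engines rely on statistical models or neural networks, while this toy version uses a single hand-written rule (the word “year” appearing in the same sentence). Every name in it (`as_year`, `as_cardinal`, `as_digits`, `normalize`, `YEAR_CUES`) is hypothetical and not taken from any particular TTS product.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Assumption: a tiny hand-picked cue list stands in for a statistical model.
YEAR_CUES = {"year"}


def two_digits(n):
    """Spell out a number from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def as_cardinal(n):
    """Spell out 0..9999 as a quantity."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, last = divmod(rest, 100)
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if last or not parts:
        if parts:
            parts.append("and")
        parts.append(two_digits(last))
    return " ".join(parts)


def as_year(n):
    """Spell out a four-digit year as two pairs (toy rule; years such as
    2005 would need special handling in a real system)."""
    first, last = divmod(n, 100)
    if last == 0:
        return two_digits(first) + " hundred"
    if last < 10:
        return two_digits(first) + " oh " + ONES[last]
    return two_digits(first) + " " + two_digits(last)


def as_digits(token):
    """Read a number one digit at a time, as for a PIN or a phone number."""
    return ", ".join(ONES[int(d)] for d in token)


def normalize(sentence):
    """Replace numbers of up to four digits with words, using sentence context."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    looks_like_date = bool(words & YEAR_CUES)

    def replace(match):
        token = match.group()
        if len(token) == 4 and looks_like_date:
            return as_year(int(token))
        return as_cardinal(int(token))

    return re.sub(r"\b\d{1,4}\b", replace, sentence)


print(normalize("The year 1923 was eventful."))
print(normalize("She counted 1923 coins."))
print(as_digits("1923"))
```

The first sentence contains the cue word, so “1923” is read as a date; the second does not, so it is read as a quantity. A production system would weigh many more signals than a single keyword.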
Preprocessing also has to decipher homographs (words that share the exact spelling but have different pronunciations, depending on their meanings). A perfect example of a homograph is the word “read.” It can be pronounced as either “red” or “reed.”
So, a sentence like “I read a story” poses an immediate problem for a speech synthesizer. However, if it can recognize that the preceding text is entirely in the past tense, by looking at past-tense verbs like “I woke up” or “I had breakfast,” it can make an informed guess that “I read [red] a story” is probably correct. Likewise, if the preceding text is “I wake up” or “I have breakfast,” then “I read [reed] a story” would most likely be the correct pronunciation.
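Here is a rough Python sketch of that kind of informed guess. It simply counts past-tense and present-tense cue words in the preceding text and picks the pronunciation of “read” accordingly. The cue lists and the function name `pronounce_read` are made up for illustration; an actual synthesizer would typically use a part-of-speech tagger or a trained model rather than a word list.

```python
import re

# Assumption: small hand-written cue lists, not a real tense detector.
PAST_CUES = {"woke", "had", "was", "were", "went", "did", "yesterday"}
PRESENT_CUES = {"wake", "have", "am", "is", "are", "go", "do", "today"}


def pronounce_read(preceding_text):
    """Guess how to pronounce 'read' from the tense of the preceding text."""
    words = re.findall(r"[a-z']+", preceding_text.lower())
    past_votes = sum(word in PAST_CUES for word in words)
    present_votes = sum(word in PRESENT_CUES for word in words)
    # Ties (including no cues at all) fall back to the present-tense form.
    return "red" if past_votes > present_votes else "reed"


print(pronounce_read("I woke up. I had breakfast."))
print(pronounce_read("I wake up. I have breakfast."))
```

With the past-tense context the sketch picks “red,” and with the present-tense context it picks “reed,” mirroring the two cases described above.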
Now that the system has figured out the words to be spoken, the computer has to convert the words into sound sequences. Since each word can be pronounced differently based on its meaning and context, the computer needs a list of phonemes to understand how to pronounce each word.
Phonemes here are the building blocks of spoken words. For example, “cup” consists of three phonemes: a /k/ sound represented by the letter “c,” a short vowel /ʌ/ represented by the letter “u,” and the /p/ at the end.
The TTS engine matches the combination of letters to the corresponding phonemes to build a phonemic transcript. Because some words have multiple pronunciations, the system must consult specific pre-programmed rules to determine the correct pronunciations.
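A simple way to picture this step is a pronunciation dictionary with a fallback rule table. The Python sketch below is an assumption-laden toy, not how any specific engine works: the three-word `LEXICON`, the one-sound-per-letter `LETTER_RULES`, the IPA-style symbols, and the function `to_phonemes` are all invented for this example.

```python
# Assumption: a tiny hand-written lexicon. Words with several pronunciations
# map to a dict keyed by a context tag produced by the preprocessing stage.
LEXICON = {
    "cup": ["k", "ʌ", "p"],
    "cat": ["k", "æ", "t"],
    "read": {"present": ["r", "iː", "d"], "past": ["r", "ɛ", "d"]},
}

# Assumption: a naive one-letter-one-sound table for words not in the lexicon.
LETTER_RULES = {
    "a": "æ", "b": "b", "c": "k", "d": "d", "e": "ɛ", "f": "f", "g": "g",
    "h": "h", "i": "ɪ", "j": "dʒ", "k": "k", "l": "l", "m": "m", "n": "n",
    "o": "ɒ", "p": "p", "q": "k", "r": "r", "s": "s", "t": "t", "u": "ʌ",
    "v": "v", "w": "w", "x": "ks", "y": "j", "z": "z",
}


def to_phonemes(word, tag=None):
    """Return a list of phonemes for a word, using the lexicon first."""
    entry = LEXICON.get(word.lower())
    if isinstance(entry, dict):
        # Several pronunciations: use the tag, or the first one if it is missing.
        return entry.get(tag, next(iter(entry.values())))
    if entry is not None:
        return entry
    # Unknown word: fall back to letter-by-letter rules, skipping non-letters.
    return [LETTER_RULES[ch] for ch in word.lower() if ch in LETTER_RULES]


print(to_phonemes("cup"))
print(to_phonemes("read", tag="past"))
print(to_phonemes("pub"))
```

Real systems use far larger dictionaries and much smarter letter-to-sound rules, since English spelling rarely maps one letter to one sound.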
In addition to phonemes, the TTS engine identifies intonation: syllables that take a slightly raised or lowered pitch, some extra volume here or there, or a gradually longer duration, like the “but” in “butter.” The text is then converted into a string of intonated phonemes to be turned into sound.
During the final stage, the system uses an acoustic model to read the processed text. A machine-learning algorithm then establishes the connection between the phonemes and sounds to give them accurate intonations.
Following that, the computer uses a sound wave generator to create vocal sound. The frequency characteristics of phrases are eventually loaded into the sound wave generator. These characteristics are usually obtained from recordings of humans saying the phonemes, computer-generated sound frequencies, or an approach that mimics the mechanism of the human voice.
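To give a feel for what a “sound wave generator” does, here is a deliberately simplified Python sketch that turns a list of (frequency, duration) pairs into audio using only the standard library. It illustrates the computer-generated-frequencies idea in its crudest form: each segment is a plain sine tone, which sounds nothing like real speech. The function names `tone` and `render`, the sample rate, and the example frequencies are all assumptions chosen for the demo.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16000  # samples per second (an arbitrary choice for this sketch)


def tone(frequency_hz, duration_s, volume=0.5):
    """Return 16-bit samples of a sine wave at the given pitch and length."""
    count = round(SAMPLE_RATE * duration_s)
    return [
        int(volume * 32767 * math.sin(2 * math.pi * frequency_hz * i / SAMPLE_RATE))
        for i in range(count)
    ]


def render(segments):
    """Turn (frequency_hz, duration_s) pairs into the bytes of a mono WAV file."""
    samples = []
    for frequency_hz, duration_s in segments:
        samples.extend(tone(frequency_hz, duration_s))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))
    return buffer.getvalue()


# Two made-up segments: a lower, longer tone followed by a higher, shorter one.
audio = render([(220, 0.1), (330, 0.05)])
print(len(audio), "bytes of WAV data")
```

Writing `audio` to a file ending in `.wav` would produce something you can play back. Raising the frequency or lengthening the duration of a segment loosely corresponds to the pitch and duration cues described earlier.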
Many TTS systems allow users to choose the type of voice, such as male or female, the language, the speed of playback, etc. Some can also read texts and output them in a human-like manner (with all the intonations and cadences), while some may sound robotic and very dull.
There are many different TTS tools available based on where the technology is needed.
Text-to-Speech technology has become so popular that many people come across it every day without even realizing it – and you probably have, too. That’s expected because as technology gets more advanced, it becomes more challenging to figure out whether you’re listening to a simple recording or Text-to-Speech is at play.
Here are a few places you are likely to encounter Text-to-Speech as you work your way through a typical day:
Text-to-Speech comes built-in in many word processors, like Microsoft Word. Word, in particular, has a “Read Aloud” feature in the “Review” menu that will read the current document aloud if you desire. Google Docs also has Text-to-Speech functions, but you’ll need an add-on to use them.
Accessibility features, like Text-to-Speech, are embedded in almost every type of computer or smartphone on the market. You can enable Narrator in Windows or VoiceOver on a Mac to describe aloud what’s on your screen so you can use that information to navigate your device. Smartphones typically come with voice assistant features that provide spoken feedback to help blind or low-vision users.
Most popular eBook readers, including new Kindle Fire devices, have a Text-to-Speech option. This also includes online readers, such as Internet Archive. When purchasing an eBook for Kindle Fire, you can check if it can be read aloud by looking for the “Text-to-Speech: Enabled” label on its details page before buying it.
Some newer ATMs are equipped with Text-to-Speech functions to provide services for customers who have difficulty reading screens. For example, step-by-step audio helps users withdraw cash, check account balances, and make deposits.
Text-to-Speech is most often seen with smart assistants like Amazon’s Alexa, Apple’s Siri, and Google Assistant. These assistants use Text-to-Speech to provide news and weather updates, issue reminders, and respond to questions and comments. They usually work by tapping into a predetermined library of words and phrases. Smart speakers also use Text-to-Speech technology to perform many of their core functions.
You might have an alarm clock that wakes you up by speaking the time, or perhaps you’ve heard of the feature. In either case, that’s another common application of Text-to-Speech.
Google Maps, Apple Maps, and most other modern GPS software and apps are designed to read turn-by-turn directions aloud using Text-to-Speech technology.
Text-to-Speech has been around for some time now, but it has grown to become an integral part of many applications and technologies we use today, from word processors and virtual assistants to modern ATMs and GPS software. Impressively, Text-to-Speech uses a three-stage process to read textual content aloud: converting the text into words, transforming the words to phonemes, and then turning the phonemes to sound. As artificial intelligence (AI) and other technologies expand what can be accomplished with speech synthesis, Text-to-Speech will inevitably continue to rise and become a must-have feature for businesses trying to find their voice and compete in a digital space.