In recent years, the need to increase engagement of their online audiences has led publishers to explore audio as an alternative consumption channel for their published content.
In 2018, Denise Law, then Head of Strategic Development at The Economist, described how the publication introduced audio editions to combat the so-called ‘unread guilt factor’.
Fast forward to 2023, and the latest report from the Reuters Institute reveals that 72% of publishers plan to put more resources into digital audio.
But how can publishers commit scarce resources to podcasts, or other human-read audio, without compromising on the depth and quality their customers expect from written content?
The advent of AI audio may well provide the answer.
The quality gap
Until recently, the quality of AI voices limited their application to functional uses, such as in AI assistants, satnav and IVR systems.
Since 2018, our mission at BeyondWords has been to close this ‘quality gap’ by improving AI voices specifically for the news and corporate domains. Publishers can now make their written content available through high-precision, authentic audio.
In this post, we’ll outline how BeyondWords overcomes the complexities of text-to-speech and enables content publishers to build and deliver natural-sounding audio experiences.
Overcoming the complexities in text-to-speech
Text-to-speech is challenging due to the complexity of converting ambiguous, non-phonetic text into high-dimensional speech. We address this by focusing our efforts on three areas: text normalization, pronunciation, and acoustic modeling with waveform generation.
Written text encompasses both standard and non-standard words, such as:
- Numbers (18, 45, 1/2, 0.6)
- Dates (2/2/23, 2nd Feb 2023, 2 Feb 23)
- Times (10:25am, 15:35:23, 12pm)
- Currencies ($3.85, €23,34, ¥23 000)
- Units (5’11, -15C, 235 mpg)
- Abbreviations (APPL, EBITDA, IPO)
- Symbols (%, ∞, =), and more.
In news articles and investment reports, non-standard words must be transformed for pronunciation (verbalized) as they cannot be pronounced directly.
For example, acronyms like "IPO" must be pronounced as separate letters: "I-P-O", while others, such as "EBITDA," must be spoken as a word: "ebitda". The symbol "%" must be verbalized as "per cent," and numbers like "18" must be spoken as "eighteen."
Homographs, like "read" (present tense, /riːd/) and "read" (past tense, /rɛd/), also require different verbalizations depending on context.
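As a rough illustration of the "read" case, the Python sketch below picks a pronunciation from the word that precedes it. It is a toy heuristic, not the BeyondWords approach: the cue list and the `pronounce_read` function are hypothetical, and a real system would more likely rely on a part-of-speech tagger or a trained classifier.

```python
# Hypothetical cue words that suggest the past-tense reading of "read".
PAST_CUES = {"have", "has", "had", "was", "were", "been", "already"}


def pronounce_read(words: list[str], index: int) -> str:
    """Return an IPA string for 'read' at words[index], using the previous word as context."""
    previous = words[index - 1].lower() if index > 0 else ""
    # Past-tense cue found: use the /rɛd/ reading; otherwise default to /riːd/.
    return "rɛd" if previous in PAST_CUES else "riːd"


past = pronounce_read("I have read the report".split(), 2)
present = pronounce_read("I want to read it".split(), 3)
```

Here `past` takes the /rɛd/ reading because "have" precedes the word, while `present` falls back to /riːd/. A single preceding word is a weak signal, which is why production systems look at wider context.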
Additionally, specific abbreviations in different sectors must be expanded to ensure audio comprehension, for instance "sen." as "senator" in political news or "con." as "consensus" in investment reports.
We resolve these tricky and ambiguous cases by using advanced text normalization techniques, such as NLP (Natural Language Processing) and semiotic classification, combined with machine learning (ML) and rule-based models to process thousands of articles per day.
Our text normalization system is maintained and improved by an expert team of ML engineers, whilst teams of linguists fine-tune models in each new locale.
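To make the idea of verbalization concrete, here is a minimal rule-based sketch in Python. It is an illustration only, not the BeyondWords system: the lexicons, the `normalize` function and its `domain` parameter are hypothetical, it only handles whole numbers from 0 to 99, and it ignores punctuation attached to tokens.

```python
import re

# Hypothetical lexicons; a production system would curate or learn far larger ones.
SPELLED_ACRONYMS = {"IPO"}  # read letter by letter
DOMAIN_ABBREVIATIONS = {
    "politics": {"sen.": "senator"},
    "finance": {"con.": "consensus"},
}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Verbalize a whole number from 0 to 99."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0-99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text: str, domain: str) -> str:
    """Replace non-standard tokens with a speakable form, one token at a time."""
    abbreviations = DOMAIN_ABBREVIATIONS.get(domain, {})
    spoken = []
    for token in text.split():
        if token.lower() in abbreviations:
            spoken.append(abbreviations[token.lower()])   # sector-specific expansion
        elif token in SPELLED_ACRONYMS:
            spoken.append("-".join(token))                # "IPO" -> "I-P-O"
        elif re.fullmatch(r"\d{1,2}%", token):
            spoken.append(number_to_words(int(token[:-1])) + " per cent")
        elif re.fullmatch(r"\d{1,2}", token):
            spoken.append(number_to_words(int(token)))
        else:
            spoken.append(token)                          # standard word, keep as-is
    return " ".join(spoken)


print(normalize("The IPO rose 18% in 45 days", "finance"))
# prints: The I-P-O rose eighteen per cent in forty-five days
```

Note that the same token can be treated differently by domain: "sen." becomes "senator" only when `domain` is "politics". Hand-written rules like these break down quickly on ambiguous input, which is where the ML-based classification described above comes in.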
Once text is normalized, the next step is to determine the best pronunciation.
We do this by transforming the normalized text into a phonetic representation (a ‘phone sequence’). This step is carried out by custom-built grapheme-to-phoneme (G2P) models together with IPA (International Phonetic Alphabet) lexicons.
Maintaining these lexicons is time-consuming and costly, but they give us precise control over pronunciations. This is useful when, for example, different dialects of a language pronounce the same word differently.
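The interplay between a lexicon and a G2P model can be sketched as a lookup with a fallback. The Python below is a simplified illustration under our own assumptions: the `LEXICONS` entries, the `phonemize` function and the letter-by-letter `naive_g2p` stand-in are hypothetical, and a trained G2P model would replace the fallback in practice.

```python
# Hypothetical per-locale IPA lexicons; entries are illustrative only.
LEXICONS = {
    "en-GB": {"tomato": "təˈmɑːtəʊ"},
    "en-US": {"tomato": "təˈmeɪtoʊ"},
}

# Crude letter-to-phone table used only as a stand-in for a trained G2P model.
NAIVE_LETTER_PHONES = {"a": "æ", "c": "k", "e": "ɛ", "i": "ɪ", "o": "ɒ", "u": "ʌ", "y": "j"}


def naive_g2p(word: str) -> str:
    """Map each letter to a phone; unknown letters pass through unchanged."""
    return "".join(NAIVE_LETTER_PHONES.get(letter, letter) for letter in word)


def phonemize(word: str, locale: str) -> str:
    """Prefer the curated lexicon entry for the locale, else fall back to G2P."""
    key = word.lower()
    lexicon = LEXICONS.get(locale, {})
    return lexicon.get(key) or naive_g2p(key)


british = phonemize("tomato", "en-GB")   # lexicon hit for British English
american = phonemize("tomato", "en-US")  # same word, different dialect entry
unknown = phonemize("bat", "en-GB")      # not in the lexicon, so the fallback runs
```

The lexicon wins whenever it has an entry, which is what gives editors precise control over a dialect's pronunciation. The fallback only handles words the lexicon does not cover.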
Acoustic modeling and waveform generation
Once the text has been transcribed into its phonetic representation, we proceed to predict the speech waveform from the generated phone sequence.
In this step, we follow the popular practice of dividing the task into two sub-tasks: acoustic modeling (using a ‘synthesizer’) and waveform generation (using a ‘vocoder’).
These ML models help predict nuanced characteristics in speech such as prosody (the “rhythm & intonation” of a voice) and timbre (the “unique character” of a voice).
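The two-stage split can be pictured with a deliberately simple Python toy. Everything here is an assumption made for illustration: real acoustic models are neural networks that typically predict spectrogram-like features, and real vocoders are far more sophisticated than the sine generator below. The `Frame` type, the `PHONE_TARGETS` table and both functions are hypothetical stand-ins that only show where each stage sits.

```python
import math
from dataclasses import dataclass

SAMPLE_RATE = 16_000  # audio samples per second


@dataclass
class Frame:
    duration_s: float  # how long the phone lasts (rhythm)
    pitch_hz: float    # fundamental frequency (intonation); 0 means unvoiced


# Hypothetical per-phone targets; a trained model would predict these from context.
PHONE_TARGETS = {
    "h": (0.05, 0.0),
    "ə": (0.08, 120.0),
    "l": (0.07, 130.0),
    "əʊ": (0.15, 110.0),
}


def acoustic_model(phones: list[str]) -> list[Frame]:
    """Stage 1: map a phone sequence to intermediate acoustic features."""
    return [Frame(*PHONE_TARGETS.get(phone, (0.08, 120.0))) for phone in phones]


def vocoder(frames: list[Frame]) -> list[float]:
    """Stage 2: turn acoustic features into raw waveform samples."""
    samples = []
    for frame in frames:
        for i in range(round(frame.duration_s * SAMPLE_RATE)):
            t = i / SAMPLE_RATE
            if frame.pitch_hz > 0:
                samples.append(0.5 * math.sin(2 * math.pi * frame.pitch_hz * t))
            else:
                samples.append(0.0)  # silence stands in for unvoiced sounds
    return samples


frames = acoustic_model(["h", "ə", "l", "əʊ"])
waveform = vocoder(frames)
```

The point of the split is that each stage can be trained and improved separately. The intermediate features carry prosody, while the vocoder is responsible for how the final audio actually sounds.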
Building and delivering retentive audio experiences
BeyondWords provides a cutting-edge solution that goes beyond text-to-speech to deliver the functionality of a fully-featured ‘headless’ audio CMS.
The solution integrates audio hosting, usage analytics, distribution and monetization. It allows publishers to develop dynamic audio strategies that can create unique listening experiences, enhance digital subscription bundles or generate revenue through audio ad integrations (and soon, video).
Research has shown that listeners spend around 4 times longer on the page than readers, an indication that high-quality AI audio is capable of significantly changing the way publishers connect with their audiences.
As voice quality and functionality continue to improve, this format is sure to drive new engagement with news and corporate publishing.
BeyondWords integrates with Eidosmedia
BeyondWords is now available to authors and editors using Eidosmedia editorial applications. In both the desktop and mobile versions, users click to generate the audio version of a written text on the fly, without leaving the editing workspace. A link to the audio content is automatically embedded in the article, enriching reader options with minimal expenditure of time or effort.
Find out more about text-to-speech and the Eidosmedia integrations.