Speechless in the frontend
Leaving your developer comfort zone for edgy features like SpeechSynthesis can become a rollercoaster experience – but it's still worth the effort.

Herbert Braun
Associate Director Frontend
As frontend engineers, we're a few years ahead of our customers. Modern features are often a hard sell, and that's why we were excited when one of our long-term clients agreed to our idea: Include a text-to-speech feature that would read text on their website to the users.
SpeechSynthesis isn't exactly bleeding edge – it has been in Chrome for almost 100 versions (!), and all major browsers have supported it for years. Still, it feels like something new, if only because it's not visual like most of our work.
So, browser support is excellent, speech output sounds very natural, the implementation is dead simple:
speechSynthesis.speak(new SpeechSynthesisUtterance("So simple!"));
Copy this line into the browser console and try for yourself. To use it on a website you only need to connect it to a user action, but that's what we wanted to do anyway: We would include a speak button, filter the interesting part of the page's text content on click, and make it a SpeechSynthesisUtterance, all wrapped in a nice WebComponent. What could possibly go wrong?
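In broad strokes, the plan might look like this minimal sketch (element name, button label, and content selector are our assumptions, not the production code):
class IxSpeech extends HTMLElement {
  connectedCallback() {
    const button = document.createElement("button");
    button.textContent = "Read this page aloud";
    button.addEventListener("click", () => {
      // Filter the interesting part of the page's text content …
      const text = document.querySelector("main")?.textContent ?? "";
      // … and hand it to the browser's speech engine.
      speechSynthesis.speak(new SpeechSynthesisUtterance(text));
    });
    this.append(button);
  }
}
customElements.define("ix-speech", IxSpeech);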
Canon of voices
Our esteemed colleagues from QA didn't take long to produce some problems. What do we do if the user opens another tab and starts speech there, too? We had even tested that during development – but only on Chromium browsers. Chromium silences other tabs as soon as you make a tab talk, but Firefox and Safari don't.
This seemed like a great feature of Chromium – but is it? It means I can control things in another tab, making SpeechSynthesis a kind of superglobal.
Anyway, we had to find a solution, which would involve communicating with the other browser tab. A simple way would be localStorage (sessionStorage doesn't reach across tabs), but we would have to clean up our mess when the tab is closed. Yes, you could listen to beforeunload for that, but MDN does a good job of discouraging you from doing so.
So, let's send a broadcast message saying "Everybody shut up, I want to talk!" We implemented a listener which triggered speechSynthesis.pause(). Instead of a cacophony of voices we didn't hear anything at all, because the sending tab gets a copy of the message, too (BroadcastChannel only excludes the exact channel object that posted, not other listeners in the same tab). Turns out it's not trivial to find out whether a broadcast message came from your own tab! Broadcast messages do have an origin property – but it contains only scheme, host, and port, so two tabs of the same site are indistinguishable (and of course that's exactly what the QA folks would test). In the end we simply checked if the current tab is visible:
const channel = new BroadcastChannel("ix-speech");
channel.addEventListener("message", () => {
  // Another tab wants to talk: go silent, but only if we are
  // actually speaking and no longer the visible tab.
  if (speechSynthesis.speaking && document.hidden) {
    this.open = false;
  }
});
The component's open property contains a setter which changes the button states and stops the talk with speechSynthesis.cancel(). After the speak() command we would send a message like channel.postMessage('speechStarted') – problem 1 solved.
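Put together, the relevant parts of the component might look roughly like this sketch (the button handling is reduced to a comment):
class IxSpeech extends HTMLElement {
  #open = false;
  #channel = new BroadcastChannel("ix-speech");

  get open() {
    return this.#open;
  }

  set open(value) {
    this.#open = value;
    // … change the button states here …
    if (!value) speechSynthesis.cancel(); // stop the talk
  }

  speak(text) {
    speechSynthesis.speak(new SpeechSynthesisUtterance(text));
    this.open = true;
    this.#channel.postMessage("speechStarted"); // "Everybody shut up!"
  }
}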
Too much talk
Accessibility is important to us. But even after years of working with a focus on removing barriers from web apps, you miss some points – in this case, that a text-to-speech function and an active screen reader don't work well together.
Okay, can't we simply hide the trigger button? Nope, aria-hidden on a button is evil. And while ARIA attributes let you talk to assistive technology, there is no reliable, non-hacky way to detect a screen reader (let that sink in). Our technical solution was … to solve the issue on another level: The button is clearly labeled, and we added an ARIA description saying "DO NOT PRESS!" in polite words – so any screen reader user can make the informed choice not to press it. Problem 2 solved (sort of).
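In markup terms, that might look something like this (a sketch – button label and wording are ours, not the real thing):
<button aria-describedby="ix-speech-hint">Read this page aloud</button>
<p id="ix-speech-hint" hidden>
  Starts speech output on this page. If you are already using a screen
  reader, you probably don't want both talking at once.
</p>
(Elements referenced by aria-describedby may be hidden; their text is still exposed as the accessible description.)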
Mobile silence
We assumed that small portable devices that would occasionally be used as telephones might benefit most from text to speech. Alas, this was where things got really ugly.
Some devices (we saw this on iPhone 17) claim to know SpeechSynthesis but remain silent. Why is still a mystery to us (text too long? page too big?) – we hope to find out one day. But we can work with that: Create a dialog warning users that some mobile devices don't do what they promise, show that dialog after waiting a few seconds, and clear the timeout if the browser fires an event that it has started talking – which would be the start event of SpeechSynthesisUtterance.
On some devices the browser did fire start, but followed immediately with an end event. This was our best attempt at limiting the damage of bad browser support:
const showFail = () => {
  // Give up: stop any speech, inform the user, reset the button.
  speechSynthesis.cancel();
  failDialog.showModal();
  this.open = false;
};
const speakingCheckTimeout = setTimeout(showFail, 5000);
let prematureEnd = true;
utter.addEventListener("start", () => {
  // Speech started in time – but only trust it after a second.
  setTimeout(() => (prematureEnd = false), 1000);
  clearTimeout(speakingCheckTimeout);
});
utter.addEventListener("end", () => {
  if (prematureEnd) showFail();
});
We would wait a few seconds for the start event to fire; otherwise we'd stop speechSynthesis and show a <dialog>. The start event also starts a timer that sets prematureEnd to false after a short while. An end event arriving before that also triggers the fail dialog.
Unfortunately, we couldn't catch all browser misbehaviors. Some impostor devices fired start … and then nothing. Maybe they were still converting the text to spoken words? Mobile devices might need a few seconds for that. But even after stretching our timeouts we never heard from them. One would hope for the pending property of speechSynthesis, but I never saw it as anything other than false.
There is one event that guarantees something has actually been said, and that's SpeechSynthesisUtterance's boundary event, which fires after every word. That is … it should fire, but mobile browsers just refuse to. Relying on this event temporarily excluded all mobile devices from our speech feature.
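A boundary-based check might have looked like the following sketch (variable names follow the snippets above, the timeout length is arbitrary) – on browsers that never fire boundary, it flags perfectly working speech as a failure:
const boundaryCheckTimeout = setTimeout(showFail, 8000);
utter.addEventListener("boundary", () => {
  // The only proof that words have actually been spoken.
  clearTimeout(boundaryCheckTimeout);
}, { once: true });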
Just to complete the list of troubles, there were also issues with Android devices that would not resume speaking after pausing. And of course bugfixing was a nightmare: Client behavior differed between devices, browsers, localhost vs. dev server, and we couldn't make the iPhone simulator talk.
So … never again?
The SpeechSynthesis API has some design flaws – for example, it's not clear whether Chrome's behavior concerning speech in different tabs is correct, and no events are fired when the status of properties like speechSynthesis.paused changes (see the sketch below). But the real problem is the implementation, which is the worst I've seen in a very long time – as others have found out, too.
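To illustrate the missing change events: the only way to notice that speechSynthesis.paused has flipped is to poll it (a sketch; the interval is arbitrary):
let wasPaused = speechSynthesis.paused;
setInterval(() => {
  // There is no "pausedchange" event, so we have to keep asking.
  if (speechSynthesis.paused !== wasPaused) {
    wasPaused = speechSynthesis.paused;
    console.log("speechSynthesis.paused is now", wasPaused);
  }
}, 250);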
This feature caused headaches, stress, and several WTF moments. That the implementation mess came as such a surprise to us is also a reminder of how amazingly well things usually work – as a survivor of the 90s' browser wars (document.layers, anyone?) who dealt with "IE vs. the rest of the web" for several years, I remember that things used to be much worse. Web standards are incredibly well supported nowadays – unless you go to their fringes and work with little-used APIs nested deep in the devices' capabilities.
So let's stick to the basic stuff – building accordions and carousels and forms and exchanging data with the backend?
No. This kind of thinking leads to a dead end. This feature was a hard lesson, but a lesson nonetheless. It's important to leave your comfort zone from time to time – just don't be surprised if things get uncomfortable. On the outskirts of web standards, browser support might be worse than you are used to (and worse than caniuse.com suggests). Plan some extra time for the unknown unknowns. And go ahead in the spirit of Progressive Enhancement: It doesn't have to be the same on every device as long as every user gets the best possible experience.
speechSynthesis.speak(new SpeechSynthesisUtterance("Thanks for reading!"));