#3: Launch a SaaS like ElevenLabs or Murf.ai next week with Open Source
Dive into the AI voice generation market, and learn how to build your own with these resources!
Most successful SaaS products weren’t the first of their kind - think Slack, Shopify, Zoom, Dropbox, or HubSpot. They didn’t invent team communication, e-commerce, video conferencing, cloud storage, or marketing tools; they just made them better.
So what are AI voice generation tools?
Voice generation (a.k.a. Text-to-Speech / speech synthesis) is an AI task that turns text into natural-sounding speech. AI voice generators can create realistic voiceovers and dialogue for videos, podcasts, games, IoT devices, and accessibility tools. The more sophisticated ones are multilingual, and will let you clone a voice or adjust speech patterns to match specific tones, emotions, accents, and styles.
Let's look at the market!
Text-to-speech (TTS) systems have been around for decades, but their WALL-E-grade shortcomings limited them to niche enterprise use cases. However, the last few years saw research breakthroughs like WaveNet and Tacotron 2 (Google), which made voices sound natural, while papers like FastSpeech (Microsoft) sped up synthesis. This was followed by advancements in voice cloning and better control over prosody (intonation, pitch, rhythm).
Today, in the post-ChatGPT world, projects like XTTS, StyleTTS2, and OpenVoice have made high-quality, multilingual, customizable AI voices accessible to the long-tail market, opening up possibilities in gaming, entertainment, and more.
Presently, phrases like “ai voice generator”, “text to speech ai”, “voice maker”, and “text to voice” each get between 100k and 1M monthly searches, with medium to low ad competition (source: Google Keyword Planner).
While Big Tech is busy with broad platform APIs, a wave of fresh players is coming up with tailored SaaS across gaming, entertainment, education, and more. ElevenLabs (2022) and Murf.ai (2020) stood out to me as the coolest, with realistic, multilingual, and customizable voices. Priced at about $30/month for creators and $100/month for businesses, they’ve both attracted millions of users.
Alright, so how can we build this quickly?
Modern voice generation pipelines have many moving parts, so I'll break it down step by step without getting too detailed. Starting with the input, the user provides some text, an optional voice sample for cloning, and optional tags to control style and prosody. The text gets turned into phonemes (those pronunciation symbols in dictionaries), the voice sample is used to generate speaker embeddings (a representation of unique vocal features), and the style and prosody tags control emotional tone, pace, intonation, and accent.
The system then runs style and speaker encoding. Style encoding interprets and applies the style tags to the voice (using techniques like style diffusion), while speaker encoding ensures the voice sounds like the provided sample. Finally, speech synthesis combines the phonemes with these encodings to create an acoustic representation of the voice, which is then turned into the output soundwave!
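To make the data flow concrete, here is a toy Python sketch of those stages. Everything in it is a stand-in I made up for illustration: the tiny lexicon, the hash-based "speaker embedding", the pace/pitch "style vector", and the sine-wave "vocoder" are placeholders for the trained models a real project like StyleTTS 2 or XTTS would use. The function names are hypothetical, not anyone's actual API. It only shows which inputs feed which stage.

```python
from dataclasses import dataclass, field
import hashlib
import math

# Toy pronunciation dictionary (hypothetical). Real systems use a trained
# grapheme-to-phoneme model or a full lexicon.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


@dataclass
class VoiceRequest:
    text: str
    voice_sample: bytes = b""  # optional audio used for cloning
    style_tags: dict = field(default_factory=dict)  # e.g. {"pace": 1.2}


def text_to_phonemes(text):
    """Map each word to phonemes; unknown words fall back to their letters."""
    phonemes = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        phonemes.extend(TOY_LEXICON.get(word, list(word.upper())))
    return phonemes


def speaker_embedding(sample, dim=4):
    """Stand-in for a speaker encoder: a deterministic vector from the bytes."""
    digest = hashlib.sha256(sample).digest()
    return [b / 255 for b in digest[:dim]]


def style_vector(tags):
    """Stand-in for a style encoder: pace and pitch with neutral defaults."""
    return {
        "pace": float(tags.get("pace", 1.0)),
        "pitch": float(tags.get("pitch", 1.0)),
    }


def acoustic_frames(phonemes, speaker, style):
    """Build one (frequency_hz, duration_s) frame per phoneme.

    The speaker vector shifts the base pitch; the style vector scales pitch
    and speed. A real model would predict spectrogram frames instead.
    """
    base_hz = 120 + 100 * speaker[0]
    frames = []
    for i, _ in enumerate(phonemes):
        freq = base_hz * style["pitch"] * (1 + 0.05 * (i % 3))
        frames.append((freq, 0.08 / style["pace"]))
    return frames


def vocode(frames, sample_rate=16000):
    """Stand-in for a vocoder: render each frame as a plain sine tone."""
    wave = []
    for freq, duration in frames:
        n = round(sample_rate * duration)
        wave.extend(
            math.sin(2 * math.pi * freq * t / sample_rate) for t in range(n)
        )
    return wave


def synthesize(request):
    """Run the full toy pipeline: text + sample + tags -> waveform samples."""
    phonemes = text_to_phonemes(request.text)
    speaker = speaker_embedding(request.voice_sample)
    style = style_vector(request.style_tags)
    return vocode(acoustic_frames(phonemes, speaker, style))
```

Calling `synthesize(VoiceRequest("Hello, world!", style_tags={"pace": 2.0}))` returns a list of float samples that is half as long as the same request at the default pace. In a real build, you would swap each stand-in for the matching component of whichever open-source model you pick.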
Here are some of the best open source implementations to execute this pipeline:
StyleTTS 2 by Yinghao Aaron Li et al.
OpenVoice by MyShell
CosyVoice 2 by Alibaba Group
XTTS Toolkit by Coqui
Worried about building signups, user management, payments, etc.? Here are my go-to open-source SaaS boilerplates that include everything you need out of the box:
SaaS Boilerplate by Remi Wg
Open SaaS by wasp-lang
How will my SaaS stand out in the noise?
Here are a few strategies that could help you differentiate and achieve product-market fit (based on the pivot principles from The Lean Startup by Eric Ries):
Personalize your UX for a niche audience: Design and personalize your offering for a specific market. This could mean voice generation and translation for educators, content creators, advertisers, or game developers. Alternatively, target specific regions or industries with unique requirements for language and speaking style.
Make this a differentiator for your larger product: You could use this tech to voice-enable an existing product or service. Examples include call center AI, dubbing platforms, voice assistants, podcast editors (more about this in future issues), and more.
Add unique features to increase switching cost: Examples of sticky features are unique language support, industry-specific voices (e.g. NPC speaking styles for gaming), and API access.
Offer platform-level advantages: If you ship a native desktop app that runs locally instead of calling an API, privacy could become a big selling point and justify higher licensing fees.
TMI?
I’m an ex-AI engineer and product lead, so don’t hesitate to reach out with any questions!
P.S. I started this free weekly newsletter to share open-source / turnkey resources for recreating popular products (like this one). If you’re a founder looking to launch your next product without reinventing the wheel, please subscribe :)