Designing Voice customisation features for Murf AI that drove 3x boost in free to paid conversions

This case study explains how building features that helped creators get to the right AI voiceover faster led to a 3x jump in free-to-paid conversions.

Date

January 2025

Role

Product Designer

Company

Murf

What's up with Murf?

Murf is an AI voice infrastructure provider. There are three main verticals at Murf - Text to Speech (TTS) Studio, Murf Dubbing, and API offerings. Each caters to a different customer segment - the ICP for the TTS Studio, for instance, was Learning and Development teams in enterprises. My role at Murf was mostly defining the experience for the TTS Studio.

What were the issues with voiceovers?

For most of our users, the generated voiceovers work right out of the box, needing only minimal manual intervention - getting the pronunciation right for certain proper nouns, dialling in the right speed and pitch - to reach that final, ready-to-publish output.

But for certain use cases these outputs can be monotonous and sound very neutral, and creators had to spend more time customising them, which made the entire process less effective and some workflows redundant. There were also cases where part of a sentence was in a different language and the generated voiceover spoke that part in a non-native accent, which created trouble for users with multilingual scripts.

It was really important that users find the right voiceover quickly, without a lot of manual intervention. To tackle the issues above, we started brainstorming features that could save users time and get them to the right voiceover as fast as possible.


  • Neutral or monotonous outputs

  • Too much manual intervention to get the right output

  • Multilingual scripts not being spoken with the right accents

Part one: Variability

Thought process

From talking to users, we learned that the set of users who really want multiple variations of a voiceover is significant - mostly advertisers and marketers who want to add some quirkiness to their voiceovers.

Our research also showed that users are looking for options - they want to hear both monotonous and expressive voiceovers for their content. The decision to pick an expressive voiceover depends on the use case, but the user base in general would like to listen to options.

With the new model we were launching (Gen 2), we realised we could generate multiple variations of a sentence, each with unique prosody, emotion, or feel. We brainstormed with the research and product teams on how to present this solution to our users, and also spoke to a few of our users about it.

Solution

Variability works on a scale of 1 to 5, and users can set its value at both the block and sub-block level. If the user sets the value close to 1, the generated voiceovers mostly adhere to the default values; as the value gets closer to 5, the generations become more expressive and unique, letting the user explore more possibilities.

Discovery in a Block (each text block is called a Block, which is further broken down into sub-blocks)

Discovery in a Sub-Block

If the creator wants an entire block - all sub-blocks within it - to be generated in a consistent way, they can set a lower Variability value, closer to 1, at the block level. All sub-block generations within it will then have very little delta between them in terms of emotion and prosody.

If the creator wants to explore a different emotion or dialogue delivery for a specific sub-block, they can change the Variability value for just that sub-block, in which case we show an identifier with the changed value in that area.

This feature delivers the most value in sub-block generation, where creators can generate multiple variations of a text sub-block at once.

This has proved most useful for creators working in advertising and marketing departments, who want really expressive and unique voiceovers.

Creators can preview the generations and compare them. If they like one of the options, they can save it to refer back to later.

They can also change the Variability values and generate multiple variations. Once they find what works best for them, they can apply it to the main block.

In the iterations before the final solution, we seriously explored a visual representation for each generation. The hypothesis was that if each generation had a consistent visual, creators could glance at them and see how they differ from each other without actually hearing every one. Even though we had some strong POCs, we decided not to go ahead with it, at least in the first version.
We also explored showing the contours of the generations as a form of visual feedback, but from user testing and talking to the team, we realised it was not that helpful.

Part Two: Say It My Way

Thought process

While talking to creators using Murf Studio, we noticed a pattern among a set of users who know exactly how a voiceover should sound. They either have a sample or can record themselves articulating the exact delivery. This is a slightly advanced feature for advertisers and marketing departments that need more expressive voiceovers - creators who find it very time consuming to edit a voiceover to match the exact version they have planned.

So we realised there was a need to take in a reference clip from the user and have Murf match the generated voiceover to that reference, with better voice quality and clarity.

Solution we came up with

Through Say It My Way (SIMW), users can either record a voiceover themselves or upload a recording of the script, and the model will transfer the emotion onto the generation, retaining the same feel and prosody. This way, if a marketing team knows exactly what they are looking for, they can achieve that outcome with ease.

Since SIMW is another route through which you can arrive at the desired voiceover, we decided to keep its entry point close to the main Generate CTA, beside the Variability feature.

The entire experience for this feature was built around the hypothesis that users would rather record themselves than upload, so we wanted to provide a studio-like experience. We decided to show the sub-block content in a large font here so users can read it aloud while recording.

On clicking the "Record" CTA, the mic eases in a little, signifying the start of the recording along with an audio cue. Users can also upload a clip if they have a recording from someone else.

Once the voiceover is ready, users can preview it and apply it if they are happy with the result.

They can also see the transcription of the recorded/uploaded clip and make edits to the transcription if they need to.

Once a voiceover from SIMW is applied to a block, we show an identifier to denote it.

Part Three: Change Locale

Thought process

Some of our users with a global presence often had multilingual scripts. The most common use case was Spanish phrases within English sentences, which often got spoken in a non-Spanish accent. This did not work for a lot of our customers.

We brainstormed with our research team to understand the model's capabilities and came up with a solution. Each sub-block has a default language set by the user, and with the new model launch it became possible to change the accent for specific parts of a sentence.

Solution we came up with

Users can select a specific word or phrase that needs a native accent, and they will see the list of supported accents or locales.

Once the locale is changed for a phrase, we highlight it and add an identifier to convey the change.

For cases where a specific word needs to be spoken in a different locale, users can select the word and open the pronunciation window.

Word-specific locales can be set here - this usually applies when a brand name or a noun needs to be pronounced in a different locale.

We highlight this change in locale in the selected state as well.

How did these perform?

We bundled all of these features and termed them Gen 2 Features, as they launched along with our new model. They were available only on paid plans, though every free user could try them for a month; after 30 days, free users would get a nudge to upgrade. With the launch of these features, free-to-paid conversions increased 3x, as these features proved genuinely helpful to our users.