State-of-the-art audio enhancement in 5 minutes
Learn how we built a high-quality AI audio enhancement app from open-source models that rivals the best APIs on the market. Try it for yourself!
by Abhi Upadhyay

In multimedia and podcasting, user experience often hinges on audio quality. High-quality audio captivates audiences and keeps them engaged with your content, whether it is a podcast, a tutorial, or a promo. Audio quality also plays a crucial role in the accuracy of downstream AI tasks such as automated transcription with OpenAI's Whisper. In fact, an entire industry of startups, including Descript and Krisp AI, is working in this space.

With the help of two open-source models, AudioSR and DeepFilterNet, we’ve launched an audio enhancement app that removes background noise and enhances speech, so your audio quality is always up to par.

Here’s a quick example of the sound quality difference using open-source models:

In this post, we’ll go through a quick demonstration of how you might integrate this solution into your AI project and some background on the models we used.

Trying it out

You can try out the pre-built audio enhancement app with your own audio samples in a few clicks here. Here are a few more samples from podcasts and YouTube videos:

Sample 1

Original
Enhanced

Sample 2

Original
Enhanced

Run via API or Python

You can also integrate the app into your current workflow through an API call or Python call with the following steps:

  1. Sign up for a Sieve account and find your API key here.

  2. Run audio enhancement via API (or see below for Python)

    curl -X POST https://mango.sievedata.com/v2/push \
    -H "X-API-Key: <your-api-key>" \
    -d '{
      "function": "sieve/audio_enhancement",
      "inputs": {
        "audio": {
          "url": "<your-audio-url>"
        }
      }
    }'
  3. Run via Python client

    • Install the Python client
    pip install sievedata
    
    sieve login
    • Run this Python script with your own audio!
    import sieve
    
    audio_enhancer = sieve.function.get('sieve/audio_enhancement')
    
    # Specify "upsample", "noise" or "all" for the filtering type
    enhanced_audio = audio_enhancer.run(sieve.Audio(path="./speech.wav"), "upsample")
    
    # View results on Sieve dashboard or locally from this path
    print(enhanced_audio.path)

It’s as simple as that! Results will now be viewable on your Sieve dashboard or directly from your Python code.
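If installing the client is not an option, the curl request from step 2 can also be assembled with Python's standard library. The following is a minimal sketch, not an official client: the helper name `build_push_request` is hypothetical, the `Content-Type` header is an assumption, and the shape of the response is not covered here.

```python
import json
import urllib.request

# Endpoint taken from the curl example in step 2
SIEVE_PUSH_URL = "https://mango.sievedata.com/v2/push"


def build_push_request(api_key: str, audio_url: str) -> urllib.request.Request:
    """Build (but do not send) the POST request from the curl example.

    Hypothetical helper for illustration; not part of the Sieve client.
    """
    payload = {
        "function": "sieve/audio_enhancement",
        "inputs": {"audio": {"url": audio_url}},
    }
    return urllib.request.Request(
        SIEVE_PUSH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "X-API-Key": api_key,
            # Assumption: label the body as JSON so the server parses it as such
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_push_request("<your-api-key>", "<your-audio-url>")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually submit the job: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it easy to inspect the payload before spending any API credits.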

How it works

The magic lies in pairing two open-source AI models: AudioSR and DeepFilterNet.

AudioSR

AudioSR is a generative model that uses a diffusion-based approach to estimate the high-frequency components of a low-resolution audio signal. It does this by training a latent diffusion model to learn the conditional generation of high-resolution spectrograms from low-resolution spectrograms. The model can handle a flexible input sampling rate between 4kHz and 32kHz, covering most real-world scenarios. AudioSR has achieved promising results on speech, music, and sound effects with different input sampling rate settings and has been verified to be a plug-and-play module for enhancing the audio quality of various audio generation models.
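To make the sampling-rate range concrete: by the Nyquist limit, audio sampled at a given rate can only carry frequencies up to half that rate, so everything above that point is what a super-resolution model has to generate. The sketch below computes that missing band. It is plain arithmetic and does not call AudioSR; the helper name `missing_band_hz` and the 48 kHz default target rate are assumptions chosen for illustration.

```python
MIN_INPUT_HZ = 4_000   # lower bound of the input range quoted above
MAX_INPUT_HZ = 32_000  # upper bound of the input range quoted above


def missing_band_hz(input_rate_hz: int, target_rate_hz: int = 48_000) -> tuple[float, float]:
    """Return the (low, high) frequency band, in Hz, that the input cannot contain.

    Hypothetical helper: the band runs from the input's Nyquist limit
    up to the Nyquist limit of the (assumed) target sampling rate.
    """
    if not MIN_INPUT_HZ <= input_rate_hz <= MAX_INPUT_HZ:
        raise ValueError(f"input rate {input_rate_hz} Hz is outside the 4-32 kHz range")
    if target_rate_hz <= input_rate_hz:
        raise ValueError("target rate must exceed the input rate")
    return input_rate_hz / 2, target_rate_hz / 2


if __name__ == "__main__":
    # An 8 kHz recording (telephone-like quality) carries nothing above 4 kHz
    low, high = missing_band_hz(8_000)
    print(f"Content to generate: {low:.0f} Hz to {high:.0f} Hz")
```

The lower the input sampling rate, the wider the band that has to be estimated rather than recovered, which is why a generative approach is used here.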

DeepFilterNet

DeepFilterNet is a deep learning-based speech enhancement framework that exploits the harmonic structure of speech to remove unwanted noise from audio efficiently. It operates in two stages: the first enhances the speech envelope in the ERB domain, and the second uses deep filtering to enhance the periodic component. Several optimizations to the training procedure, data augmentation, and network structure yield state-of-the-art speech enhancement performance at a lower computational cost, making it suitable for real-time use on embedded devices.
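For readers unfamiliar with the ERB domain: ERB (equivalent rectangular bandwidth) bands are narrow at low frequencies and wide at high frequencies, roughly following how human hearing resolves pitch. Working on a few dozen such bands instead of hundreds of FFT bins is one way to keep a first stage cheap. The sketch below computes band edges using the standard Glasberg and Moore ERB-rate formula. It is an illustration only, not DeepFilterNet's implementation; the function names, the 32-band count, and the 24 kHz upper limit are assumptions.

```python
import math


def hz_to_erb_rate(freq_hz: float) -> float:
    """Map a frequency in Hz to the ERB-rate scale (Glasberg and Moore)."""
    return 21.4 * math.log10(1.0 + 0.00437 * freq_hz)


def erb_rate_to_hz(erb_rate: float) -> float:
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) / 0.00437


def erb_band_edges(num_bands: int, max_freq_hz: float) -> list[float]:
    """Return num_bands + 1 edges in Hz, equally spaced on the ERB-rate scale."""
    top = hz_to_erb_rate(max_freq_hz)
    return [erb_rate_to_hz(top * i / num_bands) for i in range(num_bands + 1)]


if __name__ == "__main__":
    # Assumed settings for illustration: 32 bands up to 24 kHz
    edges = erb_band_edges(32, 24_000.0)
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    print(f"Narrowest band: {widths[0]:.1f} Hz, widest band: {widths[-1]:.1f} Hz")
```

Because the band widths grow with frequency, most of the resolution is spent on the low frequencies where speech energy is concentrated.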

Other Solutions

Let's take a look at how the open-source approach stacks up against other prominent solutions on the market, namely the Dolby Enhance API:

Original

Dolby

AudioSR + DeepFilterNet

Very promising results from a purely open-source solution!

What next?

Audio enhancement can be used upstream of most other AI audio functionality. For example, a feature like speech editing (also featured on Sieve) can use audio enhancement to improve the quality of the generated voices.

Sieve's cloud platform makes combining functionality in this way easy. To try it, create an account!