Adding Sound Effects to Stock Videos with AI
In this post, we build an app that adds sound effects to stock videos using vision language models and audio generation models.
by Mokshith Voodarla

Our team recently saw the new “sound effect” feature from Pika Labs and found it extremely fun to play with. So we decided to make our own version! Basically, the goal is to take a 5-10 second input clip and generate sounds that match it.

Generally, we think something like this could be useful for folks trying to add sound effects to stock video content. Instead of using stock audio and sound effect libraries like Storyblocks, Adobe, or PremiumBeat, why not generate the audio with AI?

Adding appropriate sound effects to videos means being able to understand what’s going on in the video, what it could sound like, and then generating appropriate sounds given all of that context.

How it Works

We’ve seen multi-modal AI models like OpenAI’s GPT-4 take in simple text prompts to generate images. This is done by getting the LLM to first expand the text prompt, which it then feeds into DALL·E 3. Similarly, our idea is to do the following:

  • extract some frames from the stock video
  • get a model to describe what that video is or might sound like
  • get a model to generate audio using that description
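
To make the flow concrete, here is a minimal sketch of the three steps above as a single function. The name `add_sound_effects` and its three callable parameters are hypothetical, introduced only for illustration; each step is passed in as a plain function so the skeleton stays independent of any particular model or library.

```python
from typing import Callable, Dict


def add_sound_effects(
    video_path: str,
    extract_frame: Callable[[str], str],
    describe_sound: Callable[[str], str],
    generate_audio: Callable[[str], str],
) -> Dict[str, str]:
    """Run the three steps in order and return every intermediate result.

    All three callables are placeholders (assumptions for this sketch):
    - extract_frame:  video path  -> path of an extracted frame image
    - describe_sound: image path  -> text describing what the scene sounds like
    - generate_audio: description -> path of the generated audio file
    """
    frame_path = extract_frame(video_path)    # step 1: pull a frame from the clip
    description = describe_sound(frame_path)  # step 2: ask a model what it might sound like
    audio_path = generate_audio(description)  # step 3: generate audio from that description
    return {"frame": frame_path, "description": description, "audio": audio_path}


if __name__ == "__main__":
    # Stub functions stand in for the real models so the flow can be tried offline.
    result = add_sound_effects(
        "clip.mp4",
        extract_frame=lambda video: video + ".jpg",
        describe_sound=lambda frame: "waves crashing on a beach",
        generate_audio=lambda text: "out.wav",
    )
    print(result)
```

Keeping each step behind a callable makes it easy to swap one model for another later without touching the rest of the flow.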

It turns out there are some great open-source models that let us do just this. Vision language models (VLMs) are starting to get really good. These models take in an image along with a prompt and respond to the prompt based on the image. We have a couple of these available on Sieve, such as CogVLM, InternLM, and Moondream, each of which comes with its own cost / quality tradeoff. For this app, we picked CogVLM, which we found to be really descriptive. Here’s how we use Sieve to prompt it.

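import sieve

# Load a frame from disk and ask the VLM what the scene might sound like.
image = sieve.Image(path="./some_image.jpg")
prompt = "describe what you might hear in this image in detail."

cogvlm_chat = sieve.function.get("sieve/cogvlm-chat")
# Run the function and wait for its result; the (image, prompt) argument order is assumed.
output = cogvlm_chat.run(image, prompt)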


We then take this description and feed it into AudioLDM, a state-of-the-art audio generation model.

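audioldm = sieve.function.get("sieve/audioldm")
# Use the VLM's description as the text prompt; passing it as the only argument is assumed.
audio = audioldm.run(output)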

In the case of video, we simply sample the middle frame and pass it into our VLM, though video-native models could offer a more robust approach in the future. Once all of this is complete, we’re able to create videos like these. All of the sound you hear was generated with AI!
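
For reference, sampling the middle frame could look something like the sketch below. It assumes OpenCV (`cv2`) is installed, which may not be what the app itself uses; the function names `middle_frame_index` and `save_middle_frame` are hypothetical. Note that the frame count reported by a container can be approximate for some video formats, so this is a best-effort midpoint.

```python
def middle_frame_index(frame_count: int) -> int:
    """Return the zero-based index of the middle frame of a clip."""
    if frame_count <= 0:
        raise ValueError("video has no frames")
    return frame_count // 2


def save_middle_frame(video_path: str, image_path: str) -> str:
    """Write the middle frame of video_path to image_path and return image_path."""
    # Imported here so the index helper above stays usable without OpenCV installed.
    import cv2

    cap = cv2.VideoCapture(video_path)
    try:
        if not cap.isOpened():
            raise IOError(f"could not open {video_path}")
        frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        # Seek to the middle frame, then decode exactly one frame.
        cap.set(cv2.CAP_PROP_POS_FRAMES, middle_frame_index(frame_count))
        ok, frame = cap.read()
        if not ok:
            raise IOError("could not decode the middle frame")
        cv2.imwrite(image_path, frame)
        return image_path
    finally:
        cap.release()
```

The saved image path can then be wrapped in `sieve.Image(path=...)` and passed to the VLM as shown earlier.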

You can try the app for yourself here, and we’ve open-sourced all the code here so you can modify it however you like.

© Copyright 2024. All rights reserved.