Building realistic video AI avatars in an hour from scratch
Learn about our process building a Twitter AI bot that can generate avatar videos and responses in minutes using Sieve.
by Abhi Upadhyay
Note: This blog is outdated. If you're interested in a production-ready app for this capability, please check out the video retalking app on Sieve.

In the past few months, we’ve seen the rise of all sorts of generative AI tools, including GPT-3, Stable Diffusion, DALL-E, Meta AI's Make-A-Video, and many more. In particular, AI video generation companies like Synthesia have shown us the power of generating AI avatars, for use cases ranging from customer support tools to innocent Twitter trolling.

This past weekend, we set out to build a workflow that could generate avatar AI videos using as many open-source models as possible, given an avatar image and a few voice samples. We launched our workflow as a Twitter AI bot that can mimic others (Stephen A. Smith, Mark Zuckerberg, and Sam Altman) and answer all of Twitter’s burning questions in video form. You can check out the latest responses here, or go ahead and @ them on Twitter yourself!

In this blog post, we’ll be going through how to make an AI generated avatar video using a variety of open-source models. If you’d like to just try out our template workflow and start generating videos in minutes, skip down below.

Lip-syncing audio to video

The first critical step in generating AI avatar videos is making the avatar’s mouth sync up to the provided audio. Luckily, there’s a popular open source model, Wav2Lip, that helps with this.

Wav2Lip is a state-of-the-art lip-syncing technology that uses deep neural networks to generate realistic lip movements from an input audio signal. It consists of two main components: an audio encoder and a lip-sync decoder. The audio encoder extracts features from the provided audio signal that help the lip-sync decoder, a convolutional neural net, predict lip movements.
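To make the audio side a bit more concrete, here is a minimal sketch of how audio features can be lined up with video frames before they reach the audio encoder. It loosely follows the windowing used in the public Wav2Lip inference script, but the function name `mel_windows` and the default values (80 mel-spectrogram frames per second, 16-frame windows) are assumptions made for illustration, not a description of our Sieve implementation.

```python
def mel_windows(num_mel_frames, fps, mel_frames_per_second=80.0, window=16):
    """Return one (start, end) mel-spectrogram window per video frame.

    Assumed defaults: 80 mel frames per second of audio and a 16-frame
    window per video frame. Both values are illustrative.
    """
    if num_mel_frames < window:
        raise ValueError("audio is shorter than a single window")

    step = mel_frames_per_second / fps  # mel frames advanced per video frame
    windows = []
    i = 0
    while True:
        start = int(i * step)
        if start + window > num_mel_frames:
            # Final window: clamp to the end of the audio and stop.
            windows.append((num_mel_frames - window, num_mel_frames))
            break
        windows.append((start, start + window))
        i += 1
    return windows


# One second of audio (80 mel frames) against a 20 fps video.
for start, end in mel_windows(80, fps=20)[:3]:
    print(start, end)
```

Each window is the slice of audio that the model would look at when predicting the mouth shape for the corresponding video frame.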

We also found that Wav2Lip gives us much better results when we input a video of just the cropped face, so we preprocess our videos by running an open-source face detector (MediaPipe) and tracking those faces with a SORT implementation.
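As a rough illustration of the cropping step, the sketch below turns the per-frame boxes of one tracked face into a single padded crop region that stays fixed for the whole clip. The helper names (`padded_crop`, `track_crop`), the `(x1, y1, x2, y2)` box format, and the 25% padding are all assumptions for this example. In practice the boxes would come from the face detector and tracker.

```python
def padded_crop(box, frame_w, frame_h, pad=0.25):
    """Expand an (x1, y1, x2, y2) box by `pad` on each side, clamped to the frame."""
    x1, y1, x2, y2 = box
    pad_x = (x2 - x1) * pad
    pad_y = (y2 - y1) * pad
    return (
        max(0, int(x1 - pad_x)),
        max(0, int(y1 - pad_y)),
        min(frame_w, int(x2 + pad_x)),
        min(frame_h, int(y2 + pad_y)),
    )


def track_crop(boxes, frame_w, frame_h, pad=0.25):
    """Return one crop that covers a tracked face across all of its frames."""
    # Union of all boxes in the track, so the crop does not jitter per frame.
    union = (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )
    return padded_crop(union, frame_w, frame_h, pad)


# Two frames of the same tracked face in a 640x480 video.
print(track_crop([(100, 100, 200, 200), (120, 80, 220, 180)], 640, 480))
```

Using one crop for the whole track is a design choice made for this sketch: it keeps the cropped video stable, at the cost of a slightly looser crop when the face moves a lot.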

You can find our full implementation of Wav2Lip here, with its associated Sieve workflow.

To generate a video of a person saying something else, you can run Wav2Lip on the reference video along with a generated audio track. However, as you’ll notice below, the results aren’t great: the mouth is blurry and the motion feels unnatural. Ideally, we’d want Wav2Lip’s output to be higher resolution.

We could make Wav2Lip generate higher-resolution videos by retraining the base model, but because retraining is a lengthy process, we take a slightly different approach. We feed our Wav2Lip output video to a separate model that was trained to “animate” a source image so that it follows a “driving video”. As we’ll see in the next section, this makes the lips look a lot clearer!

Animating avatar image

To animate an image and mimic our reference video (generated by Wav2Lip), we can use the recently released open-source model, the Thin-Plate Spline Motion model.

The Thin-Plate Spline Motion model performs motion transfer from a “driving video” (in our case, the Wav2Lip video) to a static image. The model works by producing optical flow matrices, which describe the transformation from one frame to the next, and applying those transformations to the source image. Additionally, to smooth out the resulting images, the model uses multi-resolution occlusion masks to fill in missing parts of the animated avatar image.
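The idea of "apply a flow field, then patch what the flow cannot explain" can be sketched in a few lines of plain Python. This is a toy illustration and not the model itself: the function name `warp_with_occlusion`, the integer per-pixel displacements, and the constant `fill` value are all assumptions. In the real model the flow is predicted by a network, and occluded regions are filled in by a learned decoder, not a constant.

```python
def warp_with_occlusion(source, flow, occlusion, fill=0.0):
    """Backward-warp a 2D image with a per-pixel flow field.

    source:    2D list of pixel values (the avatar image)
    flow:      2D list of (dx, dy) integer offsets; output pixel (x, y)
               is sampled from source pixel (x + dx, y + dy)
    occlusion: 2D list of weights in [0, 1]; 1 keeps the warped pixel,
               0 replaces it with `fill` (a stand-in for learned inpainting)
    """
    height, width = len(source), len(source[0])
    output = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = flow[y][x]
            # Clamp the sample position so it stays inside the image.
            sx = min(max(x + dx, 0), width - 1)
            sy = min(max(y + dy, 0), height - 1)
            mask = occlusion[y][x]
            row.append(mask * source[sy][sx] + (1 - mask) * fill)
        output.append(row)
    return output


image = [[1, 2, 3], [4, 5, 6]]
flow = [[(1, 0)] * 3 for _ in range(2)]  # sample one pixel to the right
occlusion = [[1, 1, 0], [1, 1, 1]]       # top-right pixel is "missing"
print(warp_with_occlusion(image, flow, occlusion))
```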

Check out our full implementation of the model here. After just these 2 steps using open-source models, we’re able to generate avatar videos that look somewhat realistic.

Aside: Twitter bot / voice cloning

To round out our project for the Twitter bot, we wanted to experiment with voice cloning from a few samples. The most promising open-source model was Tortoise-TTS. However, the results still ended up being subpar and sounded more robotic than we had hoped.

Instead, we opted for a more complete and custom solution: Eleven Labs. With just a few samples, we were able to clone voices and generate audio for our scripts with an API call, and the results were pretty incredible.
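For readers curious what such an API call might look like, here is a hedged sketch that only builds the HTTP request and does not send it. The endpoint path and the `xi-api-key` header reflect our understanding of the Eleven Labs text-to-speech REST API and should be checked against their current documentation. The helper name `build_tts_request` and the placeholder voice ID and key are hypothetical.

```python
import json
import urllib.request


def build_tts_request(text, voice_id, api_key):
    """Build (but do not send) a text-to-speech request for a cloned voice.

    Assumption: POST /v1/text-to-speech/{voice_id} with an `xi-api-key`
    header and a JSON body containing the text to speak.
    """
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
    body = json.dumps({"text": text}).encode("utf-8")
    headers = {"xi-api-key": api_key, "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder values; a real call needs your own voice ID and API key.
request = build_tts_request("Hello, Twitter!", "my-voice-id", "MY_API_KEY")
print(request.get_method(), request.full_url)

# Sending the request would return audio bytes:
# with urllib.request.urlopen(request) as response:
#     audio_bytes = response.read()
```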

Generate Avatars!

Now that we have some background on the models we used, let’s generate some avatar videos!

  1. To follow along, you’ll need to sign up for a Sieve account and get your Sieve API key. You'll also need to install the Sieve CLI:
pip install
  2. Write your workflow! Open up a new file in a new folder, avatar_gen/, and copy in the following code:
import sieve

def avatar_generation(driving_video: sieve.Video, driving_audio: sieve.Audio, avatar: sieve.Image) -> sieve.Video:
    # Split the driving video into individual frames
    images = sieve.reference("sieve-developer/video-splitter")(driving_video)
    # Detect faces in each frame, then track them across frames
    faces = sieve.reference("sieve-developer/mediapipe-face-detector")(images)
    tracked_faces = sieve.reference("sieve-developer/sort")(faces)
    # Lip-sync the tracked face to the driving audio
    synced = sieve.reference("sieve-developer/wav2lip")(driving_video, driving_audio, tracked_faces)
    # Animate the avatar image, using the lip-synced video as the driving video
    return sieve.reference("sieve-developer/thin_plate_spline_motion_model")(avatar, synced)

This workflow uses Sieve building blocks (sieve.reference), so we can skip all the boilerplate code and the nuances with each model, empowering us to build features like avatar generation fast. If you want to dive deeper into the implementation of each of these models, check out our examples repo.

  3. Deploy to Sieve. From the avatar_gen directory, run sieve deploy.
  4. Generate your first video!

What next?

To integrate your own AI avatar videos into your app in a few minutes, you can follow the guide above or clone our template from the UI and access any of your job outputs via our API. We offer $20 of credit, so you can get started right away.

In addition to pre-built building blocks and models, Sieve deploys your custom models at scale, so you don’t have to worry about infra and can focus on building your video AI features fast. If you have any models you would love to see on Sieve’s platform or have any questions about custom use cases, please don’t hesitate to reach out.
