Improving on open-source for fast, high-quality AI lipsyncing
We discuss modifying current open-source lipsyncing solutions such as OpenTalker's Video Retalking to get a performant, production-ready lipsyncing solution.
by Abhinav Ayalur
Note: Learn how we built a production-grade lipsyncing solution below, or try it out here for free on Sieve.

Realistically overlaying audio onto a single-speaker (or “talking-head”) video, also known as video lipsyncing, retalking, or dubbing, is a powerful generative AI capability that allows videos to be edited with slightly different audio, dubbed into other languages, or even turned into hilarious TikToks. Here's a fun one we made about the recent OpenAI / Microsoft saga:

In today’s age of targeted media content, changing videos slightly to fit a specific demographic can allow companies to have a higher impact on niche audiences. One recent example was Cadbury's ad campaign featuring famous Indian movie star Shah Rukh Khan, which realistically dubbed his voice into many different languages.

Several papers, like LipGAN, Wav2Lip, and Video Retalking, have outlined different approaches to building a model or set of models to yield convincing results, yet they have a few issues preventing them from being deployed to production.

LipGAN and Wav2Lip are somewhat unstable in several scenarios, making the lips move in unnatural ways and failing to blend them properly with the expression of the rest of the face. Video Retalking is much better at making the lips look natural and realistic. However, it is quite slow, with the naive solution taking up to 1 minute to process a single second of audio and video. It also degrades the quality of the entire video, producing small artifacts and introducing some graininess. In addition, it can fail with lower-quality input data, which we often encounter in production scenarios.

We modified the Video Retalking repository to address some of these points, developing our own algorithm that runs up to 5x faster while producing higher-quality output. Let's outline how this works, from the original implementation to the key improvements we’ve made.

Original Implementation

Let’s go through a brief summary of the original implementation in the Video Retalking repository.

Video Retalking Implementation

First, we extract facial landmarks from the original video. Next, we stabilize the expression in each frame of the original video using a model known as D-Net, which makes it easier to align facial features later on. We also separately obtain a mask of the mouth region from the original video.

Then, we concatenate these outputs and obtain new landmarks, which we pass into a network along with encoded audio features to generate a low-resolution version of the lips. We enhance the mouth region using GPEN and GFPGAN, two generative face restoration algorithms, so that the mouth and teeth look more realistic, and we use Laplacian blending to seamlessly merge the newly generated mouth with the original video.
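To make the flow concrete, here is a simplified sketch of the stages described above. The stage names (extract_landmarks, stabilize_expression, generate_lips, and so on) are illustrative placeholders rather than the actual functions in the Video Retalking codebase.

```python
def retalk(frames, audio, stages):
    """Conceptual sketch of the Video Retalking pipeline. `stages` is a dict of
    callables standing in for the real models; the names are illustrative only."""
    landmarks   = [stages["extract_landmarks"](f) for f in frames]
    stabilized  = [stages["stabilize_expression"](f, lm)          # D-Net stabilization
                   for f, lm in zip(frames, landmarks)]
    mouth_masks = [stages["mouth_mask"](f, lm) for f, lm in zip(frames, landmarks)]

    audio_feats  = stages["encode_audio"](audio)                  # per-frame audio features
    low_res_lips = stages["generate_lips"](stabilized, landmarks, audio_feats)

    output = []
    for frame, lips, mask in zip(frames, low_res_lips, mouth_masks):
        enhanced = stages["enhance_face"](lips)                   # GPEN / GFPGAN restoration
        output.append(stages["laplacian_blend"](frame, enhanced, mask))
    return output
```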

Video Retalking on its own is pretty decent. However, it can take up to a minute to generate a second of video, adds unnecessary artifacts and blurriness to the video, and doesn’t handle certain edge cases properly, particularly frames with no detected faces, low-resolution faces, or close-up faces.

Our Improvements

To address this, we’ve introduced a series of optimizations to the original repository that greatly improve speed and quality.

The first optimization is smartly cropping around the face of the target speaker to avoid unnecessarily processing most of the video. Along with the ML models, Video Retalking performs many computer vision operations (warping, inverse transforms, etc.) that are expensive to run on full frames. We quickly identify the target speaker using batched RetinaFace, a very lightweight face detector. In many scenarios there are multiple faces, or even multiple predictions of the same face, so we isolate the largest face and, for now, treat it as the target speaker. Then, we crop the video around the union of all detections of that face. This lets us process a much smaller subsection of the video, speeding up inference by up to 4x, especially on videos where the face is small and doesn’t move much. In addition, establishing the target speaker crop allows us to enhance only that part of the video, rather than potentially generating artifacts around other sections of the frame.
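As a rough illustration, the cropping step could look something like the sketch below. The detect_faces_batched argument stands in for a batched RetinaFace-style detector and is an assumption, not the repository's actual interface; the largest-face and union-box logic is what we describe above.

```python
def target_speaker_crop(frames, detect_faces_batched, pad=0.1):
    """Pick the largest face per frame and crop all frames to the union of those boxes.

    `detect_faces_batched` is a placeholder for a batched RetinaFace-style detector
    that returns a list of (x1, y1, x2, y2) boxes per frame.
    """
    union = None
    for boxes in detect_faces_batched(frames):
        if len(boxes) == 0:
            continue  # no face in this frame; handled separately downstream
        # Assume the largest detection is the target speaker.
        x1, y1, x2, y2 = max(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))
        union = (x1, y1, x2, y2) if union is None else (
            min(union[0], x1), min(union[1], y1),
            max(union[2], x2), max(union[3], y2),
        )

    if union is None:
        return frames, None  # no faces found anywhere; fall back to full frames

    h, w = frames[0].shape[:2]
    px, py = int(pad * (union[2] - union[0])), int(pad * (union[3] - union[1]))
    x1, y1 = max(0, int(union[0]) - px), max(0, int(union[1]) - py)
    x2, y2 = min(w, int(union[2]) + px), min(h, int(union[3]) + py)
    return [f[y1:y2, x1:x2] for f in frames], (x1, y1, x2, y2)
```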

Second, we added batching to the stabilization step, making it much faster when combined with the cropping above. We also removed the enhancement of the stabilized video, as we found it no longer affected quality once cropping was in place.
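A minimal sketch of what batched stabilization could look like, assuming a PyTorch D-Net-style model that accepts a batch of frame tensors; the model handle and preprocessing here are placeholders, not the repository's exact code:

```python
import torch

@torch.no_grad()
def stabilize_batched(frames, dnet, batch_size=16, device="cuda"):
    """Run expression stabilization in batches instead of frame-by-frame.

    `dnet` is a placeholder for the stabilization network; `frames` are
    HxWx3 uint8 numpy arrays from the cropped video.
    """
    outputs = []
    for i in range(0, len(frames), batch_size):
        batch = torch.stack([
            torch.from_numpy(f).permute(2, 0, 1).float() / 255.0
            for f in frames[i:i + batch_size]
        ]).to(device)
        stabilized = dnet(batch)  # one forward pass per batch
        outputs.extend(
            (stabilized.clamp(0, 1) * 255).byte().permute(0, 2, 3, 1).cpu().numpy()
        )
    return outputs
```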

When detecting facial landmarks, the original repository reinitialized the keypoint extractor multiple times and recomputed landmarks at multiple steps of the process because of input resizing. We initialize the keypoint extractor once and rescale previously computed landmarks so they can be reused during facial alignment. On low-resolution inputs where the face is very small, we bypass the parts of the alignment that actually made the output look worse, since feature detection there was much less accurate.
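Reusing landmarks across resolutions boils down to scaling the keypoint coordinates rather than re-running the detector; a minimal sketch (not the repository's exact code):

```python
import numpy as np

def rescale_landmarks(landmarks, src_size, dst_size):
    """Rescale (N, 2) landmark coordinates from one frame size to another,
    so landmarks computed once can be reused after the video is resized."""
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    scale = np.array([dst_w / src_w, dst_h / src_h], dtype=np.float32)
    return landmarks.astype(np.float32) * scale
```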

Finally, we made the code more robust to edge cases where no faces are detected (by ignoring those frames), more than one face is detected (by selecting the largest face), or there is a lot of movement from the speaker (by being smart about cropping).
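For instance, handling frames without a detected face can be as simple as passing them through unchanged; the sketch below assumes a hypothetical per-frame lipsync step, with the largest-face selection handled as in the cropping sketch above.

```python
def process_frames(frames, detections, lipsync_frame):
    """Pass frames with no detected face through untouched; otherwise lipsync
    the detected face. `lipsync_frame` is a placeholder for the real pipeline."""
    results = []
    for frame, boxes in zip(frames, detections):
        if not boxes:
            results.append(frame)  # no face detected: leave the frame unmodified
        else:
            results.append(lipsync_frame(frame, boxes))
    return results
```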

In addition, we’ve optimized CPU and GPU memory usage throughout the code so that it fits on an L4 GPU with 8 vCPUs and 32 GB RAM, making it very cost-effective. We also added a low-resolution, low-FPS option that allows up to an additional 4x speedup for scenarios where speed matters more than quality.
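The low-resolution, low-FPS preprocessing can be done with a standard ffmpeg filter chain before the pipeline runs; the resolution and frame rate values below are illustrative defaults, not the values we ship.

```python
import subprocess

def downscale_video(src, dst, height=480, fps=25):
    """Downscale and reduce the frame rate of the input before lipsyncing.
    The 480p / 25 fps defaults here are illustrative, not our production values."""
    subprocess.run([
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale=-2:{height},fps={fps}",  # keep aspect ratio, cap frame rate
        "-c:a", "copy",                          # leave the audio untouched
        dst,
    ], check=True)
```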

Comparisons

Let’s compare our results on a test case with the outputs of Wav2Lip, Video Retalking, and a closed-source lipsyncing API.

Original Video: Here is a source video of a person speaking.

Audio to Lipsync: Here is the audio we want to lipsync to the video.


Wav2Lip Output (https://github.com/Rudrabha/Wav2Lip)

Notes

  • Quick generation speed. Took 10 seconds to generate.
  • Output is extremely blurry and contains artifacts.
  • Lips are relatively in sync.

Closed-Source Lipsync API Output

Notes

  • Slow generation speed. Took 7 minutes to generate.
  • Higher-quality background.
  • Unblended crop (notice the box around the speaker’s face).
  • Inconsistent mouth movement, especially towards the end.

VideoRetalking Output (https://github.com/OpenTalker/video-retalking)

Notes

  • Slow generation speed. Took 6.5 minutes to generate.
  • Better mouth movement.
  • Speaker’s face is grainy, which is especially noticeable when he blinks and moves his head around.

Sieve's Lipsyncing Solution Output (to be open-sourced)

Notes

  • Relatively faster generation speed. Took 2.1 minutes to generate.
  • Stable, realistic mouth movement.

Conclusion

With our lipsyncing solution on Sieve, we are able to beat other solutions by generating a stable lipsync that doesn’t degrade the quality of the rest of the video, at a rate of just 10 seconds of processing per second of input video. Below, we’ll outline how you can try it out for free today.

Trying it out

You can try out the pre-built fast retalking model in a couple steps with your own audio and video samples here.

You can also integrate the model via the Python client or REST API using the steps here.
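If you use the Python client, a call might look roughly like the sketch below. The function slug, input types, and parameters here are assumptions for illustration; the docs linked above are the source of truth for the actual interface.

```python
import sieve

# Hypothetical function slug and inputs -- check the linked docs for the
# actual interface of the retalking function.
video = sieve.File(path="speaker.mp4")
audio = sieve.File(path="new_audio.wav")

retalker = sieve.function.get("sieve/video_retalking")
output = retalker.run(video, audio)
print(output.path)  # path to the lipsynced output video
```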

Future Work

There are a few ways we want to improve our solution to allow for more portability, faster inference, and better quality. We want to experiment with different landmark detection schemes that might improve the quality of alignment on lower-resolution videos. In addition, we want to improve the mouth enhancement so that it can support 4k resolution videos at high quality. We also want to try to speed up model inference by quantizing and pruning some of the networks and by modifying their input resolutions.

We are in the process of cleaning the codebase and open-sourcing the repository, so be on the lookout for our code release early next week! We’re also going to be releasing an article about our optimized version of video dubbing into different languages, which builds on top of this post.

Sieve's cloud platform makes it extremely simple to combine models like this with other solutions for text-to-speech, translation, and more. To try it, create an account or check out our explore page.
