How developers are changing video creation once again with AI

Contrary to what most AI researchers might’ve predicted a year ago, the first jobs to change drastically because of AI are the creative ones. Most recently, Sora from OpenAI has captured much of the mindshare of researchers and creatives, but many other AI-powered tools are also changing the way creators work. In this post, I discuss the first time computers changed the way we create video and how AI is making history repeat itself. Film and video production has a long history of its own, but this article focuses mostly on how computers have been used in the creative process.

History

The first digital animations (1960s)

John Whitney Sr. was an American animator recognized for some of the earliest and most promising works of computer animation, which he produced on analog computers. An analog computer he built with his brother allowed him to make the demo reel “Catalog” in 1961, which set a new standard for motion graphics in the advertising and entertainment industries.

The demo below is another one from 1968, in which he describes the creative process of a musician or an artist as typically being “realtime”, whereas each image in this sequence took 3-6 seconds to generate. That works out to roughly 30 minutes for a 20-second clip.
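As a rough check on that figure, here is the arithmetic as a small sketch. The 24 frames-per-second rate is my assumption; the reel itself only gives the seconds per image.

```python
# Back-of-the-envelope generation time for a clip at 3-6 seconds per image.
# ASSUMPTION: 24 frames per second (not stated in the source).
FPS = 24


def render_minutes(clip_seconds: float, seconds_per_frame: float, fps: int = FPS) -> float:
    """Total generation time in minutes for a clip of the given length."""
    frames = clip_seconds * fps
    return frames * seconds_per_frame / 60


low = render_minutes(20, 3)   # fastest case: 3 s per image
high = render_minutes(20, 6)  # slowest case: 6 s per image
print(f"{low:.0f} to {high:.0f} minutes")  # prints "24 to 48 minutes"
```

Under that assumption the "~30 minutes" estimate sits near the low end of the range.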

Pixar (1975-present)

About a decade later, George Lucas founded Industrial Light and Magic, which worked closely with the Graphics Group at Lucasfilm. Steve Jobs bought the Graphics Group from Lucasfilm in 1986 and renamed it Pixar, which released “Toy Story” in 1995. This was when motion graphics went truly mainstream.

Together with Disney, Pixar developed CAPS (Computer Animation Production System), a proprietary system that replaced the expensive process of transferring animated drawings to cels using India ink or xerographic technology. Instead, artists could color drawings in a digital environment with an unlimited palette and more sophisticated painting techniques that were not previously available.

Present Day

That brings us to where the industry is today, and the progress has been incredible. In a few minutes, you can generate completely synthetic videos that look quite lifelike. Below is an OpenAI Sora demo.

While this is extremely cool, the types of content people create have expanded from just films to YouTube videos, podcasts, TikToks, and more. A whole wave of AI tools is changing the way each of these types of content is created. Below, I’ll highlight some that have become popular beyond text-to-video, which has gotten the most mainstream attention.

Improving Traditional Video Editing

Desktop products like Premiere Pro and Final Cut Pro came first, followed by web-based products like VEED, Kapwing, and InVideo, in introducing basic features such as automated background removal and audio noise removal. While these seem simple, they sped up the workflow of individual creators who previously would’ve had to set up expensive green screens, studio mics, etc.

Players like Runway came next with AI-based rotoscoping, which took this a step further and gave users more granular control over what to keep and what to remove.

Edit by Transcript

While most products were focused on traditional content, new ones like Descript came along with a different UX for video editing. They focused on podcasters and built their own in-house audio transcription AI models (similar to Whisper from OpenAI). In content like podcasts, the speaker often says a word or two wrong, pauses for too long, or picks up the unwanted background noise that comes with inexpensive production setups. Descript set out to solve this.

Descript Editor

Imagine if you could edit a video like a Word doc: if you said something wrong, you could just replace it with the right word. Descript made this possible with text-to-speech techniques similar to Meta’s Voicebox, and it worked like magic.

Descript Overdub

Now new open-source models like VoiceCraft are making it possible for other products to integrate similar capabilities.
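To make the edit-by-transcript idea concrete, here is a minimal sketch of the deletion half of it. It assumes a transcription model has already produced word-level timestamps; the `Word` type, the `keep_ranges` helper, and the 0.5-second pause threshold are all hypothetical names and values of mine, not Descript's implementation. Deleting a word in the text simply drops its time span from the list of ranges to keep, and long pauses are cut the same way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float


def keep_ranges(words, deleted, max_gap=0.5):
    """Return (start, end) time ranges to keep after a transcript edit.

    `deleted` holds the indexes of words the user removed from the text.
    Consecutive kept words are merged into one range when the silence
    between them is at most `max_gap` seconds; longer pauses are cut.
    """
    ranges = []
    for i, word in enumerate(words):
        if i in deleted:
            continue
        # Only extend the previous range if the previous word was kept too;
        # otherwise merging would pull the deleted word back in.
        follows_kept_word = ranges and (i - 1) not in deleted
        if follows_kept_word and word.start - ranges[-1][1] <= max_gap:
            ranges[-1] = (ranges[-1][0], word.end)
        else:
            ranges.append((word.start, word.end))
    return ranges


# Made-up sample data: the user deletes "um" (index 1) in the transcript.
words = [
    Word("So", 0.0, 0.3),
    Word("um", 0.4, 0.7),
    Word("welcome", 0.8, 1.3),
    Word("back", 1.4, 1.8),
    Word("everyone", 3.5, 4.1),
]
print(keep_ranges(words, deleted={1}))
# [(0.0, 0.3), (0.8, 1.8), (3.5, 4.1)]
```

The resulting ranges could then be handed to any cutting tool. Replacing a word with newly synthesized speech, as Overdub does, needs a voice model on top of this and is not covered by the sketch.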

Content Repurposing

2020 was the year short-form video really took over, mostly through TikTok. Short-form videos like TikToks are edited very differently from traditional content because they’re typically vertical, under 30 seconds long, and need to grab the viewer’s attention quickly without much buildup. Products like Opus grew out of this trend, having realized that they could automate quite a bit of this editing process.

They find the most interesting moments in long-form content, use models that detect who is actively speaking, add captions, and keep the main subjects in frame, all completely automatically. More tools like this are now being built with apps like our autocrop API and models like TalkNet and YOLO.
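The reframing step can be sketched in a few lines. This assumes a detector such as YOLO plus an active-speaker model such as TalkNet has already given you a bounding box for the person speaking; the `vertical_crop` function is a hypothetical helper for illustration, not the autocrop API.

```python
def vertical_crop(frame_w, frame_h, box, aspect=9 / 16):
    """Return (x, y, w, h) of a vertical crop centered on the speaker.

    `box` is the speaker's bounding box as (x1, y1, x2, y2) in pixels.
    The crop uses the full frame height, has a width/height ratio of
    `aspect`, and is clamped so it never leaves the frame.
    """
    crop_h = frame_h
    crop_w = min(frame_w, int(crop_h * aspect))
    crop_w -= crop_w % 2  # many video encoders expect even dimensions
    center_x = (box[0] + box[2]) / 2
    x = int(center_x - crop_w / 2)
    x = max(0, min(x, frame_w - crop_w))  # clamp to the frame edges
    return x, 0, crop_w, crop_h


# Speaker in the middle of a 1920x1080 frame.
print(vertical_crop(1920, 1080, (900, 200, 1020, 700)))   # (657, 0, 606, 1080)
# Speaker near the right edge: the crop is pushed back inside the frame.
print(vertical_crop(1920, 1080, (1500, 200, 1800, 700)))  # (1314, 0, 606, 1080)
```

In practice you would also smooth the crop position across frames so the window does not jitter, which this sketch leaves out.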

AI Avatars

Humans are central to most of the content people produce. Recording yourself perfectly is also quite a barrier to creating high-quality videos, since you want to make sure you’re saying the right things, sitting in the right lighting, etc. Models like Wav2Lip were among the first to show that you could “sync the lips” of a person to a “target audio”, which meant you wouldn’t have to re-record yourself many times as long as you recorded once. HeyGen took this a step further with AI avatars that were the first to be almost indistinguishable from real humans. The video below is a demo of their tech, which syncs this person’s lips to a translated version of the original audio.

This capability applies across a variety of content production settings, from dubbing to marketing, sales, and training videos. Open-source models like VideoReTalking are making progress but are still not close to HeyGen’s quality bar. Approaches like Amazon’s LipNeRF are likely closer to what HeyGen uses under the hood, and I expect open-source models will catch up and enable more customizability and use cases beyond current commercial offerings.

Other tech

There are also a ton of other interesting capabilities that I haven’t listed and that you should check out. Imagine if you didn’t have to look at the camera when recording a video, and you could correct your “Eye Contact” after the fact? This is what NVIDIA’s Broadcast AI does.

What if you could dub content completely automatically through a combination of AI video transcription, translation, text-to-speech, and lipsyncing? This is what our beta dubbing app does.
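The four stages named above chain together naturally. Below is a minimal sketch of that chain, where each stage is a pluggable callable; the `DubbingPipeline` class and the stand-in lambdas are hypothetical and only illustrate the data flow, not how our dubbing app is built.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DubbingPipeline:
    # Each stage is a plain callable so any model or service can be swapped in.
    transcribe: Callable[[str], str]      # video path -> source-language text
    translate: Callable[[str, str], str]  # (text, target language) -> translated text
    synthesize: Callable[[str], str]      # translated text -> path to dubbed audio
    lipsync: Callable[[str, str], str]    # (video path, audio path) -> output video path

    def run(self, video_path: str, target_lang: str) -> str:
        text = self.transcribe(video_path)
        translated = self.translate(text, target_lang)
        audio_path = self.synthesize(translated)
        return self.lipsync(video_path, audio_path)


# Stand-in stages that only show how data moves between steps.
pipeline = DubbingPipeline(
    transcribe=lambda video: "hello world",
    translate=lambda text, lang: f"[{lang}] {text}",
    synthesize=lambda text: "dub.wav",
    lipsync=lambda video, audio: f"{video}+{audio}",
)
print(pipeline.run("talk.mp4", "es"))  # talk.mp4+dub.wav
```

Keeping the stages as separate callables means a better transcription or lipsync model can replace one step without touching the others.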

Or how about being able to automatically organize content libraries with descriptive audiovisual summaries of videos, generated by a combination of AI video transcription and visual captioning models? This is what our video summary AI, Describe, does!

In this video, outside a brick building amidst greenery and a cobblestone path, a man wearing a blue "Xchanging" sweatshirt is seen holding a tray, possibly containing tea. He's approached by someone off-camera, persistently asking if he regrets his comments and offering him a cup of tea. Despite the repeated question, the man in the sweatshirt politely declines to comment, stating he's there on a humanitarian mission to offer tea to the patient interviewer, expressing sympathy but firmly noting he has nothing to say on the matter. The exchange ends with a thanks, highlighting the man's attempt to deflect the questioning with kindness.
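One plausible way to combine transcription and visual captioning into a summary like the one above is to merge both streams by timestamp and hand the result to a language model. The sketch below shows only that merging step; `build_summary_prompt` and the sample data are hypothetical and are not a description of how Describe works internally.

```python
def build_summary_prompt(transcript, captions):
    """Interleave speech and visual captions by timestamp into one prompt.

    transcript: list of (seconds, spoken text)
    captions:   list of (seconds, description of a sampled frame)
    """
    events = [(t, "SPEECH", text) for t, text in transcript]
    events += [(t, "VISUAL", text) for t, text in captions]
    events.sort(key=lambda event: event[0])  # chronological order
    lines = [f"[{t:.1f}s] {kind}: {text}" for t, kind, text in events]
    return "Summarize this video in one paragraph.\n" + "\n".join(lines)


# Made-up sample data loosely based on the summary above.
transcript = [
    (2.0, "Do you regret your comments?"),
    (6.5, "I have nothing to say on the matter."),
]
captions = [
    (0.0, "A man in a blue sweatshirt holds a tray outside a brick building."),
]
print(build_summary_prompt(transcript, captions))
```

The prompt text would then go to whichever summarization model you prefer.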

Conclusion

At Sieve, we’re excited about everything happening in video creation with fresh research being applied to real problems so quickly. The use cases we discuss here are just the tip of the iceberg and we’re excited to build tools that enable others to keep putting these kinds of capabilities into real products. Check out our explore page which highlights some of the stuff we already make available out of the box, or build your own apps with the lower-level tools we offer around combining models and deploying custom apps.