Transforming YouTube Videos into NotebookLM-like Conversational Avatars
Learn how we built a pipeline to turn YouTube videos into conversational podcasts using various functions readily available on Sieve.
by Akshara Soman

Transforming a video into a podcast-style conversation between two people is a powerful way to repurpose content, enhance storytelling, and engage your audience. NotebookLM from Google showed us the power of this. With Sieve, you can achieve this effortlessly in just a few steps.

In this tutorial, I'll walk you through how I built a function that:

  1. Downloads a YouTube video
  2. Summarizes its content into a conversational dialogue between two people
  3. Generates audio for the summary text
  4. Generates talking avatars to narrate the speech

Benefits and Use Cases

  • Convert lengthy videos into conversational narratives
  • Summarization saves time while retaining the core message
  • Perfect for storytelling, education, or marketing
  • Makes content more engaging with avatars

Building the app from scratch

Install Sieve

First, sign up and get your API key. Then install the Python client and log in.

pip install sievedata
sieve login
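
The login command prompts for your API key, which you can copy from the Sieve dashboard.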

Run the project

git clone https://github.com/sieve-community/video2dialogue
cd video2dialogue
python pipeline.py

You should see logs streaming in the console. You can also monitor the job's status on the Sieve dashboard. Once the pipeline completes, a video file path will be printed to the console.
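
The script's inputs (the YouTube URL, plus a TTS voice and a portrait image for each of the two speakers) are defined inside pipeline.py, so edit them there to point the pipeline at your own video and avatars.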

Step-by-step explanation

Let's go through each step of the pipeline alongside its corresponding code.

Download the YouTube video

We can do this using the sieve/youtube_to_mp4 function.

import sieve

# youtube_url, voice1/voice2 (TTS voices), and image1/image2 (portrait
# images) are the pipeline's inputs, defined elsewhere in pipeline.py.

# Initialize remote Sieve functions
youtube_to_mp4 = sieve.function.get("sieve/youtube_to_mp4")
visual_summarizer = sieve.function.get("sieve/visual-qa")
tts = sieve.function.get("sieve/tts")
portrait_avatar = sieve.function.get("sieve/portrait-avatar")

# Step 1: Download the YouTube video at the highest available resolution;
# the boolean flag keeps the audio track, which we need later
print("Downloading YouTube video...")
youtube_video = youtube_to_mp4.run(youtube_url, "highest-available", True)
print("Download complete")

Generate dialogue based on the video

Sieve's Visual-QA function can generate a summary of the input video given a well-crafted prompt. Since we need the summary in a conversational style, we use the following prompt:

"Summarize the video into a conversation between two people. Denote the first speaker as 'Person 1' and the second speaker as 'Person 2'."

# Step 2: Generate conversational summary
print("Summarizing video...")
summary_prompt = "Summarize the video into a conversation between two people. Denote first speaker as 'Person 1' and second speaker as 'Person 2'."
function_json = {
    "type": "list",
    "items": {
        "type": "object",
        "properties": {
            "speaker_name": {
                "type": "string",
                "description": "The speaker name"
            },
            "dialogue": {
                "type": "string",
                "description": "dialogue"
            }
        }
    }
}
summary_as_conversation = visual_summarizer.run(
    youtube_video,
    "gemini-1.5-flash",           # backend model
    summary_prompt,
    fps=1,                        # sample one frame per second
    audio_context=True,           # also pass the audio track to the model
    function_json=function_json   # request structured output matching the schema
)
print("Summary generated")

Generate speech and avatars

Now we process each turn of the conversation: first we convert the text to speech with Sieve's TTS function, then we feed that speech to Sieve's portrait-avatar function to create a talking-head avatar.

# Step 3: Generate speech and avatar videos for each dialogue turn
print("Generating speech and avatar videos...")
reference_audio = sieve.File(url="")  # Required placeholder

avatar_videos = []
for entry_num, entry in enumerate(summary_as_conversation):
    # Pick the voice and portrait that belong to this speaker
    if entry['speaker_name'] == "Person 1":
        voice = voice1
        image = image1
    elif entry['speaker_name'] == "Person 2":
        voice = voice2
        image = image2
    else:
        raise ValueError(f"Unknown speaker: {entry['speaker_name']}")

    # Synthesize this turn's dialogue (the last argument sets the speaking
    # emotion), then animate the matching portrait with the resulting audio
    target_audio_future = tts.push(voice, entry['dialogue'], reference_audio, "curiosity")
    avatar_video_future = portrait_avatar.push(
        source_image=image,
        driving_audio=target_audio_future.result(),
        aspect_ratio="1:1"
    )
    avatar_videos.append(avatar_video_future)
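
A note on run versus push: run blocks until the job finishes, while push submits the job and immediately returns a future. We resolve the TTS future right away because portrait-avatar needs its audio, but we only collect the avatar futures, so the avatar jobs for different turns can run in parallel.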

Re-encode and merge

Finally, we can re-encode the videos to ensure consistent frame rate and merge them into a single output video.

for entry_num, avatar_video_future in enumerate(avatar_videos):
    try:
        avatar_video = avatar_video_future.result()
        normalized_video = f"normalized_{entry_num}.mp4"
        reencode_video(avatar_video.path, normalized_video)  # ffmpeg helper from the repo
        avatar_videos[entry_num] = normalized_video
    except Exception as e:
        print(f"Error processing turn {entry_num}: {e}")
        raise

# Step 4: Merge avatar videos in sequence
output_path = 'output_avatar_video.mp4'
merge_videos(avatar_videos, output_path)  # ffmpeg concat helper from the repo
print('Video generation complete')
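
The reencode_video and merge_videos helpers come from the repo; the sketch below shows one way they could be implemented with ffmpeg's concat demuxer. It assumes ffmpeg is on your PATH, and the 25 fps target and codec choices are illustrative rather than necessarily what the repo uses.

import subprocess

def reencode_video(input_path: str, output_path: str, fps: int = 25) -> None:
    # Force a uniform frame rate and common codecs so every clip matches
    subprocess.run(
        ["ffmpeg", "-y", "-i", input_path,
         "-r", str(fps), "-c:v", "libx264", "-c:a", "aac",
         output_path],
        check=True,
    )

def merge_videos(video_paths, output_path: str) -> None:
    # ffmpeg's concat demuxer reads a text file listing the clips in order
    with open("concat_list.txt", "w") as f:
        for path in video_paths:
            f.write(f"file '{path}'\n")
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", "concat_list.txt", "-c", "copy", output_path],
        check=True,
    )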

Overview of the Sieve Functions Used

Project Task                                      Function
Download a YouTube video                          sieve/youtube_to_mp4
Summarize it into a conversational dialogue       sieve/visual-qa
Convert the dialogue into speech for two voices   sieve/tts
Make talking avatars that speak the dialogue      sieve/portrait-avatar
Join the generated videos                         ffmpeg concat

Output Preview

Here is a video generated using the script described above, with this YouTube video about Thanksgiving as input.

Future Enhancements

While the current implementation effectively transforms YouTube videos into engaging conversations between avatars, there are several opportunities to expand and enhance its functionality:

  • Integration of ChatTTS Models: Incorporate conversational TTS models to generate the speech for the entire conversation text in one step, eliminating the need to process individual turns iteratively.
  • Conversational Talking-Listening Heads: Introduce avatars that not only talk but also exhibit listening behaviors, such as nodding or maintaining eye contact. This feature would allow the creation of videos where avatars appear to converse naturally within the same frame. See: conversational head generation challenge.
  • Language Support: Add support for multiple languages to cater to a global audience.

Conclusion

In this tutorial, we explored how to transform YouTube videos into engaging conversational avatars using Sieve's powerful suite of functions. By combining video processing, text summarization, text-to-speech, and avatar generation capabilities, we've created a streamlined pipeline that opens up new possibilities for content repurposing. Whether you're a content creator, educator, or developer, this approach offers an innovative way to present information in a more dynamic and engaging format. You can get started with the pipeline here, or book a demo with our team to learn more.