Transforming a video into a podcast-style conversation between two people is a powerful way to repurpose content, enhance storytelling, and engage your audience. Google's NotebookLM demonstrated just how compelling this format can be. With Sieve, you can achieve it in just a few steps.
In this tutorial, I'll walk you through how I built a function that:
- Downloads a YouTube video
- Summarizes its content into a conversational dialogue between two people
- Generates audio for the summary text
- Generates talking avatars to narrate the speech
Benefits and Use Cases
- Convert lengthy videos into conversational narratives
- Summarization saves time while retaining the core message
- Perfect for storytelling, education, or marketing
- Makes content more engaging with avatars
Building the app from scratch
Install Sieve
First, sign up and get your API key. Then install the Python client and log in.
pip install sievedata
sieve login
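If you'd rather authenticate non-interactively (for example, in CI), the client can also pick up the key from an environment variable; to my knowledge this is SIEVE_API_KEY, but confirm against the Sieve docs for your client version:
import os
os.environ["SIEVE_API_KEY"] = "your-api-key"  # assumption: the sieve client reads this variable

import sieve  # the client picks up the key from the environment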
Run the project
git clone https://github.com/sieve-community/video2dialogue
cd video2dialogue
python pipeline.py
You should see logs streaming in the console. You can also monitor the job's status on the Sieve dashboard. Once the pipeline completes, a video file path will be printed to the console.
Step-by-step explanation
Let's go through each step of the pipeline in detail, alongside its corresponding code.
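The snippets below reference a few inputs (youtube_url, voice1/voice2, and image1/image2) that pipeline.py defines up front. Here is a minimal sketch of that setup; the voice names and image URLs are illustrative placeholders rather than the repo's actual values:
import sieve

# Inputs assumed by the snippets below (placeholder values for illustration)
youtube_url = "https://www.youtube.com/watch?v=..."  # the video to repurpose
voice1, voice2 = "voice-a", "voice-b"  # hypothetical TTS voice names; see sieve/tts for real options
image1 = sieve.File(url="https://example.com/person1.png")  # portrait image for Person 1
image2 = sieve.File(url="https://example.com/person2.png")  # portrait image for Person 2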
Download the YouTube video
We can do this using the sieve/youtube_to_mp4 function. Note that .run blocks until the job finishes and returns its output, so the downloaded video is ready for the next step.
# Initialize remote Sieve functions
youtube_to_mp4 = sieve.function.get("sieve/youtube_to_mp4")
visual_summarizer = sieve.function.get("sieve/visual-qa")
tts = sieve.function.get("sieve/tts")
portrait_avatar = sieve.function.get("sieve/portrait-avatar")
# Step 1: Download YouTube video
print("Downloading YouTube video...")
youtube_video = youtube_to_mp4.run(youtube_url, "highest-available", True)
print("Download complete")
Generate dialogue based on the video
Sieve's Visual-QA function can generate a summary of the input video when given a well-crafted prompt. Since we need the summary in a conversational style, we use the following prompt:
"Summarize the video into a conversation between two people. Denote the first speaker as 'Person 1' and the second speaker as 'Person 2'."
# Step 2: Generate conversational summary
print("Summarizing video...")
summary_prompt = "Summarize the video into a conversation between two people. Denote first speaker as 'Person 1' and second speaker as 'Person 2'."
function_json = {
"type": "list",
"items": {
"type": "object",
"properties": {
"speaker_name": {
"type": "string",
"description": "The speaker name"
},
"dialogue": {
"type": "string",
"description": "dialogue"
}
}
}
}
summary_as_conversation = visual_summarizer.run(
youtube_video,
"gemini-1.5-flash",
summary_prompt,
fps=1,
audio_context=True,
function_json=function_json
)
print("Summary generated")
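Because we pass function_json, the summarizer returns structured output instead of free-form text. Per the schema above, summary_as_conversation is a list of dicts, one per conversational turn; the dialogue text here is just an illustration:
# Illustrative shape of the structured output (not actual model output)
summary_as_conversation = [
    {"speaker_name": "Person 1", "dialogue": "So, what's this video about?"},
    {"speaker_name": "Person 2", "dialogue": "It covers the history of Thanksgiving..."},
    # ...one entry per turn
]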
Generate speech and avatars
Now we process each turn of the conversation: first we convert the dialogue text into speech using Sieve's tts function, then we feed that speech into Sieve's portrait-avatar function to create a talking-head avatar. Note that .push (unlike .run) returns a future immediately, so the avatar jobs for different turns run concurrently while the loop continues.
# Step 3: Generate speech and avatar videos for each dialogue turn
print("Generating speech and avatar videos...")
reference_audio = sieve.File(url="") # Required placeholder
avatar_videos = []
for entry_num, entry in enumerate(summary_as_conversation):
if entry['speaker_name'] == "Person 1":
voice = voice1
image = image1
elif entry['speaker_name'] == "Person 2":
voice = voice2
image = image2
else:
raise ValueError(f"Unknown speaker: {entry['speaker_name']}")
    # Generate speech for this turn, then queue avatar generation as a future
    target_audio_future = tts.push(voice, entry['dialogue'], reference_audio, "curiosity")
    avatar_video_future = portrait_avatar.push(
        source_image=image,
        driving_audio=target_audio_future.result(),  # blocks until the TTS job completes
        aspect_ratio="1:1"
    )
    avatar_videos.append(avatar_video_future)
Re-encode and merge
Finally, we re-encode the videos to ensure a consistent frame rate and then merge them into a single output video, using the reencode_video and merge_videos helpers from pipeline.py (a sketch of these helpers follows the code below).
for entry_num, avatar_video_future in enumerate(avatar_videos):
try:
avatar_video = avatar_video_future.result()
normalized_video = f"normalized_{entry_num}.mp4"
reencode_video(avatar_video.path, normalized_video)
avatar_videos[entry_num] = normalized_video
except Exception as e:
print(f"Error processing turn {entry_num}: {e}")
raise e
# Step 4: Merge avatar videos in sequence
output_path = 'output_avatar_video.mp4'
merge_videos(avatar_videos, output_path)
print('Video generation complete')
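pipeline.py ships its own reencode_video and merge_videos helpers. A minimal sketch of how they might be implemented with ffmpeg (assuming ffmpeg is on your PATH; the exact flags in the repo may differ) looks like this:
import subprocess

def reencode_video(input_path: str, output_path: str, fps: int = 25) -> None:
    # Re-encode to a common frame rate and codec so the clips concatenate cleanly
    subprocess.run(
        ["ffmpeg", "-y", "-i", input_path, "-r", str(fps),
         "-c:v", "libx264", "-c:a", "aac", output_path],
        check=True,
    )

def merge_videos(paths: list, output_path: str) -> None:
    # Use ffmpeg's concat demuxer: write a file list, then concatenate without re-encoding
    with open("concat_list.txt", "w") as f:
        for p in paths:
            f.write(f"file '{p}'\n")
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", "concat_list.txt", "-c", "copy", output_path],
        check=True,
    )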
Overview of the Sieve Functions Used
| Project Task | Function |
|---|---|
| Download the YouTube video | sieve/youtube_to_mp4 |
| Summarize it into a conversational dialogue | sieve/visual-qa |
| Convert each dialogue turn into speech | sieve/tts |
| Generate talking avatars that speak the dialogue | sieve/portrait-avatar |
| Join the generated videos | ffmpeg concat |
Output Preview
Here is a video generated using the script described above, with this YouTube video about Thanksgiving as input: url.
Future Enhancements
While the current implementation effectively transforms YouTube videos into engaging conversations between avatars, there are several opportunities to expand and enhance its functionality:
- Integration of ChatTTS Models: Incorporate conversational TTS models to generate the speech for the entire conversation text in one step, eliminating the need to process individual turns iteratively.
- Conversational Talking-Listening Heads: Introduce avatars that not only talk but also exhibit listening behaviors, such as nodding or maintaining eye contact. This feature would allow the creation of videos where avatars appear to converse naturally within the same frame. See: conversational head generation challenge.
- Language Support: Add support for multiple languages to cater to a global audience.
Conclusion
In this tutorial, we explored how to transform YouTube videos into engaging conversational avatars using Sieve's powerful suite of functions. By combining video processing, text summarization, text-to-speech, and avatar generation capabilities, we've created a streamlined pipeline that opens up new possibilities for content repurposing. Whether you're a content creator, educator, or developer, this approach offers an innovative way to present information in a more dynamic and engaging format. You can get started with the pipeline here, or book a demo with our team to learn more.