Popular production-ready functions for scale
Translate any video or audio content with natural sounding translations and voices.
sieve/dubbing
Smart, automatic cropping of a video to a given aspect ratio based on subject detection and speaker tracking.
autocrop
High quality background removal for images and videos.
background-removal
A comprehensive solution for video lipsyncing with a suite of model and enhancement options.
lipsync
State-of-the-art audio-visual active speaker detection based on new, efficient face and speaker detection models.
active_speaker_det...
Filters for removing background noise, enhancing speech, and more in audio files.
audio-enhance
Fast, high quality speech transcription with many available backends, word-level timestamps, speaker diarization, and translation capabilities.
transcribe
Correct eye contact in videos by redirecting the eyes to look at the camera.
eye-contact-correc...
Generate and render video or audio highlights for long-form content based on search phrases.
highlights
Given a video or audio, generate a title, chapters, summary, tags, and highlights.
transcript-analysi...
A set of text-to-speech models and tooling that helps generate natural-sounding speech, clone voices, control emotions, access word timestamps, and more.
tts
Moderate videos and images for harmful content.
visual-moderation
LivePortrait is a video-driven portrait animation system that can animate a portrait video using another driving video. It can also retarget facial expressions from one image to another.
liveportrait
An active speaker detection model to detect which people are speaking in a video.
talknet-asd
A highly customizable text moderation tool that combines AI and algorithmic methods to detect and manage harmful, inappropriate, or unwanted content in real time.
text-moderation
Download YouTube videos in MP4 format at any resolution.
youtube_to_mp4
Reliable, customizable text translation supporting 200+ languages through complex tokenization and sentence splitting.
translate
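The sentence splitting mentioned for translate can be pictured with a small sketch. This is not the function's actual implementation, which is not described here; it is a naive, assumption-laden illustration in which `split_sentences` and `chunk_sentences` are hypothetical helper names and the splitting rule is a simple punctuation regex.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace.

    Hypothetical helper for illustration only; real splitters handle
    abbreviations, quotes, and scripts without these punctuation marks.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def chunk_sentences(sentences, max_chars):
    """Group consecutive sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sentences = split_sentences("Hello there. How are you? Fine!")
chunks = chunk_sentences(sentences, max_chars=25)
```

Splitting before translating keeps each request small and lets the translated pieces be rejoined in their original order.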
YOLOv8 real-time object detection model with COCO, face, and world variants.
yolov8
Generate depth maps from images or videos.
depth-anything-v2
High-quality speech recognition with major improvements on top of Whisper.
whisper
Generate a portrait avatar from a source image and driving audio with multiple backends and enhancement options.
portrait-avatar
This is an optimized implementation of Segment Anything 2, a model that can dynamically segment objects in an image or video.
sam2
A visual language foundation model that can perform a variety of image and video question-answer tasks, such as object detection, image captioning, segmentation, and OCR.
florence-2
A comprehensive visual question answering app that integrates image and video analysis with text-based queries to provide accurate, structured, and context-aware responses.
visual-qa
MuseTalk is a lip-sync model that generates realistic talking faces from audio input.
musetalk
A diffusion-based, audio-driven portrait animation model.
echomimic
CodeFormer is a face restoration model that can restore low-resolution faces to high resolution.
codeformer
Models for human-centric segmentation, depth estimation, and surface normal estimation. Supports video and image inputs.
sapiens
Demucs is a state-of-the-art music source separation model, currently capable of separating drums, bass, and vocals from the rest of the accompaniment.
demucs
Count OpenAI tokens with tiktoken to manage limits and costs.
tiktoken
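As a rough sketch of what managing limits and costs can look like once a token count is available: the helper names (`approx_token_count`, `fits_limit`, `estimate_cost_usd`) and the price used below are assumptions for illustration, not part of this function. The default counter is a whitespace stand-in so the example runs without dependencies; with the tiktoken package installed, a real counter could be built from `tiktoken.get_encoding("cl100k_base")` and `len(enc.encode(text))`.

```python
def approx_token_count(text):
    """Stand-in counter: whitespace-separated words. NOT a real tokenizer.

    Swap in a tiktoken-based counter for accurate numbers.
    """
    return len(text.split())

def fits_limit(text, limit, count_tokens=approx_token_count):
    """Return True if the text's token count is within the given limit."""
    return count_tokens(text) <= limit

def estimate_cost_usd(token_count, usd_per_million_tokens):
    """Estimate cost from a token count and an assumed per-million-token price."""
    return token_count / 1_000_000 * usd_per_million_tokens

prompt = "count these four words"
n_tokens = approx_token_count(prompt)
within_budget = fits_limit(prompt, limit=8)
# The price below is a made-up figure, not a real rate.
cost = estimate_cost_usd(500_000, usd_per_million_tokens=2.0)
```

Passing the counter as a parameter keeps the budget logic independent of any particular tokenizer.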
An optimized version of VideoReTalking, an audio-based lip synchronization model for talking head video editing in the wild.
video_retalking
Detect scene changes in a video with PySceneDetect.
pyscenedetect
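The idea behind threshold-based scene detection can be sketched in a few lines. This is a simplified illustration, not PySceneDetect's code: it assumes a per-frame change score has already been computed (score `i` compares frame `i` with frame `i - 1`), and `detect_cuts`, `cuts_to_scenes`, the sample scores, and the threshold are all hypothetical.

```python
def detect_cuts(frame_scores, threshold):
    """Return indices of frames whose change score exceeds the threshold."""
    return [i for i, score in enumerate(frame_scores) if score > threshold]

def cuts_to_scenes(cuts, total_frames):
    """Turn cut indices into (start, end) frame ranges, end exclusive."""
    bounds = [0] + [c for c in cuts if 0 < c < total_frames] + [total_frames]
    return list(zip(bounds, bounds[1:]))

# Made-up change scores for an 8-frame clip; index 0 has no previous frame.
scores = [0.0, 1.2, 0.8, 35.0, 2.1, 1.0, 41.5, 0.9]
cuts = detect_cuts(scores, threshold=27.0)      # threshold chosen arbitrarily
scenes = cuts_to_scenes(cuts, total_frames=len(scores))
```

Each returned range covers the frames of one scene, so consecutive ranges share a boundary at the frame where a cut was detected.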
WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
whisperx
Detect faces in images and videos with MediaPipe.
mediapipe_face_det...
Diarize audio using pyannote-audio.
pyannote-diarizati...
Translate text into 96 different languages.
seamless_text2text
Resemble Enhance is an AI-powered tool that improves the overall quality of speech through denoising and enhancement.
resemble-enhance
Detect the language of a text with LangID, a lightweight language identification tool.
langid