Smart, automatic cropping of a video to a given aspect ratio based on subject detection and speaker tracking.


This app takes an input video and automatically crops it to a specified aspect ratio based on smart subject tracking, speaker detection, and more.

Autocrop currently works best for videos in which people speak on camera.

  • ✅ Podcasts
  • ✅ Commentaries
  • ✅ Product Reviews
  • ✅ Educational Videos
  • ✅ Single Speaker Speeches
  • ❌ Crowds of People or Busy Background Scenes
  • ❌ Sports Games
  • ❌ Vlogs
  • ❌ Music Videos
  • ❌ Online Gaming

Some key features include:

  • Subject tracking: The app tracks the subjects in the video (currently limited to people) and crops the video to keep them in frame.
  • Speaker detection: The app can detect who is speaking in the video and crop the video to focus on them.
  • Dynamic layout: The app can dynamically choose between layouts of 1, 2, 3, or 4 subjects at a time.
  • Automatic aspect ratio: The app can automatically crop the video to a specified aspect ratio.
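
To make the idea of cropping to a target aspect ratio concrete, here is a minimal sketch of how a crop window could be placed around a subject. It is not the app's actual algorithm: the function name `fit_crop` and the choice to center horizontally on a single subject position are assumptions made for illustration only.

```python
def fit_crop(frame_w, frame_h, ratio_w, ratio_h, center_x):
    """Return (x1, y1, x2, y2) for the largest ratio_w:ratio_h window
    that fits in the frame, horizontally centered near center_x.

    Hypothetical helper for illustration; not part of the app.
    """
    if frame_w * ratio_h > frame_h * ratio_w:
        # Frame is wider than the target ratio: keep full height.
        crop_h = frame_h
        crop_w = frame_h * ratio_w // ratio_h
    else:
        # Frame is narrower than (or equal to) the target: keep full width.
        crop_w = frame_w
        crop_h = frame_w * ratio_h // ratio_w

    # Center the window on the subject, then clamp it inside the frame.
    x1 = center_x - crop_w // 2
    x1 = max(0, min(x1, frame_w - crop_w))
    # Vertically, simply center the window in the frame.
    y1 = (frame_h - crop_h) // 2
    return x1, y1, x1 + crop_w, y1 + crop_h


if __name__ == "__main__":
    # A 1920x1080 frame cropped to 9:16 around a subject at x=960.
    print(fit_crop(1920, 1080, 9, 16, 960))
```

The clamping step is what keeps the window inside the frame when a subject stands near the left or right edge.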


Parameters

  • file: The input video file.
  • active_speaker_detection: Whether to use active speaker detection to crop the video to the active speaker.
  • aspect_ratio: The aspect ratio to crop the video to.
  • return_video: Whether to return the resulting video or just the metadata required to crop the video.
  • include_subjects: Whether or not to include the information about the subjects (people, faces, etc) of the crops in the metadata. Defaults to False.
  • include_non_active_layouts: Whether to also include in the output metadata the layouts that would be produced with speaker detection turned off. This is useful if you want to switch between active-speaker and non-active-speaker layouts without reprocessing the video. Defaults to False and is only used if active_speaker_detection is True and return_video is False.
  • prompt: Experimental feature. Currently limited to the classes listed at the bottom of this README. Soon to support any natural language prompt. Defaults to
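
The interaction between these options can be sketched in code. The snippet below is an illustrative model only: the `AutocropOptions` class and the `non_active_layouts_in_effect` helper are hypothetical names, not part of the app. Options whose defaults are not stated above are left as required fields rather than guessed.

```python
from dataclasses import dataclass


@dataclass
class AutocropOptions:
    """Hypothetical container mirroring the documented options."""
    file: str
    active_speaker_detection: bool   # default not documented here
    aspect_ratio: str                # e.g. "9:16"
    return_video: bool               # default not documented here
    include_subjects: bool = False
    include_non_active_layouts: bool = False


def non_active_layouts_in_effect(opts: AutocropOptions) -> bool:
    """include_non_active_layouts only matters when speaker detection
    is on and only metadata (not a video) is requested."""
    return (
        opts.include_non_active_layouts
        and opts.active_speaker_detection
        and not opts.return_video
    )


if __name__ == "__main__":
    opts = AutocropOptions(
        file="input.mp4",
        active_speaker_detection=True,
        aspect_ratio="9:16",
        return_video=False,
        include_non_active_layouts=True,
    )
    print(non_active_layouts_in_effect(opts))
```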

Metadata Output Format

When return_video is set to false, the app will return the following metadata in JSON format per frame as it processes the video:

  • frame_number: The frame number of the video.
  • frame_width: The width of the video frame.
  • frame_height: The height of the video frame.
  • crops: An array of crop objects, each containing:
    • x1: The x-coordinate of the top left corner of the crop.
    • y1: The y-coordinate of the top left corner of the crop.
    • x2: The x-coordinate of the bottom right corner of the crop.
    • y2: The y-coordinate of the bottom right corner of the crop.
    • apply_letterbox: A boolean value indicating whether to apply a black border around the frame. This is used when it's decided that it's best not to crop in certain scenarios and instead show the whole frame with borders.
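
As a rough illustration of how a consumer might use these fields, the sketch below computes the size of a crop from its corner coordinates and, for the letterbox case, how the whole frame could be scaled and padded to fit an output canvas. The helper names `crop_size` and `letterbox_layout` and the output canvas size are assumptions; the README does not prescribe how letterboxing should be rendered.

```python
def crop_size(crop):
    """Width and height of a crop object with x1, y1, x2, y2 fields."""
    return crop["x2"] - crop["x1"], crop["y2"] - crop["y1"]


def letterbox_layout(frame_w, frame_h, out_w, out_h):
    """Scale the whole frame to fit inside an out_w x out_h canvas.

    Returns (scaled_w, scaled_h, pad_x, pad_y), where pad_x and pad_y
    are the border sizes on the left and top. Hypothetical helper.
    """
    scale = min(out_w / frame_w, out_h / frame_h)
    scaled_w = int(frame_w * scale)
    scaled_h = int(frame_h * scale)
    pad_x = (out_w - scaled_w) // 2
    pad_y = (out_h - scaled_h) // 2
    return scaled_w, scaled_h, pad_x, pad_y


if __name__ == "__main__":
    crop = {"x1": 657, "y1": 0, "x2": 1264, "y2": 1080, "apply_letterbox": False}
    print(crop_size(crop))
    # Whole 1920x1080 frame shown inside a 1080x1920 (9:16) canvas.
    print(letterbox_layout(1920, 1080, 1080, 1920))
```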

The crops are keyed by the aspect ratio of the crop. For example, if the aspect ratio is 9:16, the crop would be keyed by "9:16". If you turn on active speaker detection, the key would be "9:16-active-speaker" instead.
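
The keying rule above can be expressed as a small helper. This is a sketch under assumptions: the function names are hypothetical, and the sample mapping assumes each key points to a list of crop objects, which this README does not spell out.

```python
def crop_key(aspect_ratio, active_speaker_detection):
    """Build the metadata key for a given aspect ratio, following the
    documented rule: "9:16" or "9:16-active-speaker"."""
    if active_speaker_detection:
        return f"{aspect_ratio}-active-speaker"
    return aspect_ratio


def crops_for(keyed_crops, aspect_ratio, active_speaker_detection):
    """Look up crops in a mapping keyed as described above.

    Returns an empty list if the key is absent. The mapping shape is
    an assumption made for this example.
    """
    return keyed_crops.get(crop_key(aspect_ratio, active_speaker_detection), [])


if __name__ == "__main__":
    # Hypothetical sample data, not real app output.
    sample = {
        "9:16-active-speaker": [
            {"x1": 657, "y1": 0, "x2": 1264, "y2": 1080, "apply_letterbox": False}
        ]
    }
    print(crop_key("9:16", True))
    print(crops_for(sample, "9:16", True))
```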

Prompt Usage (Experimental)

Currently, the app works best on people-related content, but we are starting to support prompts through the prompt and negative_prompt parameters. prompt is a comma-separated list of classes to look for in the video, each described in natural language, for example "a person speaking" or "a person talking". negative_prompt is a comma-separated list of classes to avoid in the video, for example "a person not speaking" or "a person not talking". The app then uses both lists to determine the best layout and crop for the video.
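
Since both parameters are comma-separated strings, it can help to build and inspect them as lists. The helpers below are a hypothetical convenience sketch, not something the app provides; how the app itself parses these strings is not documented here.

```python
def split_classes(prompt):
    """Split a comma-separated prompt into a list of class phrases,
    trimming whitespace and dropping empty entries."""
    return [part.strip() for part in prompt.split(",") if part.strip()]


def join_classes(classes):
    """Join class phrases back into a comma-separated prompt string."""
    return ", ".join(classes)


if __name__ == "__main__":
    prompt = join_classes(["a person speaking", "a person talking"])
    negative_prompt = join_classes(["a person not speaking", "a person not talking"])
    print(prompt)
    print(split_classes(negative_prompt))
```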

smart_edit is a preset that sets prompt and negative_prompt to the following classes:

  • prompt: the subject of focus, the most important thing in the image, the main person speaking, the main object in the scene, the object that stands out the most
  • negative_prompt: background object, logo, small object, blurry object, backdrop graphic, background graphic, a large crowd, blurry graphic, news headline graphic, the back of a head, crowded area, table, set prop, a person whose face isn't visible, a person in a crowd, logo graphic

Known Limitations

  • The app currently performs poorly when there are large crowds of people. Think scenes such as political rallies with people behind the speaker, large audience crowds, busy streets, etc.
  • The app works best when there are 3 or fewer subjects in the frame. It may still work with 4 or more subjects, but the results may be less stable.
  • Speaker detection works best when speakers are closer to the camera. Far away speakers may not always be classified as active speakers.