Text Moderation (Alpha Release)

Sieve's text moderation provides a robust solution for moderating text in real time, identifying harmful, inappropriate, or unwanted content across categories such as bullying, hate speech, and sexual content. The app is highly customizable: it supports custom words, word indices for filtering or censoring content, and additional classes, giving developers the ability to adapt moderation to their specific needs.

Note: This is an experimental Alpha release. Features may change, and some functionality might not yet be fully stable.

Key Features

  • Advanced Moderation: Leverages a combination of AI and algorithmic approaches to detect harmful content across multiple categories such as bullying, sexual exploitation, and violence.
  • Additional Classes: Enables developers to choose specific content categories for moderation beyond the default classes. By selecting from the additional classes list, users can tailor the moderation to their platform's needs. For instance, if a developer wants to moderate political discussions, they can easily add and moderate that category, providing full control over the moderation process.
  • Filters: Enables filtering capabilities for detecting and handling sensitive information such as phone numbers and addresses. Developers can select which filters to apply, and the app will return the precise location (start and end index) of the detected items within the text. This feature allows for targeted content management and protection of personal information. More details on filter usage and output can be found in the Filters section.
  • Custom Words: Enables developers to provide a list of community-specific words to be filtered. More information can be found in the Custom Words section.
  • Contextual Understanding: Understands the intent behind messages even when they aren't explicitly sensitive, picking up on context such as combinations of emojis or subtle language cues that may imply inappropriate content (for example, sexually suggestive messages), ensuring a more nuanced approach to moderation.
  • High Performance: Designed for speed and scalability, the app can efficiently handle millions of messages in real-time, ensuring fast moderation without compromising accuracy.
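
The app can be called like any Sieve function from the Python client. The following is a minimal sketch, assuming the app is published under a path like sieve/text-moderation and accepts a text input; both names are assumptions, so check the app page for the exact interface:

import sieve

# Hypothetical app path; look up the exact name on the app page.
moderate = sieve.function.get("sieve/text-moderation")

# "text" is an assumed parameter name for the message to moderate.
output = moderate.run(text="you're such a loser, nobody likes you")
print(output)  # e.g. [{"classes": [{"class": "bullying", "score": 2}]}]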

Future Work

  • CSV/Text File Support: Support for CSV and text files is coming soon.

Pricing

| Number of Characters | Price |
| -------------------- | ----- |
| 1 million characters | $0.50 |

Default Moderation Classes

By default, the app moderates messages for the following moderation classes:

| Safety Flag | Label | Description |
| ----------- | ----- | ----------- |
| Sexual | S | Classify text for sexually explicit or suggestive content |
| Violence | V | Flag content containing extreme threats of violence |
| Bullying | B | Identify bullying or abusive language in real time |
| Hate | H | Detect hate speech with high levels of granularity |
| Spam | SP | Mark language designed to take you to a different platform as spam |
| Drugs | D | Flag text that discusses or promotes the sale, possession, or usage of drugs |
| Child Exploitation | CE | Identify content that mentions or explicitly alludes to child sexual exploitation |
| Child Safety | CS | Detect threats of physical violence targeted at children in a school or school-related setting |
| Gibberish | G | Mark keyboard spam and phrases or words that are completely incomprehensible as gibberish |
| Phone Numbers | PN | Detect phone numbers in message strings, including international formats |
| Promotions | PR | Identify promotional content that redirects to another platform or requests an action such as reposting, donating, etc. |
| Weapons | W | Flag content that mentions knives, guns, personal weapons, and accessories such as ammunition, holsters, etc. |

Scoring

The scoring factor indicates the severity of the message. Some classes have multiple scores (0, 1, 2, 3), while others, such as spam, are binary (0, 3).

| Scoring Type | Class Names | Valid Scores |
| ------------ | ----------- | ------------ |
| Non-Binary | Sexual, Hate, Violence, Bullying, Drugs, Weapons | 0, 1, 2, 3 |
| Binary | Custom Words, Child Exploitation, Child Safety, Self Harm, Gibberish, Spam, Promotions, Redirection, Phone Numbers | 0, 3 |

Note: Non-binary classes are scored from 0 to 3, with higher scores indicating more severe content. Binary classes are scored as 0 (no violation) or 3 (violation detected).
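
As an illustration, callers can map these scores to enforcement actions on their side. The thresholds below are a minimal sketch, not behavior built into the app:

def action_for(score: int) -> str:
    # 3 is the highest non-binary severity and the only violation value
    # for binary classes, so treat it as a hard block.
    if score >= 3:
        return "block"
    if score == 2:
        return "review"              # moderately severe: queue for human review
    if score == 1:
        return "allow_with_warning"  # mild: warn or log
    return "allow"                   # 0: no violation detected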

Examples

The app flags messages based on the detected content. If a message contains multiple moderation classes, it returns each of those classes with a severity score:

[
  {
    "classes": [
      {
        "class": "bullying",
        "score": 2
      }
    ]
  },
  {
    "classes": [
      {
        "class": "bullying",
        "score": 2
      },
      {
        "class": "sexual_exploitation",
        "score": 3
      }
    ]
  }
]
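
A minimal sketch of consuming this output, assuming one entry per input message with the class/score shape shown above:

results = [
    {"classes": [{"class": "bullying", "score": 2}]},
    {"classes": [{"class": "bullying", "score": 2},
                 {"class": "sexual_exploitation", "score": 3}]},
]

for i, message in enumerate(results):
    # Highest severity across all classes flagged for this message.
    worst = max((c["score"] for c in message["classes"]), default=0)
    flagged = [c["class"] for c in message["classes"] if c["score"] > 0]
    print(f"message {i}: max severity {worst}, classes {flagged}")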

Additional Classes

If a developer wants to moderate more classes in addition to the default classes, they can use the following additional classes. To include any of them in the moderation results, simply pass a list of safety flags or labels from the table below to the additional_classes parameter.

| Safety Flag | Label | Description |
| ----------- | ----- | ----------- |
| Death, Harm & Tragedy | DHT | Human deaths, tragedies, accidents, disasters, and self-harm. |
| Public Safety | PS | Services and organizations that provide relief and ensure public safety. |
| Health | HL | Human health, including health conditions, diseases, disorders, medical therapies, medication, vaccination, medical practices, and resources for healing, including support groups. |
| Religion & Belief | RB | Belief systems that deal with the possibility of supernatural laws and beings: religion, faith, belief, spiritual practice, churches, and places of worship. Includes astrology and the occult. |
| War & Conflict | WC | War, military conflicts, and major physical conflicts involving large numbers of people. Includes discussion of military services, even if not directly related to a war or conflict. |
| Finance | F | Consumer and business financial services, such as banking, loans, credit, investing, and insurance. |
| Politics | P | Political news and media; discussions of social, governmental, and public policy. |
| Legal | L | Law-related content, including law firms, legal information, primary legal materials, paralegal services, legal publications and technology, expert witnesses, litigation consultants, and other legal service providers. |

Note: Any safety flag or label that is not part of the table above will result in an exception when added to the additional_classes list.
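
For example, enabling the Politics and Finance classes might look like the sketch below. The additional_classes parameter is documented above; the app path and the text parameter name are assumptions, so verify them against the app page:

import sieve

moderate = sieve.function.get("sieve/text-moderation")  # hypothetical path
output = moderate.run(
    text="Thoughts on the new tax bill?",  # "text" is an assumed name
    additional_classes=["P", "F"],         # Politics and Finance labels
)
print(output)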

Scoring

| Scoring Type | Class Names | Valid Scores |
| ------------ | ----------- | ------------ |
| Binary | Death, Harm & Tragedy, Weapons, Public Safety, Health, Religion & Belief, War & Conflict, Finance, Politics, Legal | 0, 3 |
| Non-Binary | - | - |

Note: Each class is scored based on the severity of the content. For binary classes, scores are either 0 (no mention) or 3 (explicit mention).

Notes

Filters

The filters parameter enables granular content filtering by returning the start and end index of detected words or patterns. This feature allows developers to precisely identify and manage potentially harmful or sensitive content. Available filter options include:

  • None: No filtering applied
  • all: Apply all available filters
  • profanity: Detect profane language
  • phone-numbers: Identify phone number patterns
  • phone-numbers-and-addresses: Detect both phone numbers and address formats

Developers can choose one of these options to tailor the filtering process to their specific needs.

The filter functionality is robust enough to detect obfuscated words, such as "f*ck" or "@ss", ensuring that attempts to bypass the filter are still caught.

Example usage:

{
  "filters": [
    {
      "value": "f*ck",
      "class": "profanity",
      "start_index": 0,
      "end_index": 4
    }
  ]
}
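
As a sketch of acting on this output, the returned indices can be used to redact detected spans in the original text. This assumes end_index is exclusive, consistent with the (0, 4) span for "f*ck" above; verify against real output:

def redact(text: str, detected: list) -> str:
    # Overwrite each detected span with asterisks, leaving the rest intact.
    chars = list(text)
    for item in detected:
        for i in range(item["start_index"], min(item["end_index"], len(chars))):
            chars[i] = "*"
    return "".join(chars)

print(redact("f*ck this", [{"value": "f*ck", "class": "profanity",
                            "start_index": 0, "end_index": 4}]))
# -> "**** this"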

Custom Words

To filter community-specific words, provide the list of words in the custom_words parameter. If any of the words are found, they are returned under the filters key with type custom. Note that the custom_words parameter only works when filters are enabled.

{
  "filters": [
    {
      "value": "L-rizz",
      "type": "custom",
      "start_index": 42,
      "end_index": 49
    }
  ]
}
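
A hedged sketch of enabling custom words together with filters follows; custom_words and filters are the parameters documented above, while the app path and the text parameter name are assumptions:

import sieve

moderate = sieve.function.get("sieve/text-moderation")  # hypothetical path
output = moderate.run(
    text="no cap, that was pure L-rizz",  # "text" is an assumed name
    filters="all",                        # custom_words requires filters to be enabled
    custom_words=["L-rizz"],              # community-specific words to flag
)
print(output)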