• meteokr@community.adiquaints.moe
    link
    fedilink
    arrow-up
    5
    ·
    3 months ago

    Would the red team use a prompt to instruct the second LLM to comply? I believe the HordeAI system uses this type of mitigation to avoid generating images that are harmful, by flagging them with a first pass LLM. Layers of LLMs would only delay an attack vector like this, if there’s no human verification of flagged content.