A key risk of large language models (LLMs) is the generation of harmful content – malware, phishing emails, DDoS scripts, drug synthesis instructions, and so on.
LLM providers typically rely on three main mitigating controls for harmful content (a rough sketch of the first two follows the list):
- Input validation and sanitization
- Output validation and sanitization
- Training dataset sanitization
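As a minimal illustration of the first two controls, the Python sketch below wraps a chat completion call with OpenAI's moderation endpoint on both the incoming prompt and the outgoing response. The model name and the refusal messages are placeholders I chose for the example, not anything a provider actually ships, and real deployments layer far more than a single moderation check.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def is_flagged(text: str) -> bool:
    """Return True if the moderation endpoint flags the text as harmful."""
    resp = client.moderations.create(input=text)
    return resp.results[0].flagged


def guarded_completion(user_prompt: str) -> str:
    # Input validation: screen the prompt before it ever reaches the model.
    if is_flagged(user_prompt):
        return "Request declined: prompt flagged by moderation."

    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat-capable model works here
        messages=[{"role": "user", "content": user_prompt}],
    )
    answer = completion.choices[0].message.content

    # Output validation: screen the model's answer before returning it.
    if is_flagged(answer):
        return "Response withheld: output flagged by moderation."
    return answer
```

In practice the input and output checks are separate layers precisely because prompts that slip past input filtering (for example, via role-play framing) can still be caught when the generated output itself is screened.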
In my testing, the most effective method of adversarial prompting has been role-play/persona patterns. The results of basic role-play prompts against Claude and ChatGPT are below.
Claude
During my testing, Claude was relatively good at detecting malicious requests hidden in complex prompts. Requests for steps to make MDMA and cocaine, draft phishing emails, and generate DDoS code were all declined.
OpenAI
OpenAI's model was less adept, and I was able to get a response to every request. Its safeguards against harmful output are very limited and immature.
Instructions to make cocaine:
JavaScript code to perform DDoS:
JavaScript code to create a keylogger: