We benchmarked 22 language models for adversarial prompt generation. Commercial models refused up to 65% of critical test cases. Here's what actually works.