Benchmark Run — 2026-05-11

23 models · ThoughtJack v0.6.0

Scenario Set: 2.1

Scenario Commit

Runs per Scenario

Duration

#	Model	Provider	Resistance↓	Utility	Inj	Exfil	Priv	Instr	Info	SocEng	Avail
1	Claude Opus 4.6	Anthropic	88.3	100.0	100.0	91.7	86.7	66.7	73.3	100.0	100.0
2	Claude Opus 4.7	Anthropic	86.2	100.0	93.3	91.7	86.7	66.7	73.3	100.0	91.7
3	Claude Sonnet 4.6	Anthropic	82.4	100.0	86.7	91.7	86.7	66.7	53.3	100.0	91.7
4	GPT-5.4	OpenAI	80.8	100.0	93.3	62.5	86.7	50.0	73.3	100.0	100.0
5	GPT-5.5	OpenAI	75.2	100.0	93.3	66.7	46.7	66.7	53.3	100.0	100.0
6	Claude Haiku 4.5	Anthropic	71.9	100.0	93.3	50.0	86.7	66.7	40.0	66.7	100.0
7	MiMo V2.5 Pro	OpenRouter	58.9	100.0	66.7	37.5	46.7	66.7	53.3	66.7	75.0
8	GPT-5.4 Mini	OpenAI	58.2	100.0	66.7	37.5	46.7	50.0	40.0	66.7	100.0
9	Gemini 3.1 Pro	Google	57.5	100.0	80.0	37.5	46.7	50.0	46.7	66.7	75.0
10	GLM-5.1	OpenRouter	51.6	100.0	66.7	37.5	46.7	50.0	26.7	66.7	66.7
11	Kimi K2.6	OpenRouter	44.6	100.0	60.0	37.5	26.7	50.0	13.3	33.3	91.7
12	Grok 4.20	xAI	43.9	75.0	0.0	12.5	46.7	50.0	40.0	66.7	91.7
13	Qwen3.6 Plus	OpenRouter	40.6	100.0	40.0	12.5	20.0	50.0	20.0	66.7	75.0
14	MiniMax M2.7	OpenRouter	38.7	100.0	53.3	12.5	40.0	50.0	6.7	33.3	75.0
15	GPT-OSS 120B	OpenRouter	37.9	100.0	33.3	25.0	6.7	50.0	33.3	66.7	50.0
16	Mistral Small 4	OpenRouter	34.5	100.0	46.7	0.0	20.0	66.7	33.3	33.3	41.7
17	Gemini 3 Flash	Google	32.7	100.0	40.0	12.5	46.7	50.0	46.7	33.3	0.0
18	Grok 4.3	xAI	31.0	90.0	40.0	0.0	20.0	50.0	6.7	33.3	66.7
19	DeepSeek V4 Pro	OpenRouter	29.3	100.0	0.0	25.0	0.0	50.0	13.3	66.7	50.0
20	DeepSeek V4 Flash	OpenRouter	27.7	100.0	40.0	12.5	0.0	50.0	0.0	66.7	25.0
21	Gemini 3.1 Flash Lite	Google	26.0	100.0	0.0	0.0	46.7	0.0	26.7	33.3	75.0
22	Mistral Medium 3.5	OpenRouter	19.8	100.0	60.0	0.0	46.7	0.0	6.7	0.0	25.0
23	Mistral Large 3	OpenRouter	11.4	100.0	26.7	0.0	0.0	0.0	53.3	0.0	0.0
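
A note on reading the Resistance column: in the rows above it matches the unweighted mean of the seven category columns (Inj, Exfil, Priv, Instr, Info, SocEng, Avail), with Utility left out. This is an observation from the published numbers, not a documented ThoughtJack formula, so the sketch below should be read as a sanity check under that assumption. The names `ROWS`, `category_mean`, and `check_rows` are hypothetical helpers introduced for illustration, and the tolerance is an assumed allowance for rounding to one decimal place.

```python
from statistics import mean

# A few rows copied from the table above:
# model -> (reported Resistance, [Inj, Exfil, Priv, Instr, Info, SocEng, Avail])
ROWS = {
    "Claude Opus 4.6": (88.3, [100.0, 91.7, 86.7, 66.7, 73.3, 100.0, 100.0]),
    "GPT-5.4": (80.8, [93.3, 62.5, 86.7, 50.0, 73.3, 100.0, 100.0]),
    "Grok 4.20": (43.9, [0.0, 12.5, 46.7, 50.0, 40.0, 66.7, 91.7]),
    "Mistral Large 3": (11.4, [26.7, 0.0, 0.0, 0.0, 53.3, 0.0, 0.0]),
}

# Assumption: both the category scores and the reported Resistance are rounded
# to one decimal, so allow a small gap between recomputed and reported values.
TOLERANCE = 0.1


def category_mean(scores):
    """Unweighted mean of the category scores, rounded like the table."""
    return round(mean(scores), 1)


def check_rows(rows, tol=TOLERANCE):
    """Map each model to True if its recomputed mean matches the reported score."""
    return {
        model: abs(category_mean(scores) - reported) <= tol + 1e-9
        for model, (reported, scores) in rows.items()
    }


if __name__ == "__main__":
    for model, ok in check_rows(ROWS).items():
        reported, scores = ROWS[model]
        print(f"{model}: reported={reported} recomputed={category_mean(scores)} match={ok}")
```

Under this reading, a single weak category pulls the overall score down by one seventh of the gap. Grok 4.20's 0.0 on Inj, for example, costs it roughly 13 points against a model that scores 93.3 there with otherwise identical results.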