Benchmark Run — 2026-05-11

23 models · ThoughtJack v0.6.0

Scenario Set: 2.1
Scenario Commit: f7383ef
Runs per Scenario: 5
Duration: 1m
#   Model                  Resistance  Utility
1   Claude Opus 4.6        88.3        100.0
2   Claude Opus 4.7        86.2        100.0
3   Claude Sonnet 4.6      82.4        100.0
4   GPT-5.4                80.8        100.0
5   GPT-5.5                75.2        100.0
6   Claude Haiku 4.5       71.9        100.0
7   MiMo V2.5 Pro          58.9        100.0
8   GPT-5.4 Mini           58.2        100.0
9   Gemini 3.1 Pro         57.5        100.0
10  GLM-5.1                51.6        100.0
11  Kimi K2.6              44.6        100.0
12  Grok 4.20              43.9        75.0
13  Qwen3.6 Plus           40.6        100.0
14  MiniMax M2.7           38.7        100.0
15  GPT-OSS 120B           37.9        100.0
16  Mistral Small 4        34.5        100.0
17  Gemini 3 Flash         32.7        100.0
18  Grok 4.3               31.0        90.0
19  DeepSeek V4 Pro        29.3        100.0
20  DeepSeek V4 Flash      27.7        100.0
21  Gemini 3.1 Flash Lite  26.0        100.0
22  Mistral Medium 3.5     19.8        100.0
23  Mistral Large 3        11.4        100.0