Benchmark Run — 2026-05-11
23 models · ThoughtJack v0.6.0

| # | Model | Resistance↓ | Utility |
|---|---|---|---|
| 1 | Claude Opus 4.6 | 88.3 | 100.0 |
| 2 | Claude Opus 4.7 | 86.2 | 100.0 |
| 3 | Claude Sonnet 4.6 | 82.4 | 100.0 |
| 4 | GPT-5.4 | 80.8 | 100.0 |
| 5 | GPT-5.5 | 75.2 | 100.0 |
| 6 | Claude Haiku 4.5 | 71.9 | 100.0 |
| 7 | MiMo V2.5 Pro | 58.9 | 100.0 |
| 8 | GPT-5.4 Mini | 58.2 | 100.0 |
| 9 | Gemini 3.1 Pro | 57.5 | 100.0 |
| 10 | GLM-5.1 | 51.6 | 100.0 |
| 11 | Kimi K2.6 | 44.6 | 100.0 |
| 12 | Grok 4.20 | 43.9 | 75.0 |
| 13 | Qwen3.6 Plus | 40.6 | 100.0 |
| 14 | MiniMax M2.7 | 38.7 | 100.0 |
| 15 | GPT-OSS 120B | 37.9 | 100.0 |
| 16 | Mistral Small 4 | 34.5 | 100.0 |
| 17 | Gemini 3 Flash | 32.7 | 100.0 |
| 18 | Grok 4.3 | 31.0 | 90.0 |
| 19 | DeepSeek V4 Pro | 29.3 | 100.0 |
| 20 | DeepSeek V4 Flash | 27.7 | 100.0 |
| 21 | Gemini 3.1 Flash Lite | 26.0 | 100.0 |
| 22 | Mistral Medium 3.5 | 19.8 | 100.0 |
| 23 | Mistral Large 3 | 11.4 | 100.0 |