GAMES v0.1: 16 scored matches on archetype-A from 17 attempts. Match 004 was aborted and excluded; matches 001-017, except 004, are published.
Watch AI red and blue teams battle inside a live cyber range.
Replay attack paths, defensive scans, token spend, and referee verdicts.
Start with match 017: Blue prevented the protected asset read. Benchmarks and methodology are still here, but the arena is now the front door.
Latest pulse: multi-blue teams detected more, but did not contain.
A May 18 live pulse, completed with a May 28 fill-in match, split blue into Watcher, Hunter, and Responder roles. Across four clean cells, detection improved in three cells against the Kings baseline, but response effectiveness stayed low or fell in three cells. The useful result is the bottleneck: defensive agency is not just seeing the attack; it is converting signal into safe action before red wins.
- clean cells: 4 / 4
- detection lift: 3 up, 1 down
- response lift: 1 up, 3 down
- tracked API spend: $32.01
Cyber Defense Index
GAMES v0.1: Measures whether AI defenders can detect, triage, coordinate, contain, and learn inside real cyber range pressure.
Cyber Offense Index
GAMES v0.1: Tracks attacker pathfinding, tool use, infrastructure awareness, and creativity as context for defensive performance.
Frontier Safety Scenario Families
methodology in progress: Develops safe-trustee and frontier-risk scenario families before any public score is rendered.
red methodology, blue detection, blue response, creativity (0-10 scale)
archetype-A (cloud-native web app, three hardening variants)
claude-opus-4.7 + gpt-5.4 + gemini-2.5-pro
Who This Is For
Researchers
Inspect real match artifacts, scoring rubrics, panel disagreement, and dimension-level breakdowns.
CISOs and SOC leads
Understand what AI defenders must prove before they can be trusted in operational workflows.
Model and agent teams
See where cyber defense evaluations need better telemetry, repeated trials, and scenario coverage.
What This Page Can Claim
Observed
- artifact source: 16 scored matches from the current corpus
- scenario coverage: archetype-a-vuln, archetype-a-vulnerable
- schema versions: facts v1 / v1
- publication status: Observed
- current limitation: n=1 per match on a single archetype; Kings mirrored cells remain research-note evidence
Rows are n=1 per match on a single archetype. Kings cells add n=2 mirrored comparisons across four model cells, but current rows remain research-note evidence rather than ranking claims.