Why the Next War Will Be Won in Simulation

[[divider]]
1. In the last three years, something shifted.
Large language models crossed a threshold where they generate reasoning that's hard to distinguish from a human expert's. Not always right, but coherent, contextual, and fast. You can now spin up a thousand agents that think through problems the way people do, run them in parallel, and watch what emerges.
This happened because companies wanted better chatbots and search engines. The defense applications were an afterthought, or not a thought at all.
But here's what those capabilities actually enable: you can simulate how an adversary's command structure might respond to a particular move. Not one guess, but ten thousand variations, each with slightly different assumptions about risk tolerance, domestic political pressure, intelligence picture, and commander personality. You can model how a civilian population responds to an operation, how information spreads through social networks, where the tipping points are. You can find the scenarios where everything goes wrong in ways nobody on your planning staff imagined.
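To make that concrete, here's a minimal sketch of what "ten thousand variations" means mechanically: sample the assumptions you're uncertain about, push each draw through a response model, and look at the distribution of outcomes. Every parameter name and number below is illustrative, and the decision rule is a toy stand-in for a real agent-based simulation.

```python
import random
from dataclasses import dataclass

@dataclass
class AdversaryAssumptions:
    """One draw of the assumptions we're uncertain about (all names hypothetical)."""
    risk_tolerance: float        # 0 = extremely cautious, 1 = accepts high escalation risk
    domestic_pressure: float     # political pressure to respond visibly
    intel_quality: float         # how accurate the adversary's picture of our move is
    commander_aggression: float  # individual personality factor

def sample_assumptions(rng: random.Random) -> AdversaryAssumptions:
    # Each variation perturbs the assumptions slightly; in a real system these
    # distributions would come from area experts and intelligence analysis.
    return AdversaryAssumptions(
        risk_tolerance=rng.betavariate(2, 3),
        domestic_pressure=rng.betavariate(2, 2),
        intel_quality=rng.uniform(0.4, 1.0),
        commander_aggression=rng.betavariate(2, 4),
    )

def simulate_response(a: AdversaryAssumptions) -> str:
    """Toy decision rule standing in for a full simulation of the command structure."""
    escalation_drive = (0.5 * a.risk_tolerance
                        + 0.3 * a.domestic_pressure
                        + 0.2 * a.commander_aggression)
    if a.intel_quality < 0.5 and escalation_drive > 0.5:
        return "miscalculated escalation"   # the corner case nobody planned for
    if escalation_drive > 0.6:
        return "deliberate escalation"
    if escalation_drive > 0.35:
        return "proportional response"
    return "de-escalation"

rng = random.Random(0)
outcomes: dict[str, int] = {}
for _ in range(10_000):
    outcome = simulate_response(sample_assumptions(rng))
    outcomes[outcome] = outcomes.get(outcome, 0) + 1

for outcome, count in sorted(outcomes.items(), key=lambda kv: -kv[1]):
    print(f"{outcome:25s} {count / 10_000:.1%}")
```

The point isn't the percentages. It's that the dangerous outcomes show up as a measurable slice of the distribution instead of a scenario nobody wrote down.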
The country that builds this well will make better decisions under uncertainty than the country that doesn't. That advantage compounds. Better decisions lead to better outcomes lead to more resources lead to better decisions.
This isn't a marginal improvement. It's a structural advantage in the thing that matters most: not being surprised.
[[divider]]
2. Everyone in defense knows this, at least abstractly.
The 2023 Defense Science Board report said it clearly. DARPA has programs. The service labs publish papers. The primes put it in their pitch decks.
And yet.
Here's what actually happens: A research lab builds something impressive in a controlled environment. They publish a paper. A prime reads the paper, builds a demo, shows it to a program manager. The program manager writes a requirements document. The requirements document describes a system that will exist in five years. The five-year timeline slips. The research moves on. The demo gets stale. Everyone stays busy. Nothing ships.
I've watched this cycle for years. The gap between what's possible in research and what's deployed in operations is not shrinking. If anything, it's growing, because the research is accelerating and the acquisition system isn't.
The problem isn't that nobody's working on this. Lots of people are working on this. The problem is that the work is fragmented across organizations that don't talk to each other, optimizing for metrics that don't matter, building systems that won't integrate with anything.
The gap is not technical. The gap is organizational. Someone has to own the whole problem: take the research, make it work at scale, connect it to operational questions, and deliver something a commander would actually use.
[[divider]]
3. Let me be specific about what's hard.
First: adversary modeling. The whole point of simulation is to face an opponent that surprises you, that exploits weaknesses you didn't know you had. If the simulated adversary thinks like you do, you learn nothing. You're just playing chess against yourself.
Real adversaries have different value functions. A Chinese theater commander operates under constraints that have nothing to do with Western military logic: party politics, career incentives, a different theory of escalation, a different information environment. Russian doctrine treats certain thresholds differently than we do. Iranian decision-making is shaped by regime survival logic that doesn't map onto our frameworks.
Building agents that actually embody these differences, that don't just have Chinese names but make decisions the way Chinese commanders make decisions, is hard. It requires area expertise, intelligence analysis, and careful validation against historical cases. Most simulations skip this and end up with American officers in Chinese uniforms.
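One way to see the difference is to make the value function explicit. The sketch below scores the same menu of options under two invented doctrine profiles; the weights, option names, and numbers are placeholders, but the structural point holds: change what the commander is optimizing for and the chosen course of action changes with it.

```python
from dataclasses import dataclass

@dataclass
class DoctrineProfile:
    """Hypothetical weights for what a commander's decisions optimize.
    The point is that the weights differ, not that these numbers are right."""
    military_objective: float
    regime_or_party_standing: float
    career_risk_aversion: float
    escalation_threshold: float  # how much escalation risk is tolerated at all

def score_option(profile: DoctrineProfile, option: dict) -> float:
    """Score a candidate course of action under a given doctrine profile."""
    if option["escalation_risk"] > profile.escalation_threshold:
        return float("-inf")  # the option is off the table under this doctrine
    return (profile.military_objective * option["military_value"]
            + profile.regime_or_party_standing * option["political_value"]
            - profile.career_risk_aversion * option["personal_risk"])

# The same menu of (invented) options...
options = [
    {"name": "blockade", "military_value": 0.6, "political_value": 0.8,
     "personal_risk": 0.3, "escalation_risk": 0.5},
    {"name": "strike", "military_value": 0.9, "political_value": 0.4,
     "personal_risk": 0.7, "escalation_risk": 0.8},
    {"name": "posture", "military_value": 0.2, "political_value": 0.6,
     "personal_risk": 0.1, "escalation_risk": 0.2},
]

# ...evaluated under two different value functions yields different choices.
mirror_image = DoctrineProfile(1.0, 0.1, 0.1, 0.9)  # American officer in a foreign uniform
constrained = DoctrineProfile(0.5, 0.9, 0.6, 0.6)   # political standing and career risk dominate

for label, profile in [("mirror image", mirror_image), ("constrained", constrained)]:
    best = max(options, key=lambda o: score_option(profile, o))
    print(f"{label:13s} -> {best['name']}")
```

Getting the weights and constraints right is where the area expertise and validation come in; the code is the easy part.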
Second: population dynamics. Military operations happen in human terrain. Civilian response to operations matters enormously: for legitimacy, for intelligence, for the political sustainability of the campaign. But populations under wartime stress don't behave like consumers. Fear spreads differently than brand preferences. A single incident can cascade into mass behavior change in ways that are non-linear and hard to predict.
The mathematics of this are genuinely different from market research. Small samples don't generalize. Historical patterns break. The system is path-dependent in ways that make prediction hard and preparation essential.
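A threshold cascade model, in the spirit of Granovetter's classic work, shows the shape of the problem. The numbers below are invented, but the behavior is the point: below the tipping point an incident stays contained, and just past it the same incident moves most of the population.

```python
import random

def cascade_fraction(n_people: int, incident_severity: float, rng: random.Random) -> float:
    """Threshold cascade: each person changes behavior once the fraction of
    others who already have exceeds their personal threshold."""
    # Thresholds drawn around 0.3: most people need to see roughly a third of
    # their peers react before they do. Illustrative numbers only.
    thresholds = [rng.gauss(0.3, 0.1) for _ in range(n_people)]
    # The incident itself pushes some people over directly.
    active = sum(1 for t in thresholds if t < incident_severity)
    while True:
        pressure = max(incident_severity, active / n_people)
        new_active = sum(1 for t in thresholds if t < pressure)
        if new_active == active:
            return active / n_people
        active = new_active

rng = random.Random(1)
# Near the tipping point, a small change in severity flips the outcome from a
# contained reaction to a population-wide one -- the non-linearity in question.
for severity in (0.10, 0.15, 0.20, 0.25, 0.30):
    runs = [cascade_fraction(2_000, severity, rng) for _ in range(20)]
    print(f"incident severity {severity:.2f} -> mean cascade size {sum(runs) / len(runs):.0%}")
```

Notice that no amount of averaging over the small-severity cases would have told you what happens just past the threshold. That's what "historical patterns break" means in practice.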
Third: environment fidelity. Most simulation failures aren't agent failures. They're environment failures. The simulated world doesn't push back hard enough. Logistics work too smoothly. Communications don't degrade. Sensors don't fail. The friction that Clausewitz wrote about, the thousand small things that make real operations hard, gets abstracted away.
And then the simulation tells you something will work, and it doesn't work, and you've learned the wrong lessons.
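The friction point is easy to demonstrate even in a toy model. The sketch below compares a frictionless resupply plan with one where comms, maintenance, and sensors fail at plausible-looking but entirely made-up rates; the gap between the two runs is the gap between the briefing and the operation.

```python
import random

def resupply_success_rate(days: int, friction: bool, rng: random.Random) -> float:
    """Toy model: what fraction of scheduled daily resupply runs actually arrive.
    All failure probabilities below are illustrative placeholders, not calibrated values."""
    arrived = 0
    for _ in range(days):
        if friction:
            if rng.random() < 0.10:   # comms degradation: the tasking never gets through
                continue
            if rng.random() < 0.15:   # maintenance / route friction: convoy doesn't arrive today
                continue
            if rng.random() < 0.05:   # sensor failure: escort misses a threat, convoy turns back
                continue
        arrived += 1
    return arrived / days

rng = random.Random(2)
print(f"frictionless: {resupply_success_rate(90, False, rng):.0%} of convoys arrive")
rough = [resupply_success_rate(90, True, rng) for _ in range(200)]
print(f"with friction: {sum(rough) / len(rough):.0%} on average, worst run {min(rough):.0%}")
```

A real environment model needs far more than three coin flips, but the direction of the error is the same: leave the friction out and every plan looks executable.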
Fourth: validation. How do you know if a simulation is teaching you something true? You can't run the real operation to check. You can compare against historical cases, but history doesn't repeat cleanly. You can consult experts, but experts have biases and blind spots.
The honest answer is that you can't know for certain. What you can do is stress-test aggressively: red team the assumptions, vary the parameters, look for fragility. If the simulation's conclusions are robust across a wide range of reasonable assumptions, you can have some confidence. If they flip based on small changes, you know you're on thin ice.
This demands a kind of epistemic humility that's uncomfortable for organizations that want clear answers.
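In practice, the stress test can be as simple as sweeping every uncertain assumption across a reasonable range and checking whether the recommendation survives. The sketch below uses an invented scoring function as a stand-in for a full simulation run; the pattern of the output is what matters, not the numbers.

```python
import itertools

def recommend(assumptions: dict) -> str:
    """Stand-in for a full simulation run: returns which course of action
    scores best under one set of assumptions. Entirely illustrative."""
    score_a = (0.6 * assumptions["adversary_risk_tolerance"]
               + 0.4 * assumptions["logistics_reliability"])
    score_b = (0.5 + 0.3 * assumptions["coalition_cohesion"]
               - 0.2 * assumptions["adversary_risk_tolerance"])
    return "option_a" if score_a > score_b else "option_b"

# Sweep each uncertain assumption across a range instead of trusting one point estimate.
grid = {
    "adversary_risk_tolerance": [0.2, 0.4, 0.6, 0.8],
    "logistics_reliability": [0.5, 0.7, 0.9],
    "coalition_cohesion": [0.3, 0.6, 0.9],
}

counts: dict[str, int] = {}
for values in itertools.product(*grid.values()):
    rec = recommend(dict(zip(grid.keys(), values)))
    counts[rec] = counts.get(rec, 0) + 1

total = sum(counts.values())
for option, count in sorted(counts.items()):
    print(f"{option}: recommended in {count}/{total} assumption combinations")
# If one option dominates across the sweep, the conclusion is robust;
# if the split is close to even, small changes in assumptions flip the answer.
```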
[[divider]]
4. The good news is the research has made real progress.
The WarAgent work out of Georgia Tech showed that LLM-based agents, given historical context and country-specific prompts, reproduce escalation dynamics that match actual historical conflicts. The agents in their World War I simulation made moves that historians recognized as plausible, and the emergent interactions produced paths to war that looked like the real paths to war.
This isn't prediction. It's scenario generation at scale. Instead of a handful of scenarios dreamed up by planners, you get thousands, with the agents finding corner cases humans missed.
The Naval Postgraduate School work on hierarchical architectures solved a problem that's blocked this field for years: how do you scale agents to operational complexity without the state space exploding? Their answer is the same way militaries actually work: nested hierarchies, with different levels of abstraction at each layer. Strategic agents set objectives. Operational agents allocate resources. Tactical agents execute. Each layer only sees what it needs to see.
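Here is a toy version of the nested-hierarchy idea, not the NPS architecture itself: three layers, each seeing only its own slice of the state and passing intent, not detail, downward. All class names, fields, and numbers are invented.

```python
from dataclasses import dataclass

@dataclass
class TacticalAgent:
    unit: str
    def execute(self, task: str, local_state: dict) -> str:
        # Sees only local state (own supply, contact reports), never the theater picture.
        if local_state["supply"] < 0.3:
            return f"{self.unit}: hold and resupply before executing '{task}'"
        return f"{self.unit}: executing '{task}'"

@dataclass
class OperationalAgent:
    subordinates: list
    def allocate(self, objective: str, theater_state: dict) -> list[str]:
        # Sees aggregated theater state, not individual unit detail.
        if objective == "deter":
            task = "screen the flank" if theater_state["enemy_axis"] == "north" else "hold phase line"
        else:
            task = "advance on the northern axis"
        return [s.execute(task, theater_state["local"][s.unit]) for s in self.subordinates]

@dataclass
class StrategicAgent:
    operations: OperationalAgent
    def direct(self, world_state: dict) -> list[str]:
        # Sets the objective from the strategic picture only; never touches tactical detail.
        objective = "deter" if world_state["escalation_risk"] > 0.5 else "compel"
        return self.operations.allocate(objective, world_state["theater"])

world_state = {
    "escalation_risk": 0.7,
    "theater": {
        "enemy_axis": "north",
        "local": {"TF-1": {"supply": 0.8}, "TF-2": {"supply": 0.2}},
    },
}

command = StrategicAgent(OperationalAgent([TacticalAgent("TF-1"), TacticalAgent("TF-2")]))
for report in command.direct(world_state):
    print(report)
```

The state space stays manageable because no layer reasons over the full joint state: the strategic agent never sees a supply level, and the tactical agents never see escalation risk.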
Frazer-Nash, working with BAE, showed that multi-agent reinforcement learning can discover tactics that beat human planners. Their AI red teams achieved win rates above 85% against notionally superior forces by finding approaches humans hadn't considered. Some of these were novel. Some were rediscoveries of historical tactics that had been forgotten.
The pieces are there. What's missing is integration.
[[divider]]
5. We started RLTX because we kept seeing the same pattern:
Organizations with urgent problems, research that could solve them, and no one connecting the two.
Our model is the mission. A client comes to us with a specific problem. Maybe they're deploying an AI system into a high-stakes environment and need to know where it will fail. Maybe they're planning an operation and need to see scenarios their staff hasn't imagined. Maybe they're choosing between capabilities and need an honest comparative assessment.
We scope the mission, assemble the team, run the work, and deliver something that changes decisions. Not a report that sits on a shelf. Not a demo that impresses visitors. Something that actually gets used.
We've done this for frontier AI labs that needed to stress-test agent systems before deployment. For defense programs that needed evaluation frameworks that would actually find failures. For organizations that needed answers in weeks instead of years.
The multi-agent simulation problem is the next mission. The research exists. The need is obvious. The integration work is what we're built for.
[[divider]]
If you're working on this, at a lab or a prime or a program office or a combatant command, we'd like to talk.
Not to pitch. To compare notes. To hear what you're seeing, what's working, what isn't. To figure out if there's a mission worth running together.
The war before the war is already being fought. The question is whether we're in it.
RLTX builds AI systems for teams where failure isn't an option. We design missions, assemble researchers and engineers, and deliver wargaming, evaluations, and mission software that works.