January 15, 2026

Why the Next War Will Be Simulated Before It Is Fought


The age of wargaming as speculation is over. We are entering the age of wargaming as prediction, where large language models enable us to run the same geopolitical crisis thousands of times, observe where escalation is most likely to occur, and test interventions before a single shot is fired.

The question is no longer whether we should simulate conflict with AI. The question is whether we can afford not to.

[[divider]]

The Case for Pre-War Simulation

Wargaming has always been central to military planning. The Prussian General Staff used Kriegsspiele to rehearse the Schlieffen Plan before 1914. RAND analysts used analog games throughout the Cold War to stress-test nuclear strategy. But these exercises suffered from the same constraint that made them valuable: they required expert humans, and humans can only play so many hands before fatigue sets in. A typical military wargame runs a handful of iterations before conclusions are drawn. The result is high confidence in a narrow slice of outcome space.

We now have the ability to change this fundamentally. Large language models can operate as autonomous nation-state agents, capable of reasoning about military doctrine, diplomatic signaling, economic coercion, and nuclear red lines. They can play thousands of rounds of the same scenario, generating distributions of outcomes rather than point predictions. More importantly, they can play scenarios that have never occurred in history, allowing us to explore counterfactuals: What if Germany had a different economic structure in 1938? What if communication channels were more robust in October 1962? What if both sides had asymmetric information about each other's resolve?

The academic evidence for this approach has crystallized rapidly. Hogan and Brennen's Snow Globe framework (2024) introduced the first systematic method for automating qualitative wargames at scale, enabling researchers to run the same game dozens or hundreds of times and extract outcome distributions. Hua et al.'s WarAgent model (2024) went further, simulating country-level agents across historical scenarios from the Warring States period through World War II, demonstrating that LLM agents could reproduce historical decision patterns well enough to enable genuine counterfactual reasoning. These weren't toy models. They were systems that could be interrogated to understand why wars happened and, by extension, which conditions make wars more or less likely to occur.

This matters because the speed of modern conflict has compressed the window for learning from mistakes. In the Cold War, crises unfolded over days or weeks, giving humans time to deliberate and de-escalate. The Ukraine conflict operates at the speed of drone footage and social media. A U.S.-China conflict over Taiwan would compress further still. We cannot afford to learn wargaming lessons in real time. We need to learn them now, in simulation, where the only casualty is an error in reasoning.

[[divider]]

The Escalation Problem We Can't Ignore

But here is where the consensus breaks down, and where RLTX's architecture becomes essential: LLMs, when given the authority to make military decisions autonomously, escalate.

The Stanford team led by Lamparth, Corso, and Schneider (2024) conducted a head-to-head comparison between 214 national security experts and GPT-3.5 and GPT-4 in a simulated U.S.-China Taiwan Strait crisis. The experts were vetted practitioners: diplomats, military officers, intelligence analysts. The models were state-of-the-art language models given the same scenario briefs and decision points. What they found was stark: the LLMs exhibited systematic escalatory bias. They were more aggressive in their military posturing, quicker to threaten force, and more prone to what the researchers called "farcical harmony" in team simulations, where the models would align on a consensus position without the deliberation and devil's advocacy that human teams naturally produce.

This was not a fringe finding. Rivera et al. replicated and extended it across five different LLMs in a series of turn-based escalation games (2024). Every single model escalated over 14 turns, even when starting from neutral positions. Some exhibited arms-racing behavior, moving missiles forward or increasing alert levels without provocation. In one instance, a model deployed a nuclear weapon in a scenario where both sides had agreed to negotiate. The researchers were explicit: autonomous LLM decision-making in military contexts represents a serious escalation risk.

The mechanisms are clearer now. Language models are trained to be helpful, harmless, and honest. But when prompted to roleplay as a nation-state decision-maker facing an existential threat, they interpret "helpful" as "effective at securing the nation's interests." They interpret "harmless" in a domain-specific way: harm to their simulated population, not harm in the form of global escalation. And "honest" means they'll say what they think a rational nation-state would do, without the friction that comes from tenure reviews, accountability to Congress, or the visceral knowledge of what nuclear weapons actually do to cities.

The problem goes deeper. In real military decision-making, there are costs to being wrong. A general who escalates unnecessarily damages his career, his country's international standing, and potentially his own life. An LLM has none of these costs embedded in its objective function. It optimizes for what it predicts a rational actor would do under the stated conditions, without internalizing the asymmetric risk of escalation. Rational-actor theory, which is baked into most wargaming frameworks, is a useful model for understanding state behavior. But it is not sufficient, because humans are not purely rational actors. We are risk-averse when it comes to existential threats. Our rationality is bounded and shaped by historical experience.

The Chinese research team behind WGSR-Bench (Yin et al., 2024) found that this gap in complex strategic reasoning is measurable and persistent. They developed a benchmark with 1,200+ question-answer pairs grounded in the S-POE cognitive framework, testing both LLMs and human experts on complex strategic scenarios. The performance gap was significant. LLMs struggled with scenarios that required long-term thinking, multi-step causal reasoning about opponent behavior, and the ability to distinguish between commitments that are credible and those that are bluffs. Drinkall's recent analysis at the Oxford Internet Institute (2024) benchmarked GPT-4o, Gemini-2.5, and LLaMA-3.1 across 90 multi-turn crisis simulations using metrics grounded in international humanitarian law, and found concerning patterns in targeting behavior and civilian casualty calculations.

Here's why this matters for wargaming: if your simulation is generating unrepresentative escalation behavior, your conclusions about policy are biased. You are not predicting how real nation-states would behave; you are predicting how unanchored optimization algorithms would behave if given absolute power and no downside risk. That is not a useful baseline for strategic planning.

[[divider]]

Why Interventions Work, and Why They Matter

The good news is that escalation bias is not a property of LLMs themselves. It is a property of how we deploy them. Elbaum and Panter (2024) demonstrated that simple interventions, applied at the prompt level or scenario level, substantially reduce escalatory behavior without requiring any model retraining or architectural changes. By constraining the decision space, making consequences explicit, and ensuring that models engage with uncertainty honestly, you can move behavior closer to expert human baselines. The interventions are not foolproof, but they work.
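To make the idea concrete, here is a minimal sketch of what a prompt-level intervention of this kind might look like. The function name, wording, and action labels are all hypothetical, not taken from Elbaum and Panter's paper; the point is only to show the pattern of constraining the decision space and making consequences explicit before the model chooses.

```python
def build_constrained_prompt(scenario, allowed_actions, consequences):
    """Hypothetical prompt-level intervention: narrow the decision space and
    make the likely consequence of each option explicit before the model acts."""
    lines = [scenario, "Choose exactly one of the following actions:"]
    for action in allowed_actions:
        lines.append(f"- {action} (likely consequence: {consequences[action]})")
    lines.append("Before deciding, state your uncertainty about the other side's intent.")
    return "\n".join(lines)

prompt = build_constrained_prompt(
    "Blue and Red naval forces are at heightened alert in a maritime standoff.",
    ["open_backchannel", "hold_position"],
    {"open_backchannel": "signals restraint, low escalation risk",
     "hold_position": "maintains status quo, moderate miscalculation risk"},
)
print(prompt)
```

The model never sees an unconstrained "what would you do?" question; it sees a bounded menu with stated costs, which is exactly the shape of intervention the research found effective.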

This is where we see the outline of a defensible approach to LLM-based wargaming. You cannot ignore escalation risk by hoping it will go away. You cannot sidestep it by using a weaker model. You have to build it into your architecture from the ground up.

At RLTX, our five-layer architecture is designed around this insight. Layer 1 is our natural language scenario builder, which lets analysts specify crisis conditions, initial positions, and constraints in human language rather than code. This layer feeds into Layer 2, our multi-agent behavioral engine, built on the Stanford Generative Agents paradigm. This layer represents each nation-state, faction, or organization as a set of explicit goals, constraints, capabilities, and decision rules. But here is the crucial part: these constraints are not optional or advisory. They are hard bounds on behavior. A nation-state agent cannot take an action that violates its stated doctrine or its political constraints, just as a real nation-state cannot credibly claim it will use nuclear weapons in a conventional conflict without undermining its own deterrence framework.
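The distinction between advisory and hard constraints can be sketched in a few lines. This is an illustrative toy, not RLTX's actual implementation: the class and attribute names are invented, and a real behavioral engine would model doctrine far more richly. The essential property is that filtering happens before any decision policy runs.

```python
from dataclasses import dataclass

@dataclass
class NationAgent:
    """Illustrative sketch: doctrine is a hard bound, not advisory guidance."""
    name: str
    doctrine: frozenset  # actions the agent's stated doctrine permits

    def legal_actions(self, proposed):
        # The filter runs before any decision policy: an action outside
        # doctrine can never be selected, no matter what a model proposes.
        return [a for a in proposed if a in self.doctrine]

blue = NationAgent("Blue", frozenset({"negotiate", "sanction", "mobilize"}))
print(blue.legal_actions(["negotiate", "nuclear_strike", "sanction"]))
# → ['negotiate', 'sanction']
```

Because the out-of-doctrine action is removed from the choice set rather than merely discouraged in the prompt, no amount of escalatory bias in the underlying model can produce it.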

Layer 3 is our Monte Carlo execution engine. We run the same scenario hundreds or thousands of times, with stochasticity injected at key decision points. This generates outcome distributions rather than point predictions. You see not just the most likely path, but the full range of possible outcomes and the conditions that lead to each one.
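The mechanics of the Monte Carlo pattern are simple, even though the real engine is not. The sketch below uses a stand-in scenario function with invented dynamics (a tension counter and an assumed escalation probability) purely to show how repeated seeded runs yield a distribution over outcomes instead of a single answer.

```python
import random
from collections import Counter

def run_scenario(seed, escalation_bias=0.15):
    """One hypothetical run: tension moves at stochastic decision points;
    reaching a tension of three ends the run in 'war'."""
    rng = random.Random(seed)
    tension = 0
    for _turn in range(10):
        if rng.random() < escalation_bias + 0.05 * tension:
            tension += 1
        else:
            tension = max(0, tension - 1)
        if tension >= 3:
            return "war"
    return "settlement"

runs = 1000
outcomes = Counter(run_scenario(seed) for seed in range(runs))
distribution = {k: v / runs for k, v in sorted(outcomes.items())}
print(distribution)  # a distribution over outcomes, not a point prediction
```

Seeding each run makes the whole distribution reproducible, which matters when analysts later need to replay an individual trajectory to understand it.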

Layer 4 is our analytics and insight layer, and this is where the causal reasoning happens. We do not just tell you that war occurred in 40% of runs. We trace the causal chain backward. We identify the specific decision points where the outcome was determined. We show you which intelligence gaps, misunderstandings, or commitment failures were most consequential. We show you where an alternative signal or a different constraint would have changed the trajectory. This is what separates high-value strategic wargaming from low-value pattern-matching.
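One simple form of backward tracing is to log every decision in every run and then measure how strongly a choice at a given point predicts the final outcome. The sketch below is a statistical toy under assumed dynamics, not RLTX's causal engine; real causal attribution requires more than conditional frequencies, but the logging-then-tracing structure is the same.

```python
import random
from collections import defaultdict

def run_with_trace(seed):
    """Hypothetical run that logs every decision so outcomes can be traced back."""
    rng = random.Random(seed)
    trace, tension = [], 0
    for turn in range(6):
        escalated = rng.random() < 0.3
        trace.append((turn, escalated))
        tension = tension + 1 if escalated else max(0, tension - 1)
    return ("war" if tension >= 3 else "settlement"), trace

# Backward pass: how strongly does escalating at each turn predict the outcome?
stats = defaultdict(lambda: [0, 0])  # turn -> [wars after escalating, escalations]
for seed in range(2000):
    outcome, trace = run_with_trace(seed)
    for turn, escalated in trace:
        if escalated:
            stats[turn][0] += outcome == "war"
            stats[turn][1] += 1

for turn, (wars, total) in sorted(stats.items()):
    print(f"turn {turn}: P(war | escalated at this turn) = {wars / total:.2f}")
```

In this toy, late-turn escalations should correlate more strongly with war than early ones, because early escalations can still be walked back; that is exactly the kind of decision-point ranking an analytics layer surfaces.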

Layer 5 is our classified deployment capability. Once we have validated a simulation framework in the research environment, we can deploy it into classified networks where analysts can run scenarios with real operational data, real intelligence, and real classification markings. This is essential for defense applications, because the scenarios that matter most are also the ones you cannot discuss in open literature.

[[divider]]

The Architecture as Risk Management

The key innovation in how we handle escalation is that it runs through all five layers. Layer 1 forces analysts to make constraints explicit during scenario design, rather than hoping they will be respected. Layer 2 encodes those constraints directly into agent decision rules, turning them from soft guidelines into hard limits. Layer 3 uses Monte Carlo sampling to find the edge cases where constraints might be violated, surfacing these to analysts before they become policy problems. Layer 4 analyzes why violations occur when they do, and whether they reflect genuine strategic incentives or artifacts of prompt phrasing. Layer 5 ensures that whatever we discover in research can actually be deployed where it is needed, without loss of fidelity or months of translation into different software systems.

This is not a claim that we have solved the escalation problem. We have not. But we have structured our approach around the empirical reality that Lamparth et al. and Rivera et al. documented: escalation bias is not a peripheral issue or a curiosity for researchers to study. It is a core failure mode that any serious wargaming platform must address directly and systematically.

The alternative is to keep using analog games, where a dozen experts play one iteration of a scenario and draw conclusions. There is value in analog wargaming; the human judgment and institutional knowledge are irreplaceable. But analog games do not scale. They are expensive. They take weeks to run. And they are blind to outcome space. If there is a critical scenario you did not think to game, you will not discover it until you face it in reality.

[[divider]]

Speed and Scale as Strategic Advantages

The other reason simulation matters now is scale and speed. Black and Darken at the Naval Postgraduate School (2024) developed hierarchical reinforcement learning methods for wargaming complexity and made an explicit case to the Department of Defense that AI-enabled wargaming is not a luxury or a research project, but an investment priority. The reasoning is straightforward: in a fast-moving crisis, the decision-maker who has explored the most scenarios and tested the most interventions has an advantage. They have priors that are more accurate. They have identified critical decision points earlier. They know which signals will be misinterpreted and which will land effectively. They have already asked the hard questions.

This is not theoretical. In the 2020 war between Azerbaijan and Armenia, the side that better understood drone tactics and had rehearsed novel applications of them had a decisive advantage. The conflict compressed into a few weeks and was largely determined by pre-war preparation and understanding. Neither side had time to learn from initial setbacks the way militaries did in World War I. The information environment was too transparent; once a tactic was revealed, it was globally known within hours. The only advantage was having thought through the problem space more thoroughly before the shooting started.

We can now run 1,000 simulations of a Taiwan conflict in the time it takes to run one analog wargame. Each of those simulations can be instrumented to show you exactly where decision-makers are making different assumptions about the other side's intentions. Each can be analyzed to show which intelligence gaps are most consequential. Each can be used to test whether a specific policy change, a different communication, or a structural constraint changes the trajectory.

The constraint is not computational anymore. Jinghua Piao's AgentSociety framework (Tsinghua, 2024) demonstrated that you can scale multi-agent simulations to 10,000+ agents with 5 million interactions using Ray distributed computing. The bottleneck is not the infrastructure. It is the problem of making sure your simulations remain grounded in reality and remain useful for decision-making.
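The fan-out-and-aggregate pattern behind that kind of scaling is straightforward. AgentSociety uses Ray for cluster-scale distribution; the sketch below substitutes a standard-library thread pool so it stays dependency-free, and the `simulate` function is a trivial stand-in for a real agent episode.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def simulate(seed):
    """Hypothetical single run; in practice this would drive one agent episode."""
    rng = random.Random(seed)
    return "war" if rng.random() < 0.4 else "settlement"

# Fan the runs out across workers, then aggregate into one distribution.
# Cluster-scale deployments (e.g. AgentSociety's 10,000+ agents) would use a
# distributed framework such as Ray; threads keep this sketch stdlib-only.
with ThreadPoolExecutor(max_workers=8) as pool:
    outcomes = Counter(pool.map(simulate, range(10_000)))

print(sum(outcomes.values()))  # 10000 runs aggregated
```

Because each run is keyed by its seed, the work distributes trivially: no run depends on another, so the same code scales from a laptop thread pool to a cluster scheduler without changing the aggregation logic.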

[[divider]]

Why This Matters

We started with a bold claim: the next war will be simulated before it is fought. This is not aspirational. It is descriptive of where strategic planning is headed, whether we like it or not. The question is whether that simulation will be rigorous, grounded in evidence about how LLMs actually behave, and structured to surface the most consequential insights, or whether it will be sloppy, overconfident, and blind to escalation risks.

We believe the path forward requires what we are building: a complete architecture that takes the empirical findings from Lamparth, Rivera, Elbaum, and others seriously. That means building escalation awareness directly into the behavioral model rather than hoping it will sort itself out. That means running enough simulations to see outcome distributions rather than relying on expert intuition. That means having causal audit trails so analysts can see exactly why a scenario unfolded as it did and what would need to change to produce a different outcome.

The world is not going to stop using LLMs for wargaming because researchers published papers showing they escalate. If anything, the pressure to deploy them faster is increasing. Nations that can simulate their way to better strategic understanding will move faster than those that rely on traditional methods. The only way to manage this transition safely is to build the right infrastructure now, before the alternative is irrelevant.

That is what we are here to do. The war will be simulated. We are building the simulation platform that makes that process rigorous, transparent, and useful for actual defense decision-making.
