Thermodynamic Speed Limits on AI Training


Why Large-Scale Learning Cannot Be Arbitrarily Accelerated

Rob Merivale
Independent Researcher, Devon, UK
18 December 2025 • ~28 min read


Correspondence

Academic or technical correspondence regarding this paper may be directed to:
science@robmerivale


This paper is a preprint and has not yet undergone peer review.
It is published to invite critique, clarification, and interdisciplinary discussion.


Abstract

The rapid scaling of large language models has exposed a persistent empirical anomaly: training duration increases super-linearly with model size, even when available compute scales aggressively. Existing scaling laws successfully predict performance as a function of compute and data, yet remain largely silent on the temporal feasibility of training itself. In practice, training increasingly suffers from instability, coordination overhead, and failure, producing timelines that stretch into months or years for frontier systems.

This paper argues that training time is not an incidental engineering detail but a systems-level constraint arising from the need to form coherent internal structure under noise. Stability degradation and entropy accumulation impose hard limits on how quickly learning can proceed, and parallelism alone cannot overcome them. Overtraining, often criticised as wasteful, is argued to be a rational stability intervention rather than an inefficiency. The result is a reframing of AI scaling: optimisation speed is bounded not just by compute, but by the rate at which coherent structure can be stabilised and integrated. These constraints have direct implications for AI development strategy, safety, and governance.

The term “thermodynamic” is used here in a systems sense, referring to irreversibility, entropy production, and the cost of maintaining coherence under noise, rather than to fundamental physical limits on computation such as energy dissipation bounds.


1. The Blind Spot in Scaling Discourse

Modern discussions of AI scaling focus overwhelmingly on what additional compute can achieve. Model capability, benchmark performance, and parameter counts dominate the narrative. Far less attention is paid to how long those capabilities take to realise.

Training duration is not a secondary concern. It governs iteration speed, research velocity, deployment cadence, economic viability, and exposure to catastrophic failure during training. As training runs stretch longer, the probability of interruption, hardware fault, coordination failure, or architectural instability rises sharply.

Empirically, training time scales worse than linearly with model size. A tenfold increase in parameters routinely produces orders-of-magnitude increases in wall-clock duration, even when accompanied by massive increases in cluster size. This pattern appears robust across organisations, architectures, and training stacks.

Despite this, training time is often treated as a transient inconvenience—something that better hardware or improved engineering will eventually eliminate. This paper argues that such optimism is misplaced.


2. What Existing Scaling Laws Explain — and What They Do Not

Well-known scaling laws (e.g. Kaplan et al., 2020; Hoffmann et al., 2022) describe how model performance improves as a function of compute, data, and parameters. These frameworks have been remarkably successful at predicting capability.

However, they implicitly assume that training compute is fungible: if sufficient FLOPs are expended, performance will follow. Time enters only indirectly, as a logistical variable.
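Under that fungibility assumption, training time is simply total FLOPs divided by delivered throughput. The sketch below makes this explicit using the common C ≈ 6ND estimate of training FLOPs from the scaling-law literature; the cluster figures are illustrative assumptions, not measurements.

    # Wall-clock time implied by treating compute as fungible (illustrative only).
    # Uses the common C ~ 6*N*D estimate of training FLOPs; the cluster figures
    # below are assumptions chosen for illustration, not measurements.

    def naive_wall_clock_days(params, tokens, peak_flops_per_s, utilisation):
        total_flops = 6 * params * tokens                       # forward + backward estimate
        seconds = total_flops / (peak_flops_per_s * utilisation)
        return seconds / 86_400

    # Hypothetical 70B-parameter model, 1.4T tokens, 1e18 peak FLOP/s at 40% utilisation:
    print(naive_wall_clock_days(70e9, 1.4e12, 1e18, 0.40))      # roughly 17 days

Nothing in this arithmetic penalises scale: utilisation and fault exposure appear only as fixed inputs, which is precisely where the accounting diverges from practice.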

What these laws do not address is temporal feasibility. They do not explain why:

  • Hardware utilisation collapses at scale
  • Training runs become increasingly failure-prone
  • Checkpointing and recovery dominate wall-clock time
  • Models are trained far beyond compute-optimal prescriptions

These phenomena are not peripheral inefficiencies. They are symptoms of a deeper constraint that scaling laws, by design, abstract away.


3. Training as Coherence Formation Under Noise

Training a neural network is not merely the execution of an optimisation algorithm. It is the gradual formation of coherent internal structure in the presence of noise.

That noise arises from many sources: stochastic gradients, asynchronous updates, hardware faults, numerical instability, communication latency, and organisational coordination overhead. Learning progresses only when meaningful structure accumulates faster than these disruptive forces erode it.

Seen this way, training resembles other physical and biological processes in which coherence must be established and maintained against entropy. Compute supplies raw throughput, but throughput alone does not guarantee progress. What matters is whether updates reliably contribute to generalisable structure that persists over time.
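A toy accumulation-versus-erosion model makes the point concrete. Every quantity below is illustrative; the only feature that matters is that noise erodes a fraction of whatever structure already exists.

    # Toy model: each step adds a fixed amount of usable structure and
    # noise erodes a fixed fraction of what has already accumulated.
    # All numbers are illustrative, not measurements of any real system.

    def structure_after(steps, gain_per_step, erosion_rate):
        structure = 0.0
        for _ in range(steps):
            structure = structure * (1.0 - erosion_rate) + gain_per_step
        return structure

    print(structure_after(10_000, gain_per_step=1.0, erosion_rate=0.001))  # ~1000
    print(structure_after(10_000, gain_per_step=1.0, erosion_rate=0.010))  # ~100

The steady state is the gain divided by the erosion rate, so running more steps (more raw throughput) cannot compensate for a worse gain-to-erosion ratio; only stabilising the process raises the ceiling.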

This reframing places stability—not raw optimisation pressure—at the centre of the training problem.


4. Stability as a Bottleneck, Not a Tuning Problem

At small scales, instability is manageable. Training runs fail occasionally, gradients fluctuate, and performance wobbles, but recovery is cheap and progress continues.

At large scales, instability compounds. Failures interact across time and across distributed components. A single fault may not be fatal, but the probability that some fault occurs during a long-running, massively parallel training run approaches certainty.
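A back-of-the-envelope calculation shows how quickly "some fault occurs" becomes near-certain; the per-node fault rate below is an assumed figure for illustration.

    # Probability that at least one fault occurs during a run, assuming
    # independent faults at a fixed per-node-hour rate (illustrative assumption).

    def p_any_fault(nodes, hours, faults_per_node_hour):
        return 1.0 - (1.0 - faults_per_node_hour) ** (nodes * hours)

    # Assume one fault per 10,000 node-hours:
    print(p_any_fault(nodes=8,    hours=720, faults_per_node_hour=1e-4))  # ~0.44 over a month
    print(p_any_fault(nodes=4096, hours=720, faults_per_node_hour=1e-4))  # ~1.0, effectively certain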

Crucially, stability is not a single dimension. Training must be stable both instantaneously—so that gradients are locally meaningful—and temporally—so that accumulated progress survives interruptions, restarts, and perturbations. Failure in either dimension can stall or erase learning.

In practice, these failures compose multiplicatively. This is not a theoretical necessity but an observed behavioural pattern: small degradations in stability rapidly produce large degradations in effective learning rate.


5. Why Parallelism Stops Helping

Parallelism increases raw computational throughput. It also increases coordination overhead, communication latency, and exposure to failure.

As clusters grow, more effort is spent synchronising, validating, checkpointing, and recovering. Each additional node adds not only compute but also new pathways for instability to enter the system. Beyond a certain scale, added parallelism increases entropy faster than it increases useful learning signal.

The result is a hard ceiling on effective parallelism. Past this point, adding more hardware produces diminishing returns, then negative returns, as instability overwhelms throughput gains.
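A toy throughput model illustrates the ceiling. The functional form of the overhead, a synchronisation cost growing with log N and a failure-handling cost growing with N, is an assumption chosen for illustration, not a measured communication model.

    import math

    # Toy model: each node contributes one unit of raw compute, but a growing
    # share of every node's time is lost to synchronisation (grows with log N)
    # and to failure handling and recovery (grows with N). Coefficients are illustrative.

    def effective_throughput(nodes, sync_cost=0.03, recovery_cost=0.0001):
        useful_fraction = 1.0 - sync_cost * math.log2(nodes) - recovery_cost * nodes
        return nodes * max(0.0, useful_fraction)

    for n in [64, 512, 2048, 4096, 8192]:
        print(n, round(effective_throughput(n), 1))
    # Output rises, peaks in the low thousands of nodes, then collapses to zero.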

This explains why training durations explode even as aggregate FLOPs continue to rise.


6. Overtraining as a Rational Stability Strategy

From a purely computational perspective, large-scale models often appear massively overtrained. Chinchilla-style prescriptions minimise the compute required to reach a given loss, yet real systems routinely train far beyond the token budgets those prescriptions imply.
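The gap is easy to quantify with the rough "about 20 tokens per parameter" reading of the Hoffmann et al. results; the model and token figures below are hypothetical, chosen only to illustrate the scale of the excess.

    # Rough overtraining ratio relative to the ~20 tokens-per-parameter
    # heuristic commonly read out of Hoffmann et al. (Chinchilla).
    # The example figures are hypothetical and purely illustrative.

    def overtraining_ratio(params, tokens, tokens_per_param_optimal=20):
        return tokens / (tokens_per_param_optimal * params)

    # A hypothetical 8B-parameter model trained on 15T tokens:
    print(round(overtraining_ratio(8e9, 15e12)))   # ~94x the compute-optimal token budget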

This behaviour is not irrational. Overtraining serves as a stability intervention. Additional passes through data reduce gradient variance, flatten loss basins, and make learned structure more resilient to noise and perturbation. Overtraining lowers entropy per update, even as it increases total compute.

In effect, organisations are trading efficiency for robustness. They are paying a stability tax that current architectures and training regimes require.


7. Empirical Scaling Behaviour in Current Regimes

Observations from frontier systems suggest consistent patterns:

  • Stability degrades as model size and degree of parallelism increase
  • Entropy costs rise with coordination complexity
  • Effective learning efficiency declines sharply beyond certain scales

These behaviours often resemble power-law relationships within a given training regime. However, the specific exponents vary with architecture, optimiser, hardware, and organisational practice. They are empirical descriptions, not constants of nature.
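As a sketch of what "regime-specific exponent" means in practice, the snippet below fits a power law in log space to synthetic data generated from an assumed exponent. It shows how such an exponent would be estimated, not what its value is.

    import numpy as np

    # Synthetic data from an assumed power law with multiplicative noise,
    # used only to show how a regime-specific exponent would be estimated.
    rng = np.random.default_rng(0)
    scale = np.array([1e9, 3e9, 1e10, 3e10, 1e11])      # e.g. parameter counts
    assumed_exponent = 1.3
    slowdown = 2e-10 * scale**assumed_exponent * np.exp(rng.normal(0.0, 0.05, scale.size))

    # Fit log(slowdown) = k*log(scale) + c; k is the empirical exponent for this regime.
    k, c = np.polyfit(np.log(scale), np.log(slowdown), 1)
    print(round(k, 2))   # close to the assumed 1.3; a different regime would give a different k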

What matters is not the precise numerical form, but the persistent directionality: scale introduces instability faster than it introduces usable learning signal.


8. Why Training Duration Explodes

When stability degradation and entropy accumulation are considered together, long training times cease to be mysterious.

As models grow, more compute is required not only to learn, but to protect what has already been learned. Progress slows as increasing effort is devoted to maintaining coherence rather than advancing it. Training timelines stretch accordingly.

Reports of multi-year training horizons for future frontier models are therefore not anomalies. They are the natural consequence of pushing current scaling strategies beyond their stable operating envelope.


9. Coordination Limits vs Structural Limits

Some contributors to training slowdown are contingent. Better hardware, improved networking, and more robust orchestration can mitigate coordination overhead.

Others appear persistent. Even with perfect coordination, learning still requires time to stabilise new structure under noise. There is a limit to how quickly coherence can form, regardless of how much compute is applied in parallel.

This suggests the existence of a genuine speed limit on learning—one that is not reducible to engineering inefficiency alone.


10. Implications for AI Development and Safety

These constraints have implications beyond training logistics.

Optimisation pressure increases with scale, while stability mechanisms become harder to maintain. Instability during training mirrors instability in deployed behaviour. Systems that are difficult to stabilise internally are difficult to regulate externally.

This suggests that alignment and safety cannot be deferred until deployment. Regulation mechanisms must be embedded into training itself, or instability will manifest before systems ever reach the field.


11. Conclusion

AI training cannot be arbitrarily accelerated because learning is not just computation. It is the formation of coherent internal structure under entropy.

Scaling strategies that ignore stability and time will continue to encounter exploding training durations, rising failure rates, and diminishing returns. Optimisation without adequate regulation fails—not as a moral claim, but as a systems fact.

The challenge ahead is not merely to build larger models faster, but to understand and respect the limits imposed by coherence itself.


Scope and Limitations

This paper focuses on large-scale neural network training and does not claim universal applicability across all optimisation systems. Quantitative models are intentionally avoided to emphasise structural patterns rather than formal laws. Future work may explore formalisation, empirical testing, and cross-domain comparison.


References

Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.
Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models. arXiv:2203.15556.
Additional frontier model reports and public disclosures (OpenAI, Anthropic, DeepMind).