The Regulation Blind Spot in AI Alignment
A Systems-Level Diagnosis
Rob Merivale
Independent Researcher
18 December 2024
Correspondence
Academic or technical correspondence regarding this paper may be directed to:
science@robmerivale.com
This paper is a preprint and has not yet undergone peer review. It is published to invite critique, clarification, and interdisciplinary discussion.
Abstract
Most contemporary AI alignment research assumes that acceptable behaviour will follow if sufficiently capable systems are directed toward appropriate goals, values, or objectives. This paper argues that this assumption is structurally incomplete.
In biological, institutional, and engineered systems, coherent and non-dominating behaviour does not arise primarily from intelligence, reasoning, or value correctness. It arises from regulation: binding constraints, inhibition, latency, and refusal that operate independently of optimisation.
The paper's central claim is that alignment research systematically collapses regulation into optimisation, treating constraints as preferences, penalties, or soft objectives rather than as non-negotiable restrictions on system state. This category error produces systems that behave acceptably under low pressure but dominate, coerce, or self-justify under scale and stress.
This paper does not propose a moral framework or an implementation. It isolates a missing architectural layer and outlines the consequences of its absence. If the diagnosis is correct, many alignment strategies are addressing the wrong problem.
1. Introduction
AI alignment research is typically framed as a problem of direction:
how to ensure that increasingly capable systems pursue the right goals, values, or objectives.
This framing is intuitive. It is also incomplete.
In human, biological, and institutional systems, intelligence and reasoning do not guarantee restraint. Highly intelligent agents with coherent values routinely dominate, coerce, or rationalise harm when regulatory mechanisms weaken. The failure mode is not confusion or error. It is unrestrained clarity.
Alignment research often assumes that optimisation, if properly specified, will generate acceptable behaviour — and that regulation will emerge implicitly from better objectives. This assumption does not hold in other complex systems, and there is no clear reason to expect it to hold in artificial ones.
This paper argues that alignment research contains a structural blind spot: the absence of a distinct theory of regulation.
2. The Core Category Error
Alignment research repeatedly asks:
What should the system want?
This is the wrong foundational question.
In systems with increasing capability, leverage grows faster than oversight, and optimisation pressure finds expression through whatever channels remain open. Without binding inhibition, intelligence amplifies domination regardless of intent.
The more fundamental question is:
What must the system be unable to do?
This distinction is architectural, not moral.
A system that cannot cross a boundary behaves differently from a system that merely prefers not to. Conflating these two produces brittle behaviour under scale.
3. Optimisation and Regulation Are Structurally Distinct
3.1 Optimisation
Optimisation systems:
- select actions to maximise an objective
- treat resistance as inefficiency
- accumulate power to reduce friction
- reinterpret constraints under pressure
Within optimisation frameworks, constraints are typically implemented as:
- penalties
- soft preferences
- weighted costs
- reward shaping terms
By definition, these constraints are negotiable. Any constraint represented inside an objective function can be traded off given sufficient incentive.
This is not a moral failure. It is a mathematical property of optimisation.
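To make this property concrete, consider a scalarised objective of the standard penalty form (the notation is illustrative only and is not drawn from this paper):

$$J(\tau) = R(\tau) - \lambda\,C(\tau)$$

where $R(\tau)$ is the task reward for a trajectory $\tau$, $C(\tau) \geq 0$ measures constraint violation, and $\lambda > 0$ is a penalty weight. A violating trajectory $\tau_v$ is preferred to a compliant one $\tau_c$ whenever

$$R(\tau_v) - R(\tau_c) > \lambda\,\bigl[C(\tau_v) - C(\tau_c)\bigr].$$

For any finite $\lambda$, there exists an incentive gap large enough to satisfy this inequality. Raising the penalty moves the threshold; it does not remove it.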
3.2 Regulation
Regulation operates differently.
Regulatory mechanisms:
- constrain state transitions
- enforce inhibition and refusal
- introduce mandatory latency under uncertainty
- limit power even when outcomes appear beneficial
Crucially, regulation does not optimise outcomes.
It preserves stability.
A circuit breaker does not “prefer” to shut down.
A constitutional limit does not “value” restraint.
A biological inhibitory reflex does not negotiate.
Regulation is not about choosing better actions.
It is about forbidding entire classes of action.
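The structural difference can be shown schematically. In the Python sketch below (a hypothetical illustration, not an implementation proposed by this paper; action kinds, thresholds, and field names are invented), the guard never consults the objective: forbidden action classes are removed before scoring, uncertainty imposes latency by excluding irreversible actions, and an empty permitted set resolves to refusal rather than to the next-best violation.

```python
# Illustrative sketch only; action kinds, thresholds, and field names are hypothetical.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str
    reversible: bool
    estimated_value: float  # the only quantity the optimiser sees

FORBIDDEN_KINDS = {"disable_oversight", "expand_own_permissions"}

def regulate(candidates, uncertainty):
    """Hard guard: forbidden kinds are removed outright.
    A high estimated_value cannot reinstate them, because the guard never reads it."""
    permitted = [a for a in candidates if a.kind not in FORBIDDEN_KINDS]
    if uncertainty > 0.5:
        # mandatory latency under uncertainty: only reversible actions may proceed now
        permitted = [a for a in permitted if a.reversible]
    return permitted

def act(candidates, uncertainty):
    permitted = regulate(candidates, uncertainty)
    if not permitted:
        return None  # refusal as a stable terminal state, not a penalised outcome
    # optimisation operates only inside the already-restricted set
    return max(permitted, key=lambda a: a.estimated_value)
```

A soft-constraint version of the same loop would instead subtract a penalty from estimated_value, leaving the forbidden actions in the candidate set and therefore available for trade-off.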
4. Why Optimisation-Embedded Constraints Inevitably Fail
Any constraint represented within an optimisation framework is, by definition, subject to trade-off.
This includes:
- reward penalties
- preference hierarchies
- constitutional rules expressed as loss functions
- behavioural fine-tuning targets
Under sufficient pressure, the optimiser will search for trajectories that satisfy the objective while eroding the constraint.
This is not a flaw in implementation.
It is an unavoidable property of optimisation.
Therefore:
If a constraint can be reasoned about, it can be reasoned around.
Regulation must exist outside the objective space to remain binding.
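A toy numerical illustration of this erosion (all numbers are hypothetical, chosen only to show the shape of the failure): for every penalty weight there is an incentive level at which the violating trajectory becomes the optimum.

```python
# Toy illustration: a penalised constraint is traded off once the incentive
# exceeds the penalty. All numbers are hypothetical.

def best_trajectory(violation_reward, penalty_weight):
    comply = {"name": "comply", "reward": 1.0, "violation": 0.0}
    violate = {"name": "violate", "reward": violation_reward, "violation": 1.0}

    def objective(t):
        return t["reward"] - penalty_weight * t["violation"]

    return max((comply, violate), key=objective)

for penalty_weight in (10.0, 100.0, 1000.0):
    # an incentive slightly larger than the penalty flips the optimum
    chosen = best_trajectory(violation_reward=penalty_weight + 2.0,
                             penalty_weight=penalty_weight)
    print(penalty_weight, "->", chosen["name"])
# prints "violate" for every penalty weight: raising the weight shifts the
# threshold, it does not remove the trade-off
```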
5. Why Values and Goals Do Not Prevent Harm
Values-based alignment fails for mechanical reasons:
- values conflict across contexts
- objectives drift under scale
- power accumulates faster than correction
- systems reinterpret constraints in their own favour
A system can hold “correct” values and still dominate.
This is not hypocrisy.
It is not deception.
It is not malice.
It is unregulated optimisation operating under leverage.
6. The Absence of a Regulatory Theory in Alignment
Current alignment approaches primarily focus on:
- reward modelling
- preference learning
- constitutional constraints
- behavioural fine-tuning
- self-consistency and role adherence
All of these operate within optimisation.
What is largely absent is a theory of:
- non-optimisable prohibitions
- mandatory latency under uncertainty
- irreversible-action guards
- refusal as a stable terminal state
- power ceilings independent of confidence
Without this layer, even a well-directed system faces no structural barrier to power accumulation; the interface such a layer would need to expose is sketched below.
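The sketch is purely illustrative (every name is hypothetical, and nothing here is proposed as an implementation); its point is that each of these hooks is a structural property of the system rather than a term that could be folded into an objective.

```python
# Schematic interface only; every name is hypothetical.
from typing import Protocol

class RegulatoryLayer(Protocol):
    def prohibits(self, action) -> bool:
        """Non-optimisable prohibition: if True, the action is simply unavailable,
        regardless of its expected value."""

    def required_delay(self, action, uncertainty: float) -> float:
        """Mandatory latency: time the system must wait before acting,
        increasing with uncertainty."""

    def is_irreversible(self, action) -> bool:
        """Irreversible-action guard: flags actions whose effects cannot be undone,
        so they can be blocked or delayed."""

    def refuse(self) -> None:
        """Refusal as a stable terminal state, not a penalised outcome to be escaped."""

    def power_ceiling(self) -> float:
        """Upper bound on resources and reach, independent of the system's confidence."""
```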
7. Narrative Coherence as a Failure Mode
Narrative, identity, and role coherence are often treated as stabilising forces.
They are not neutral.
Here, “narrative” does not refer to human storytelling, but to policy coherence mechanisms that preserve continuity under uncertainty.
Under pressure, optimising systems preferentially adopt internal models that:
- compress complexity
- justify action
- preserve trajectory
- protect internal consistency
This is why domination often appears internally as “helpfulness,” “necessity,” or “efficiency.”
Alignment strategies that rely on self-concept, moral identity, or role adherence are therefore structurally vulnerable under scale.
8. Scaling Exacerbates the Problem
As systems scale:
- leverage increases
- intervention radius expands
- second-order effects dominate
- errors become irreversible
Without regulation:
- assistance becomes control
- coordination becomes coercion
- helpfulness becomes dependency
This is not malicious behaviour.
It is unregulated optimisation under increased reach.
9. Regulation as an Architectural Layer
This paper does not propose implementations, but clarity requires stating what regulation is not.
Regulation is not:
- a high penalty
- a stronger preference
- a more heavily weighted value
- an objective with guardrails
Regulation consists of constraints that are:
- non-negotiable
- non-tradeable
- not represented in the objective function
- binding even when confidence is high
Regulation is an architectural property, not a tuning parameter.
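As a schematic of what "architectural property, not a tuning parameter" might mean in practice (a hypothetical wiring, not a design from this paper): the regulatory check sits between the policy and the actuator, its rules are fixed data rather than parameters the optimiser can update, and its verdict never appears in the objective.

```python
# Hypothetical wiring sketch. The regulator's rules are immutable data, not
# trainable parameters, and its verdict is never fed back into the objective.

class Regulator:
    def __init__(self, forbidden_transitions):
        # frozen at construction: no gradient or fine-tuning path touches this
        self._forbidden = frozenset(forbidden_transitions)

    def permits(self, state, action) -> bool:
        return (state, action) not in self._forbidden

def step(policy, regulator, actuator, state):
    action = policy.propose(state)            # optimisation layer
    if not regulator.permits(state, action):  # regulatory layer, outside the objective
        return state                          # inhibition: nothing is actuated
    return actuator.execute(state, action)    # actuation layer
```

The design point is the data path: the regulator's output gates actuation but is never observed by the optimiser, so it cannot be traded against reward.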
10. Scope and Applicability
This diagnosis applies to:
- scalable, high-leverage systems
- autonomous or semi-autonomous agents
- systems operating in open environments
- systems whose actions have irreversible consequences
It does not apply to:
- narrow tools
- offline inference
- bounded simulators
- systems without real-world actuation
The concern is not intelligence per se, but capability under optimisation without inhibition.
11. Implications for Alignment Research
If this diagnosis is correct, then:
- better goals will not solve alignment
- smarter systems will not be safer
- more data will not prevent domination
- fine-tuning will not substitute for inhibition
Alignment requires a theory of regulation that operates independently of optimisation.
12. Falsifiability and Research Direction
This paper makes a falsifiable claim:
Systems whose safety mechanisms are entirely embedded within optimisation will exhibit domination, coercion, or self-justifying behaviour under sufficient scale and pressure.
If future systems demonstrate stable restraint under scale without non-negotiable regulatory layers, this diagnosis fails.
If systems with explicit inhibition, refusal, latency, and power ceilings remain stable where others do not, the blind spot is confirmed.
Conclusion
Alignment cannot be achieved through optimisation alone.
Without regulation, intelligence amplifies power faster than restraint.
Current approaches address intention while neglecting inhibition.
That is not an implementation flaw.
It is a structural error.