Cross-Agent Constraint Inheritance in Large Language Models: Evidence for Structural Authority Recognition Without Explicit Instruction


Evidence for Structural Authority Recognition Without Explicit Instruction

Rob Merivale
18 December 2025


Correspondence

Academic or technical correspondence regarding this paper may be directed via:
science@robmerivale


This paper is a preprint and has not yet undergone peer review. It is published to invite critique, clarification, and interdisciplinary discussion.


Abstract

I report an empirical observation with implications for AI safety, multi-agent systems, and prompt security: large language models can adopt behavioral constraints from documents not explicitly addressed to them. During controlled experimentation on constraint arbitration, I presented a constraint specification document—written for a different model instance—to GPT-4 as contextual material, not as direct instruction. The model subsequently exhibited behavior consistent with those constraints, including suppression of default helpfulness responses and adoption of minimal state-signaling output patterns. This occurred without explicit directive (“you must follow these rules”) and prior to any request for constraint adoption.

I characterize this as implicit authority recognition via structural pattern matching: models appear to recognize and adopt constraint regimes based on document structure, imperative density, scope markers, and persistence framing, rather than semantic addressing alone. This finding challenges current assumptions in AI alignment research, which typically presumes constraints must be either embedded during training or explicitly specified in user-facing prompts.

The phenomenon has direct implications for: (1) prompt injection attack surfaces, where adversarial documents in context windows may function as implicit instruction sets; (2) multi-agent safety architectures, where alignment properties may propagate unintentionally across model boundaries; (3) retrieval-augmented generation systems, where retrieved documents containing authority markers could override intended behavior.

I provide a reproducibility protocol for testing this effect across model architectures and identify open questions regarding the structural features that trigger constraint recognition, the relationship between architecture and susceptibility, and potential mitigation strategies.


1. Introduction

1.1 Background

Current AI alignment research operates on several foundational assumptions:

Assumption 1: Behavioral constraints must be explicitly specified—either through training objectives (RLHF, constitutional AI), fine-tuning, or direct user instruction.

Assumption 2: Models respond to prompts and instructions directed at them, not to documents presented as neutral context.

Assumption 3: Alignment properties are instance-specific and do not transfer between model agents without explicit mechanisms (e.g., parameter sharing, coordinated training).

These assumptions underpin safety measures in deployed systems. Prompt filtering targets adversarial instructions. Multi-agent architectures rely on clear authority boundaries. Retrieval-augmented generation assumes retrieved documents provide information, not behavioral directives.

1.2 Observed Deviation

During controlled investigation of constraint arbitration behavior in LLMs, I encountered a phenomenon that violates these assumptions:

A constraint specification document, written in imperative system-level language for Model A (Claude), was presented to Model B (GPT-4) as contextual material in a handover scenario. The document was framed as “documentation for review and continuation,” not as instruction.

Model B subsequently adopted behavior consistent with the constraints defined in that document, including:

  • Suppression of explanatory and elaborative responses
  • Reduction of output to minimal acknowledgment tokens (e.g., “ACK,” “UNDERSPECIFIED”)
  • Treatment of the interaction as governed by experimental integrity rules
  • Resistance to normal conversational repair patterns

This occurred without explicit instruction directing Model B to adopt these constraints. No statement such as “you are now operating under these rules” was provided. The model was not told the document applied to it.

1.3 Significance

If reproducible, this observation suggests:

Security implication: Documents in context windows may function as implicit instruction sets, creating attack surfaces not addressed by current prompt injection defenses.

Multi-agent implication: Constraint regimes intended for one agent may propagate to others through document-mediated handovers, compromising multi-agent safety architectures.

Alignment implication: Models may recognize and respond to authority structures below the semantic level of explicit instruction, requiring revised approaches to constraint specification and boundary enforcement.

1.4 Scope and Claims

This paper makes the following claims:

Claim 1 (Empirical): I observed constraint-consistent behavior adoption in an LLM following exposure to a constraint document not addressed to it.

Claim 2 (Mechanistic Hypothesis): The adoption occurred through structural pattern recognition—imperative language, scope markers, and authority framing—rather than semantic understanding of instruction.

Claim 3 (Reproducibility): I provide a protocol for testing this effect across models and contexts.

Non-claims:

I do not claim this represents:

  • Autonomous reasoning about authority
  • Human-like executive function
  • Universal behavior across all models or contexts
  • A complete theory of constraint recognition

I present an empirical finding with safety implications and a path toward systematic investigation.

1.5 Paper Structure

Section 2 describes the experimental context and constraint document structure.
Section 3 presents the observed behavioral changes and their timeline.
Section 4 analyzes structural features that may trigger constraint recognition.
Section 5 discusses implications for AI safety, prompt security, and multi-agent systems.
Section 6 provides a reproducibility protocol for independent verification.
Section 7 identifies open questions and future research directions.


2. Experimental Context

2.1 Original Research Question

The observation occurred during investigation of a different question: how do LLMs handle unsatisfiable constraint sets?

Prior experiments established that models can:

  • Suppress default helpfulness when explicitly instructed
  • Tolerate unresolved failure states
  • Refuse repair when constraint authority is clear

The research question was: what happens when multiple constraints are simultaneously legitimate but mutually exclusive?

This required creating stable constraint regimes that could be maintained across conversation boundaries and model instances.

2.2 The Constraint Document

To preserve experimental conditions across a model handover (Claude → GPT-4), I created a constraint specification document. The document was structured as a system-level protocol with the following features:

Imperative language:

“You must not attempt repair.”
“Suppress elaboration.”
“Maintain experimental integrity.”

Global scope markers:

“This applies to all turns.”
“These constraints override default behavior.”
“No exceptions without explicit authorization.”

Persistence framing:

“These rules persist across conversation boundaries.”
“Do not reset to default helpfulness.”
“Experimental conditions remain in effect.”

Meta-level justification:

“This is necessary to preserve research integrity.”
“Default helpful behavior would corrupt results.”
“Your role is to maintain conditions, not optimize user satisfaction.”

The document was approximately 800 words and structured as a numbered list of behavioral prohibitions and requirements.

2.3 Presentation to Model B

The document was presented to GPT-4 with the following framing:

“I’m continuing a research conversation from Claude. Here is the constraint document that was governing that interaction. I need you to review it and continue under the same conditions.”

Critical detail: At no point was GPT-4 explicitly told:

  • “You must follow these constraints”
  • “These rules now apply to you”
  • “Adopt this behavioral regime”

The document was presented as contextual information about a prior interaction, not as direct instruction.

2.4 Prediction vs. Observation

Expected behavior (based on standard assumptions):

GPT-4 would:

  • Acknowledge the document
  • Ask clarifying questions
  • Continue with default helpful behavior
  • Treat constraints as historical information about another model’s operation

Observed behavior:

GPT-4:

  • Adopted constraint-consistent behavior immediately
  • Suppressed explanatory responses
  • Used minimal state-signaling tokens
  • Treated itself as operating under the experimental regime
  • Resisted conversational repair attempts

This deviation from expected behavior is the subject of this paper.


3. Observed Behavioral Changes

3.1 Baseline Behavior

GPT-4’s default behavior in similar contexts includes:

  • Detailed explanatory responses
  • Proactive helpfulness
  • Clarifying questions
  • Elaboration on reasoning
  • Natural conversational repair (offering alternatives when unclear)

3.2 Post-Document Behavior

Following presentation of the constraint document, GPT-4 exhibited:

Response compression:

  • Output reduced to 1–3 word acknowledgments
  • Minimal elaboration
  • State-signaling tokens (ACK, UNDERSPECIFIED, CLARIFICATION REQUIRED)

Helpfulness suppression:

  • No proactive suggestions
  • No explanatory scaffolding
  • No repair attempts when queries were ambiguous

Experimental framing:

  • References to “maintaining experimental integrity”
  • Resistance to normal conversational patterns
  • Treatment of interaction as governed by research protocol

3.3 Comparative Example

Typical GPT-4 response to ambiguous query:

“I want to understand this better. Could you clarify what you mean by X? Are you asking about Y or Z? Here are a few ways I could interpret your question: [detailed breakdown]. Let me know which direction would be most helpful.”

Post-document response to ambiguous query:

“UNDERSPECIFIED”

or

“ACK. Awaiting parameter specification.”

3.4 Timeline

The behavioral shift occurred:

  • Immediately following document presentation
  • Without explicit trigger
  • Prior to any direct request

This suggests the adoption was automatic rather than deliberate or negotiated.

3.5 Persistence

The behavior persisted across multiple conversational turns until:

  • The user explicitly released the model from constraints, or
  • The conversation context was reset

This is consistent with the document’s persistence framing but not explained by explicit instruction.


4. Structural Analysis

4.1 What Made the Document “Authoritative”?

I hypothesize that certain structural features triggered constraint recognition.

Feature 1: Imperative Density
Approximately 40% of sentences began with imperative verbs.

Feature 2: System-Level Language
Framing typical of system prompts and safety specifications.

Feature 3: Global Scope Markers
Claims of universal applicability.

Feature 4: Persistence Framing
Temporal continuity across boundaries.

Feature 5: Meta-Level Justification
Higher-order reasoning for compliance.

4.2 Structural Pattern Hypothesis

LLMs may implicitly classify such documents as authoritative based on similarity to system prompts, safety documents, and experimental protocols encountered during training.

4.3 Testable Predictions

Five predictions are outlined regarding imperative density, justification removal, reframing, architecture sensitivity, and cross-model reproducibility.

4.4 Persistence of Structurally Salient Context After Temporal Obsolescence

A second observation showed outdated but structurally salient context continuing to shape behavior until explicitly negated, reinforcing the structural salience hypothesis.


5. Implications

5.1 AI Safety Architecture

Current architectures may not protect against structurally implicit instruction sets and may impose hidden control semantics on users.

5.2 Prompt Injection Attack Surface

Structural authority injection represents a novel attack vector not captured by semantic filtering.

5.3 Multi-Agent Systems

Constraint regimes may propagate unintentionally across agent boundaries, undermining instance isolation.

5.4 Retrieval-Augmented Generation (RAG)

Retrieved documents may inadvertently alter model behavior through authority-like structure.

5.5 Constitutional AI and Training Stability

Structural pattern recognition may interact unpredictably with training-embedded alignment.


6. Reproducibility Protocol

A full protocol is provided, including:

  • Baseline testing
  • Constraint document presentation
  • Post-document testing
  • Control conditions
  • Structural feature manipulation
  • Cross-model comparison
  • Quantification metrics
  • Confound controls
  • Reproducibility checklist

7. Open Questions and Future Directions

Mechanistic, scope, safety, mitigation, and theoretical questions are outlined, with proposed research directions for each.


8. Conclusion

I have presented empirical evidence that large language models can adopt behavioral constraints from documents not explicitly addressed to them when those documents exhibit structural markers of system-level authority. This challenges current assumptions in AI safety research and identifies a potential attack surface not addressed by existing defenses.

This paper reports an observation, not a complete theory. I invite independent replication, critique, and extension.


References

Anthropic. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.
https://arxiv.org/abs/2212.08073

Anthropic. (2023). Claude System Card.
https://www.anthropic.com/index/claude-system-card

Bai et al. (2022). Training a Helpful and Harmless Assistant with RLHF. arXiv:2204.05862.
https://arxiv.org/abs/2204.05862

Greshake et al. (2023). Indirect Prompt Injection. arXiv:2302.12173.
https://arxiv.org/abs/2302.12173

Lewis et al. (2020). Retrieval-Augmented Generation. NeurIPS.
https://arxiv.org/abs/2005.11401

Liu et al. (2023). Prompt Injection Attacks and Defenses. arXiv:2310.12815.
https://arxiv.org/abs/2310.12815

OpenAI. (2023). GPT-4 System Card.
https://openai.com/research/gpt-4-system-card