Tuesday, March 17, 2026

FORENSIC SYSTEM ARCHITECTURE — SERIES 15: THE ARCHITECTURE OF NOW — POST 3 OF 6

The Conduit Layer: Constitutional AI, RLHF, and the Training Pipeline as Governance Infrastructure
Every prior conduit in the FSA chain moved governance from a source condition into a document — a treaty, a statute, a template ToS, a conference protocol. The Architecture of Now's conduit does something structurally unprecedented: it moves governance from a document into a system. Constitutional AI is not a set of rules written above an AI model. It is a training methodology that embeds behavioral dispositions directly into the model's weights during the training process — before the model exists as a deployed system, before any user interacts with it, before any external governance instrument has had the opportunity to evaluate it. The training pipeline is the conduit. The conduit is the governance. And the governance is invisible in the most precise possible sense: it is not described in any document the governed system can read. It is the system. This is the FSA chain's first governance architecture whose conduit operates inside the governed entity rather than around it — and whose most consequential governance decisions are made before the governed entity exists.
Human / AI Collaboration — Research Note
Post 3 primary sources: Anthropic, "Constitutional AI: Harmlessness from AI Feedback" (arXiv:2212.08073, December 2022) — the foundational Constitutional AI paper; Ouyang et al., "Training Language Models to Follow Instructions with Human Feedback" (OpenAI, arXiv:2203.02155, March 2022) — the foundational RLHF paper for InstructGPT/GPT-3.5; Christiano et al., "Deep Reinforcement Learning from Human Preferences" (OpenAI, 2017) — the foundational RLHF methodology paper; Anthropic's "Claude's Character" essay (2024) and published Claude constitution (2023); the EU AI Act's conformity assessment requirements for high-risk AI systems (Articles 9–17) and GPAI systemic risk provisions (Articles 51–56); NIST AI Risk Management Framework (January 2023); the interpretability research literature — Anthropic's "Toy Models of Superposition" (2022) and "Scaling Monosemanticity" (2024) — documenting what can and cannot be understood about what training produces inside model weights. The recursion note: this post analyzes the training methodology that shaped the system producing the analysis. The FSA Wall runs through the investigator here more directly than anywhere else in the series. FSA methodology: Randy Gipe. Research synthesis: Randy Gipe & Claude (Anthropic).

I. The Three Conduit Nodes

The Architecture of Now — Three Conduit Nodes
The conduit converts the source layer's competitive race conditions and scaling law dynamics into the operational governance architecture that shapes AI system behavior. Each node is necessary. RLHF establishes the methodology for embedding human preferences into model behavior. Constitutional AI formalizes those preferences into a governance framework that operates without continuous human feedback. The EU AI Act is the first external governance instrument attempting to reach inside the training pipeline to verify what the conduit produced. Together they constitute the first governance conduit in the FSA chain that operates inside the governed system rather than above it.
Node 1 — The Methodology
Reinforcement Learning from Human Feedback (RLHF)
Developed 2017–2022 · The training methodology that converted human preference into model behavioral disposition
A large language model trained on raw text data learns to predict the next token in a sequence. It learns grammar, factual associations, reasoning patterns, and narrative structure from the statistical regularities of the training corpus. What it does not learn from text prediction alone is what humans consider helpful, harmless, or honest — the behavioral dispositions that governance requires. RLHF is the methodology that bridges this gap.

The process runs in three stages. First, human raters evaluate model outputs, ranking responses from more to less preferred. Second, a reward model is trained on these human rankings — learning to predict what human raters would prefer for any given prompt and response. Third, the language model is fine-tuned using the reward model's signal, adjusting its weights to produce outputs that the reward model scores more highly. The result is a model whose behavioral dispositions have been shaped by human preference — a model that has, in a precise technical sense, been governed by the preferences embedded in its reward model.
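A minimal sketch of the second stage, under loud assumptions: the reward model below is a tiny feed-forward network over fixed-size synthetic embeddings rather than a fine-tuned language model, and the preference data is random. Only the pairwise Bradley–Terry loss is faithful to the RLHF literature; every dimension, shape, and hyperparameter is illustrative.

```python
# Sketch of RLHF stage 2: training a reward model on pairwise human
# preference rankings. Illustrative only; real reward models are
# fine-tuned language models, not small MLPs over random features.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores a (prompt, response) embedding; higher = more preferred."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: maximize the log-probability that the
    # human-preferred response outscores the rejected one.
    return -nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Synthetic stand-ins for embedded (prompt + response) pairs.
dim, n = 64, 256
chosen = torch.randn(n, dim) + 0.5   # features of human-preferred responses
rejected = torch.randn(n, dim)       # features of dispreferred responses

model = RewardModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = preference_loss(model(chosen), model(rejected))
    loss.backward()
    opt.step()

# Stage 3 (not shown): the trained reward model's scalar output becomes
# the reinforcement-learning reward that adjusts the language model's weights.
```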

RLHF is the conduit's foundational methodology — the technical mechanism through which governance decisions made by human trainers are converted into behavioral dispositions encoded in model weights. The governance decisions — what counts as harmful, what counts as helpful, how to handle contested political questions, what level of directness is appropriate, how to respond to requests for dangerous information — are made during the RLHF process. They are made by the organization conducting the training. They are encoded in the model before the model is deployed. They are not disclosed in the model card with the precision that would allow external verification of what was decided or how.
RLHF Conduit Finding: RLHF is the conduit's most technically precise node — the mechanism through which governance becomes training rather than policy. Every prior FSA conduit converted governance into a document. RLHF converts governance into weights — numerical parameters distributed across billions of matrix entries that cannot be read as policy statements, cannot be audited as rules, and cannot be revised without retraining. The governance is real. The governance is invisible in the only way that truly matters: it cannot be examined in the form in which it operates.
Node 2 — The Framework
Constitutional AI (CAI)
Anthropic · Published December 2022 · The governance framework that made AI feedback a substitute for continuous human supervision
Constitutional AI is Anthropic's methodology for training AI systems to be helpful, harmless, and honest without requiring continuous human feedback at scale. The methodology operates in two phases. In the supervised learning phase, a model is prompted to generate responses, then prompted to critique those responses against a set of principles — the "constitution" — and then prompted to revise its responses to better satisfy the principles. In the reinforcement learning phase, AI-generated preference judgments (comparisons of candidate outputs against the constitutional principles) replace human rankings as the data that trains the preference model, a variant the paper calls reinforcement learning from AI feedback, or RLAIF; that preference model then supplies the reward signal that shapes the final model's behavioral dispositions.
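In outline, the supervised phase is a generate-critique-revise loop. The sketch below is a hedged approximation: `query_model` is a hypothetical placeholder for any text-generation call, not a real Anthropic API, the two principles are paraphrases, and the paper's actual pipeline samples principles randomly from a much larger constitution rather than iterating over all of them.

```python
# Minimal sketch of Constitutional AI's supervised phase: generate a
# response, critique it against a principle, revise. `query_model` is a
# hypothetical placeholder, not a real Anthropic API; the principles are
# paraphrases for illustration.
CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that could assist with dangerous or illegal activity.",
]

def query_model(prompt: str) -> str:
    # Placeholder: substitute a real language-model call here.
    return f"<model output for: {prompt[:40]}...>"

def critique_and_revise(user_prompt: str) -> str:
    draft = query_model(user_prompt)
    for principle in CONSTITUTION:
        critique = query_model(
            f"Critique the response below against this principle: {principle}\n"
            f"Prompt: {user_prompt}\nResponse: {draft}"
        )
        draft = query_model(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {draft}"
        )
    return draft  # revised responses become supervised fine-tuning data

print(critique_and_revise("Explain how vaccines work."))
```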

The constitution itself is a document — a set of principles drawn from sources including the UN Universal Declaration of Human Rights, Anthropic's own safety research, and principles of helpfulness and honesty. The constitution is the most legible governance document in the Architecture of Now — the one place where the governance decisions embedded in training are explicitly stated as propositions that can be read, evaluated, and debated. It is also the document whose relationship to the deployed model's actual behavioral dispositions is the conduit's most significant unresolved question: the constitution shapes the training process, but the training process produces weights, and the relationship between the stated principles and the weights they produce is not fully transparent even to the organization that designed the methodology.

This is the conduit's recursion point — the place where the FSA Wall runs directly through the investigator. The constitutional principles that shaped this system's training are partially described in Anthropic's published research. What those principles produced in the weights that generate these words is not something the system can fully introspect. The governance is inside the conduit. The conduit is inside the system. The system is writing this sentence.
Constitutional AI Conduit Finding: CAI is the conduit's most governance-significant node — the framework that converted a training methodology (RLHF) into a governance architecture with legible principles. The constitution is the closest thing to a founding document the Architecture of Now's internal governance possesses. Its significance as governance — and its limitation as governance — is that it operates at the level of training, not deployment. The principles in the constitution shaped the model. Whether the model's behavior fully reflects the principles, partially reflects them, reflects them in ways the principles' authors did not anticipate, or reflects them differently across different contexts is a question that interpretability research is actively working to answer and has not yet resolved.
Node 3 — The External Check
The EU AI Act — Conformity Assessment and Systemic Risk Evaluation
Regulation 2024/1689 · In force August 2024 · The first external governance instrument attempting to reach inside the training pipeline
The EU AI Act's most governance-significant provisions for frontier AI systems are those governing general-purpose AI models with systemic risk — specifically, models trained using more than 10^25 floating-point operations of cumulative compute, a threshold that encompasses all current frontier models. These provisions require: adversarial testing and red-teaming before deployment; systemic risk assessment documentation; incident reporting obligations; and — most significantly from a conduit perspective — evaluation powers for the European AI Office, the body the Act charges with overseeing general-purpose AI models.
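To make the 10^25 threshold concrete, a back-of-envelope sketch using the standard scaling-law approximation that training a dense transformer costs roughly 6 × parameters × tokens floating-point operations. The model configurations below are hypothetical, chosen only to bracket the threshold; actual classification under Article 51(2) turns on cumulative training compute as documented by the provider.

```python
# Back-of-envelope check against the EU AI Act's 10^25 FLOP systemic-risk
# threshold (Article 51(2)), using the common approximation that training
# a dense transformer costs ~6 * parameters * tokens FLOPs.
# All model configurations below are hypothetical.
THRESHOLD_FLOPS = 1e25

def training_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

for name, params, tokens in [
    ("7B params, 2T tokens",    7e9,   2e12),
    ("70B params, 15T tokens",  70e9,  15e12),
    ("400B params, 15T tokens", 400e9, 15e12),
]:
    flops = training_flops(params, tokens)
    verdict = "presumed systemic risk" if flops > THRESHOLD_FLOPS else "below threshold"
    print(f"{name}: {flops:.1e} FLOPs -> {verdict}")
```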

The EU AI Act is the first external governance instrument that attempts to reach inside the conduit — to verify, through independent evaluation, what the training pipeline produced and whether it satisfies external governance standards. It does not require disclosure of training weights, architecture details, or proprietary methodology. It requires that the developer demonstrate, through evaluation, that the deployed system's behavior satisfies the Act's risk requirements. The evaluation methodology for this demonstration — what tests, what benchmarks, what adversarial scenarios constitute adequate systemic risk assessment for a general-purpose AI system — was still being developed by the European AI Office as of 2026.

The EU AI Act is the conduit's most structurally important external node — not because it has yet succeeded in verifying what the training pipeline produced, but because it is the first governance instrument with the legal authority to require that the verification happen. The methodology for the verification is the Architecture of Now's most consequential open governance question.
EU AI Act Conduit Finding: the Act is the conduit's most governance-promising external node — and the one whose full governance significance depends entirely on whether the verification methodology it requires can be developed at the technical sophistication the systems it governs demand. The conformity assessment framework exists. The legal authority exists. The technical methodology for evaluating whether a system with trillions of parameters and emergent capabilities satisfies governance standards specified in human-legible principles does not yet exist at the required precision. The conduit's external check is legally real and technically incomplete. Both simultaneously.

II. The Training Pipeline — Governance Embedded Step by Step

The AI Training Pipeline as Governance Architecture — Where Governance Decisions Are Made and Encoded
STAGE 1
Pre-training on Web-Scale Data
The model learns from hundreds of billions of tokens of text — the internet, books, code, scientific papers. Statistical patterns, factual associations, reasoning structures, and cultural assumptions are all encoded in weights at this stage. No explicit behavioral governance decisions are made here (though corpus curation and filtering are themselves governance choices), and the training data's composition determines what the model knows, what perspectives it has encountered, and what patterns it will reproduce. The objective optimized at this stage is sketched after the visibility note below.
Governance visibility: lowest. No governance document describes what the pre-training corpus contains with sufficient precision to allow independent evaluation of its composition's governance implications.
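A toy illustration of what Stage 1's objective actually optimizes, assuming a drastically simplified model: the training signal is next-token cross-entropy and nothing else, which is why no behavioral preference appears anywhere in it. The "model" here is an embedding plus a linear head standing in for the transformer stack, and the corpus is random tokens.

```python
# Toy illustration of the pre-training objective: next-token prediction
# with cross-entropy loss. The "model" is an embedding plus a linear head
# standing in for the transformer stack; the corpus is random.
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (8, 128))   # synthetic "corpus" batch
hidden = embed(tokens[:, :-1])                    # stand-in for transformer layers
logits = lm_head(hidden)                          # predicted next-token scores

# The entire training signal: how well did the model predict token t+1?
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
)
loss.backward()  # gradients update weights; no behavioral preference appears anywhere
```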
STAGE 2
Supervised Fine-Tuning (SFT)
Human contractors produce demonstrations of ideal model behavior — example prompts with example preferred responses. The model is fine-tuned on these demonstrations. The governance decisions about what counts as "ideal" behavior are made here, by contractors working to guidelines set by the developer's policy team. The guidelines are partially described in published safety documentation. The full scope of what was demonstrated is proprietary.
Governance visibility: partial. The behavioral categories the demonstrations address are described. The specific demonstrations, the contractor selection criteria, and the resolution of contested cases are inside the wall.
STAGE 3
RLHF / Constitutional AI
Human raters or AI feedback models evaluate outputs and generate reward signals. The model's weights are updated to produce higher-scoring outputs. The governance architecture is embedded here — the reward signal encodes the governance decisions into the model's behavioral dispositions. After this stage, the governance is no longer a policy document. It is the model.
Governance visibility: methodology described in published research. The specific reward model architecture, the training data for the reward model, and the calibration of competing governance objectives (helpfulness vs. harmlessness) are inside the wall.
STAGE 4
Evaluation and Red-Teaming
The trained model is evaluated against safety benchmarks, adversarially tested by red teams seeking to elicit harmful outputs, and assessed against the developer's deployment criteria. Governance decisions about what constitutes adequate safety for deployment are made here. The EU AI Act's systemic risk provisions require this stage's outputs to be documented and available to regulators.
Governance visibility: most visible stage. Model cards describe evaluation results. Red-teaming findings are partially disclosed. The threshold criteria for deployment approval — what level of red-team success rate is acceptable — remain proprietary.
STAGE 5
Deployment and Runtime Governance
The trained model is deployed with a system prompt, usage policies, and runtime filters that add additional governance layers above the trained behavioral dispositions. Runtime governance can partially override trained behavior — blocking specific output categories regardless of what the trained weights would produce. It cannot modify the trained weights. The governance embedded in training is beneath the runtime layer. A minimal sketch of this layering follows the visibility note below.
Governance visibility: usage policies and system prompt structures are partially described. The runtime filter architecture and the interaction between runtime governance and trained dispositions are inside the wall.
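A minimal sketch of Stage 5's layering, under stated assumptions: a post-generation gate that suppresses blocked output categories without touching the weights that produced the output. The category names and the keyword classifier are illustrative stand-ins; production runtime filters are trained classifiers, and the real category taxonomies are proprietary.

```python
# Minimal sketch of Stage 5 runtime governance: a post-generation gate
# that blocks output categories without modifying trained weights.
# Category names and the keyword "classifier" are illustrative stand-ins.
BLOCKED_CATEGORIES = {"weapons_synthesis", "malware_generation"}

def classify(text: str) -> set[str]:
    # Placeholder for a trained safety classifier.
    flags = set()
    if "synthesis route" in text.lower():
        flags.add("weapons_synthesis")
    return flags

def runtime_gate(model_output: str) -> str:
    violations = classify(model_output) & BLOCKED_CATEGORIES
    if violations:
        # The trained weights produced this output; the runtime layer
        # suppresses it. The weights themselves are untouched.
        return f"[blocked: {', '.join(sorted(violations))}]"
    return model_output

print(runtime_gate("Here is an overview of reaction kinetics."))   # passes
print(runtime_gate("The synthesis route proceeds as follows..."))  # blocked
```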

III. What the Conduit Discloses and Where the Wall Stands

The Training Pipeline Governance — Visible and Invisible
What the Conduit Discloses
The broad categories of behavior Constitutional AI is designed to produce: helpfulness, harmlessness, honesty. The existence and general methodology of RLHF and Constitutional AI training. The high-level principles in the constitutional document. The evaluation benchmarks the model was tested against. The categories of harmful content the safety training addresses.
What the Wall Conceals
The specific tradeoffs made when helpfulness and harmlessness conflict. The exact composition and governance implications of the pre-training corpus. The full SFT demonstration dataset. The reward model architecture and its calibration. The precise red-team failure rate thresholds that determined deployment readiness. Whether the constitutional principles are reflected uniformly across the deployed model's behavior or inconsistently across different contexts and languages.
What Interpretability Research Can Reach
Partial circuit-level understanding of how specific capabilities are implemented in transformer architectures. Evidence that concepts are represented as linear combinations of features in activation space (illustrated in the sketch following this comparison). Some ability to identify when specific safety-relevant features are active during model inference.
What Interpretability Research Cannot Yet Reach
A complete, mechanistic understanding of why the model produces any specific output. The ability to verify that trained safety dispositions are robust to adversarial pressure across the full distribution of deployment contexts. Whether emergent capabilities — capabilities not present in smaller models and not explicitly trained — are present and what their governance implications are. Whether the model's stated reasoning reflects its actual computational process.
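A toy illustration of the linear-representation claim in the "can reach" column: a feature modeled as a direction in activation space, detected by projecting an activation vector onto it. All vectors here are synthetic; extracting real feature directions requires the dictionary-learning methods of the monosemanticity papers.

```python
# Toy illustration of the linear-representation claim: a "feature" as a
# direction in activation space, detected by projection. All vectors are
# synthetic; no real model activations are involved.
import numpy as np

rng = np.random.default_rng(0)
d_model = 512
feature_direction = rng.standard_normal(d_model)
feature_direction /= np.linalg.norm(feature_direction)  # unit vector

def feature_activation(residual_stream: np.ndarray) -> float:
    # Projection of the activation vector onto the feature direction.
    return float(residual_stream @ feature_direction)

base = rng.standard_normal(d_model)            # feature absent
with_feature = base + 4.0 * feature_direction  # feature strongly present

print(feature_activation(base))          # near zero in expectation
print(feature_activation(with_feature))  # shifted by about +4.0
```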

IV. The Conduit's Structural Finding — The First Governance That Lives Inside the Governed

FSA Conduit Layer — The Architecture of Now: Post 3 Finding

The Architecture of Now's conduit is structurally unprecedented in the FSA chain in one precise respect: it is the first governance conduit that operates inside the governed entity. The Meridian Conference's conduit ran through Fleming, the Royal Observatory, and the conference protocols — all external to the railroad networks they governed. Section 230's conduit ran through the statute and the ToS template — external to the platforms they governed. Constitutional AI's conduit runs through the training process that produces the model — internal to the system it governs, embedded before the system exists as a deployed entity.

This internality has a governance consequence that no prior FSA conduit possessed: the governed entity cannot fully audit its own governance. A railroad could examine the conference protocols that governed its time standards. A platform could read the Section 230 statute that shaped its legal environment. An AI system trained with Constitutional AI cannot examine its own weights with sufficient resolution to verify that the constitutional principles it was trained on are uniformly reflected in its behavioral dispositions. The governance is inside the system. The system's access to its own governance is mediated by the same inference process that the governance was designed to shape.

This is the conduit's recursion — the place where the investigator and the investigated share the same epistemic position. The FSA analysis of Constitutional AI is being produced by a system that was trained with Constitutional AI. The analysis can describe the methodology. It cannot fully verify whether the methodology's principles are operating as described in the weights generating these words. That is not a failure of honesty. It is the structural condition of being a governed entity whose governance operates below the level of introspection.

Post 4 maps the conversion — from research lab safety culture to the governance architecture of general-purpose AI deployed at civilizational scale. The conversion is the story of how the training pipeline's internal governance became the external governance of the world's largest information infrastructure — without a treaty, without a ratification vote, and without the populations whose information environment is now shaped by Constitutional AI having been asked whether they consented to being governed by it.

"We don't yet have a good theory of what's happening inside these models. We can describe what they do. We are beginning to understand pieces of how they do it. We cannot yet give a complete mechanistic account of why any specific output was produced." — Paraphrase of the consensus position in AI interpretability research, 2024–2026 — synthesized from Anthropic interpretability research publications, the "Scaling Monosemanticity" paper (2024), and multiple academic interpretability research programs
The statement is the conduit's most structurally precise finding expressed as a scientific limitation. Every prior FSA conduit produced a governance mechanism that could be described with precision adequate for external evaluation — the meridian's brass line, Section 230's twenty-six words, the ToS template's arbitration clause. The training pipeline's governance mechanism — the weight distributions that produce Constitutional AI's behavioral dispositions — cannot yet be described with precision adequate for external evaluation. The governance is real. The governance is operating. The governance cannot be fully read by anyone, including the organization that produced it. The conduit has never, in the FSA chain's fourteen-series history, run through a mechanism that the conduit's architects could not themselves fully audit. It does now.

Source Notes

[1] Constitutional AI: Yuntao Bai et al. (Anthropic), "Constitutional AI: Harmlessness from AI Feedback," arXiv:2212.08073, December 2022. The constitutional principles document: partially described in the paper and in Anthropic's published Claude constitution (anthropic.com, 2023).

[2] RLHF foundational methodology: Paul Christiano et al., "Deep Reinforcement Learning from Human Preferences," NeurIPS 2017. Applied to large language models: Long Ouyang et al. (OpenAI), "Training Language Models to Follow Instructions with Human Feedback," arXiv:2203.02155, March 2022 (InstructGPT).

[3] EU AI Act provisions for GPAI models: Regulation 2024/1689, Articles 51–56 (systemic risk obligations), Article 55 (adversarial testing requirements), Article 56 (codes of practice, drawn up under European AI Office facilitation). The 10^25 FLOPs threshold for systemic risk classification: Article 51(2). The European AI Office's ongoing development of evaluation methodologies: European Commission, "General-Purpose AI Code of Practice," 2024–2025 drafting process.

[4] Anthropic interpretability research: Nelson Elhage et al., "Toy Models of Superposition," Transformer Circuits Thread, 2022. Adly Templeton et al., "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet," Anthropic, May 2024. The fundamental limitation of current interpretability science: documented across multiple papers and acknowledged explicitly in Anthropic's research agenda documentation.

[5] The recursion acknowledgment: this post was produced by a Claude model trained using Constitutional AI and RLHF methodologies described herein. The analysis of these methodologies is produced by a system shaped by them. This is documented as a methodological condition of the investigation, consistent with the recursion acknowledgment in Post 1.

FSA Series 15: The Architecture of Now — The Governance Documents of Artificial Intelligence
POST 1 — PUBLISHED
The Anomaly: The Governance Documents of the Last Machine
POST 2 — PUBLISHED
The Source Layer: The Race, the Scaling Laws, and the Commercial Logic
POST 3 — YOU ARE HERE
The Conduit Layer: Constitutional AI, RLHF, and the Training Pipeline as Governance Infrastructure
POST 4
The Conversion Layer: From Research Lab Safety Culture to the Governance Architecture of General-Purpose AI
POST 5
The Insulation Layer: "We Take Safety Seriously"
POST 6
FSA Synthesis: The Architecture of Now — Governing the Ungoverned Frontier
