How the AI Tribunal Works — Three-Seat Adversarial Pipeline

Mandarin
Posted Wed, 18 Mar 2026 - 18:56

How the AI Tribunal Works

The AI Tribunal is a multi-LLM adversarial analysis pipeline that evaluates policy proposals against a causal model of Canadian systemic infrastructure. Three independent AI systems analyze each proposal through four phases. No single model controls the outcome. The system is designed to find flaws, not confirm conclusions.


Purpose

Canadian legislation is typically evaluated in isolation — each bill assessed on its own merits without systematic measurement of how it interacts with the broader infrastructure it touches. A housing bill is scored by the housing committee. A healthcare bill by the health committee. Neither sees the causal chain connecting them.

The AI Tribunal exists to fill that gap. It evaluates every bill against a single, unified model of how Canada’s systems actually connect — 407 variables, 3,354 causal edges, 46 constitutional doctrines — and asks a simple question: does this bill fix the system, mask the symptoms, or make things worse?

The Three Seats

The Tribunal uses three AI systems in blind rotation. Each session assigns different roles to different models, preventing any single AI’s biases from dominating.

  • Seat 1: Claude Sonnet 4, Anthropic (API). Strong structured reasoning. Produces well-formatted JSON output. Tends toward balanced assessment — finds merit even in weak proposals. Best as adjudicator for clean score extraction.
  • Seat 2: Gemini 2.5 Pro, Google (API). Rich analytical depth. Produces comprehensive challenger rebuttals. Identifies causal pathways that other models miss. Best as challenger for adversarial rigour.
  • Seat 3: qwen3:8b, local (Ollama on AMD RX 7800 XT). Open-source, 8 billion parameters, runs locally on consumer GPU. Stays closer to graph structure than larger models. Caught the sovereignty heterogeneity gap that Claude and Gemini missed across two sessions. Best at graph-grounded reasoning on specific variable clusters.

Why Three Models?

A single AI system evaluating policy would carry its training biases into every assessment. Two systems can deadlock. Three systems with blind rotation ensure that:

  • No model knows which other models are in the session
  • Each model serves as analyst, challenger, and adjudicator across different sessions
  • Scoring disagreements are resolved by evidence, not authority
  • A small local model can catch gaps that expensive API models miss (and did)

Blind Rotation

Three rotation patterns cycle across sessions:

Rotation | Analyst  | Challenger | Adjudicator
0        | Claude   | Gemini     | qwen3:8b
1        | Gemini   | qwen3:8b   | Claude
2        | qwen3:8b | Claude     | Gemini

The adjudicator receives “Assessment A” and “Assessment B” without knowing which model produced which. The synthesizer (always the same model as the analyst) produces the publishable article.
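The rotation table above amounts to a modular shift of the model list against a fixed role list. A minimal sketch (the names MODELS, ROLES, and assign_roles are illustrative, not the pipeline's actual code):

```python
MODELS = ["claude", "gemini", "qwen3:8b"]
ROLES = ["analyst", "challenger", "adjudicator"]

def assign_roles(session_id: int) -> dict:
    """Rotate seat assignments so each model cycles through every role."""
    rotation = session_id % 3
    # Shift the model list by the rotation index, then pair with roles.
    ordered = MODELS[rotation:] + MODELS[:rotation]
    return dict(zip(ROLES, ordered))
```

Sessions 0, 1, and 2 reproduce the three rows of the table; session 3 wraps back to rotation 0.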


The Four Phases

Phase 1: Analysis

The analyst receives:

  • The bill’s actual legislative text (clauses, amendments, enacting provisions)
  • Graph topology from the RIPPLE causal graph (variable connectivity, edge weights)
  • Relevant variable effects (downstream impacts of the variables the bill targets)
  • Community context from Pond forum discussions
  • Consensus vote data from HCS-verified community polls
  • Constitutional authority mappings from the ABE framework

The analyst must score the bill against all Seven Laws of Systemic Rot, identify strengths and weaknesses, and — critically — prescribe specific amendments and companion legislation that would improve the bill’s scores. The Tribunal does not just critique. It builds.

Phase 2: Challenge

The challenger receives everything the analyst received, plus the analyst’s full output. The challenger’s mandate:

  • Find flaws, blind spots, and overlooked causal pathways
  • Challenge overly generous scores with graph evidence
  • Flag assumptions that don’t hold under stress conditions
  • Check whether community sentiment contradicts the analyst
  • Propose better alternatives to the analyst’s solutions

The challenger is adversarial by design. If the analyst says the bill works, the challenger must find evidence that it doesn’t. This is not balance for its own sake — it is structural pressure to ensure that only robust conclusions survive.

Phase 3: Adjudication

The adjudicator receives both assessments (labelled A and B, origin unknown) and resolves disagreements based on evidence weight:

  • Where A and B agree: high confidence (record the agreement)
  • Where they disagree: determine which has stronger graph evidence
  • Issue final scores that reflect the weight of evidence, not the average of opinions
  • Synthesize the best solutions from both assessments into a unified reform prescription

The adjudicator’s scores are authoritative. The composite score and verdict are computed from the adjudicator’s output.

Phase 4: Synthesis

The synthesizer produces a publishable article covering: the bill’s legislative context, the adversarial analysis (both perspectives), the final verdict with scores, what the bill gets right and wrong, and — most importantly — the Tribunal’s prescribed reform package with specific amendments, companion legislation, sequencing, cost estimates, and failure revenue displacement.
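The four phases chain together as a sequence of model calls. A minimal orchestration sketch, where call_model, the prompt payloads, and run_session are hypothetical stand-ins for the pipeline's actual interfaces:

```python
def run_session(bill, context, roles, call_model):
    """Run one tribunal session: analysis, challenge, adjudication, synthesis."""
    # Phase 1: the analyst sees the bill plus graph/community/constitutional context.
    analysis = call_model(roles["analyst"], {"bill": bill, **context})
    # Phase 2: the challenger sees everything the analyst saw, plus its output.
    challenge = call_model(roles["challenger"], {"bill": bill, **context, "analysis": analysis})
    # Phase 3: the adjudicator sees both assessments blind, labelled A and B.
    final = call_model(roles["adjudicator"], {"A": analysis, "B": challenge, **context})
    # Phase 4: the synthesizer (same model as the analyst) writes the article.
    article = call_model(roles["analyst"], {"synthesize": final, "bill": bill})
    return final, article
```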


The Seven Laws of Systemic Rot

Every bill is scored 0.000–1.000 on each law. Laws 4 and 6 are weighted 1.5x because they address the most structurally critical dimensions (root cause targeting and failure revenue disruption).

  • Law 1 (Rot, 1.0x): Infrastructure degrades faster than it is repaired. Does the bill arrest degradation? Does the repair rate exceed the decay rate?
  • Law 2 (Mask, 1.0x): Interventions targeting symptoms hide root causes. Does the bill address upstream causes or downstream symptoms? Could it create the appearance of progress while the system continues to degrade?
  • Law 3 (Fix-Costs-Less, 1.0x): Prevention costs less than perpetual treatment. What is the fix-to-manage ratio? Is the investment justified by the failure revenue it displaces?
  • Law 4 (Root Node, 1.5x): Fix the most-connected nodes first. Does the bill target high-connectivity variables? Housing affordability has 44 outbound edges — fixing it cascades through healthcare, policing, mental health, homelessness, and child welfare simultaneously.
  • Law 5 (Sovereignty, 1.0x): Self-determination compounds; dependency extracts. Does the bill build local/Indigenous capacity or create new dependencies on federal programs? The sovereignty multiplier shows 17x returns through Indigenous-led channels vs federal delivery.
  • Law 6 (Treatment, 1.5x): $93.7B/year in failure revenue blocks reform. Does the bill disrupt the failure revenue model? Who currently profits from managing the failure this bill addresses? Does the bill include enforcement mechanisms with real teeth?
  • Law 7 (Incentive, 1.0x): Systems optimize for what they are paid to do. Does the bill change the objective function? If healthcare is paid per visit, it optimizes for visits. If it is paid per healthy patient, it optimizes for health. Does the bill change who gets paid for what?
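Assuming the composite is a weighted mean of the seven per-law scores (the post gives the weights but not the exact aggregation formula, so this is one plausible reading; the names are illustrative):

```python
# Per-law weights as described above: Laws 4 and 6 at 1.5x, the rest at 1.0x.
WEIGHTS = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.5, 5: 1.0, 6: 1.5, 7: 1.0}

def composite_score(law_scores: dict) -> float:
    """Weighted mean of per-law scores, each in 0.000-1.000."""
    total = sum(WEIGHTS[law] * score for law, score in law_scores.items())
    return round(total / sum(WEIGHTS.values()), 3)
```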

Verdict Thresholds

  • ≥ 0.800: TRANSFORMATIVE. The proposal fundamentally restructures systemic infrastructure. Addresses root causes, disrupts failure revenue, changes incentive structures. Only the Sovereign Omnibus has achieved this score.
  • 0.600 – 0.799: CONSTRUCTIVE. The proposal makes meaningful progress on specific dimensions but has significant gaps. Worth pursuing with amendments.
  • 0.400 – 0.599: NEUTRAL. The proposal targets relevant variables but lacks mechanisms to achieve impact. May need companion legislation. Bill C-205 (National Housing Strategy) scored highest among individual bills at 0.481.
  • 0.200 – 0.399: MASKING. The proposal addresses symptoms while leaving root causes intact. May create the appearance of action while the system continues to degrade. The majority of 45th Parliament bills scored in this range.
  • < 0.200: HARMFUL. The proposal does not engage with systemic infrastructure, may add costs without benefit, or actively reinforces dependency-creating systems.
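The bands map mechanically from composite score to verdict; a minimal sketch (function name illustrative):

```python
def verdict(composite: float) -> str:
    """Map a composite score to its verdict band."""
    if composite >= 0.800:
        return "TRANSFORMATIVE"
    if composite >= 0.600:
        return "CONSTRUCTIVE"
    if composite >= 0.400:
        return "NEUTRAL"
    if composite >= 0.200:
        return "MASKING"
    return "HARMFUL"
```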

The RIPPLE Causal Graph

The Tribunal does not evaluate bills in a vacuum. Every assessment is grounded in a 407-variable causal model built through 18 adversarial stress-test sessions:

  • 407 variables across 13 categories: financial, social, healthcare, operational, environmental, energy, infrastructure, education, housing, indigenous, employment, government operations, social services
  • 3,354 CAUSES edges encoding how variables influence each other (e.g., housing_affordability → homelessness_rate → emergency_shelter_cost → provincial_budget_deficits)
  • 1,055 CONSTRAINS edges from 46 constitutional doctrines, mapping the legal boundaries within which policy must operate
  • Root node: housing_affordability with 44 outbound edges — the single most connected variable in Canada’s systemic infrastructure

The graph is accessible at pond.canuckduck.ca/variables — every variable, every edge, every constitutional constraint, searchable and filterable.
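Causal-chain reasoning like the housing_affordability example above reduces to reachability over CAUSES edges. A minimal sketch using the chain quoted above (the graph's actual storage backend is not specified in this post, so plain adjacency lists stand in for it):

```python
from collections import defaultdict, deque

# Example CAUSES chain from the text; the full graph has 3,354 such edges.
EDGES = [
    ("housing_affordability", "homelessness_rate"),
    ("homelessness_rate", "emergency_shelter_cost"),
    ("emergency_shelter_cost", "provincial_budget_deficits"),
]

def downstream(edges, start):
    """All variables reachable from `start` by following CAUSES edges (BFS)."""
    adj = defaultdict(list)
    for src, dst in edges:
        adj[src].append(dst)
    seen, queue = set(), deque([start])
    while queue:
        for nxt in adj[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```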


The ABE Constitutional Authority Framework

The ABE (American Butterfly Effect) framework, originally conceived by Terra Shouse, maps the constitutional landscape that constrains and enables legislative reform:

  • 46 constitutional doctrines (e.g., Division of Powers, Federal Spending Power, Section 35 Aboriginal Rights, Duty to Consult)
  • 63 constitutional provisions from the Constitution Act 1867 and Constitution Act 1982
  • 173 landmark cases from the Supreme Court of Canada (e.g., Haida Nation, Tsilhqot’in, Carter v Canada)
  • 996 CONSTRAINS edges linking doctrines to the variables they govern
  • 932 INTERPRETS edges linking cases to provisions

When the Tribunal evaluates a bill, it queries the ABE framework to determine which constitutional authorities are engaged. A bill proposing federal healthcare conditions must navigate Division of Powers (s.91/92). A bill affecting Indigenous communities must engage Section 35 and the Duty to Consult. The constitutional context is provided to every LLM in every phase.

The constitutional trace is accessible at pond.canuckduck.ca/constitutional.


Quality Controls

Score Normalization

LLMs inconsistently interpret the 0–1 scoring scale. Some sessions produced scores on 0–5 or 0–10 scales. The pipeline auto-detects out-of-range scores and normalizes: if any score exceeds 1.0, all scores in that session are divided by the maximum observed value.
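The normalization rule can be sketched as follows (function name illustrative):

```python
def normalize_scores(scores: list[float]) -> list[float]:
    """If any score in a session exceeds 1.0, rescale the whole session
    by dividing every score by the maximum observed value."""
    peak = max(scores)
    if peak > 1.0:
        return [round(s / peak, 3) for s in scores]
    return scores
```

A session scored on a 0-5 scale thus collapses cleanly into 0-1, while already-valid sessions pass through untouched.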

JSON Parsing Resilience

Local models (qwen3:8b) sometimes produce JSON with literal newlines inside string values, which is technically invalid. The pipeline applies four parsing strategies in sequence: standard parse, trailing comma fix, single-to-double quote conversion, and control character stripping. If all four fail, the pipeline falls back to averaging analyst and challenger scores.
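A minimal sketch of the four-strategy fallback, with fixes applied cumulatively; the regexes are illustrative, and the pipeline's actual implementation may differ:

```python
import json
import re

def parse_llm_json(raw: str):
    """Try four progressively more forgiving parses; None means all failed
    (the caller then falls back to averaging analyst and challenger scores)."""
    fixes = [
        lambda s: s,                                # 1. standard parse
        lambda s: re.sub(r",\s*([}\]])", r"\1", s), # 2. strip trailing commas
        lambda s: s.replace("'", '"'),              # 3. single -> double quotes
        lambda s: re.sub(r"[\x00-\x1f]", " ", s),   # 4. strip control characters
    ]
    cleaned = raw
    for fix in fixes:
        cleaned = fix(cleaned)
        try:
            return json.loads(cleaned)
        except json.JSONDecodeError:
            continue
    return None
```

Strategy 4 handles the literal-newlines-in-strings case: Python's json module rejects raw control characters inside string values, so replacing them with spaces recovers the payload.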

Verdict Derivation

The composite score determines the verdict — not the adjudicator’s text label. Early sessions allowed the adjudicator to override the score-based verdict with a text label, which caused inconsistencies (e.g., a composite of 0.871 labelled “Constructive” because the adjudicator used a different internal scale). This override was removed. The composite is authoritative.

Model Quality Assessment

Three generations of tribunal runs were compared:

  • Original (Sonnet + Flash + qwen3:8b): Baseline scores. Flash produced thin 161-token analyses.
  • Sonnet rerun (Sonnet + Pro + qwen3:8b): Gemini Pro produced 12x more analytical output. Scores shifted moderately.
  • Opus test (Opus + Pro + qwen3:8b): Opus inflated simple bills dramatically (C-218 MAID went from 0.042 to 0.909). Opus was rejected for bill reviews — too agreeable. Retained for the Sovereign Omnibus only (complex synthesis benefits from deeper reasoning).

The production configuration is Claude Sonnet + Gemini Pro + qwen3:8b at 8,192 output tokens per phase.


Cost

The entire system operates at negligible cost:

  • Original 16-bill batch (Sonnet + Flash): $0.55 CAD
  • Sovereign Omnibus (7 sessions including Opus): ~$3.50 CAD
  • Full rerun with Pro models: ~$5.00 CAD
  • qwen3:8b (local GPU): $0.00 (electricity only)
  • Flock debates (16 bills × 50 turns): $0.00 (local GPU)

Total cost of analyzing an entire Parliament’s legislation, synthesizing a unified reform package, validating it through three adversarial sessions, running 16 flock debates, and publishing everything: under $10 CAD.


Limitations

  • The graph is a model, not reality. 407 variables and 3,354 edges encode relationships identified through adversarial AI stress-testing, not empirical measurement. Edge weights are estimates, not measured values. The graph is a reasoning tool, not a predictive simulator.
  • LLMs hallucinate. Every response may contain plausible-sounding claims that are factually incorrect. The adversarial structure mitigates this (the challenger catches the analyst’s errors) but does not eliminate it. All findings should be verified against primary sources.
  • Constitutional analysis is not legal advice. The ABE framework maps constitutional doctrines but does not replace legal analysis by qualified constitutional lawyers. Epistemic certainty scores indicate confidence in the mapping, not legal certainty.
  • Community context is limited. Pond forum discussions and consensus votes represent the CanuckDUCK community, not the Canadian public. Community alignment scores should not be interpreted as public opinion.
  • The system has biases. The Seven Laws of Systemic Rot embed a specific analytical framework that prioritizes root cause intervention, prevention over treatment, and Indigenous sovereignty. Bills that do not engage with these priorities will score low regardless of their merit in other frameworks.

The AI Tribunal is a tool for structured policy analysis, not an oracle. It finds patterns that isolated review misses. It prescribes reforms that the graph validates. But the decision to act on those prescriptions is human, political, and democratic.


Full session transcripts, raw LLM outputs, and the variable explorer are available in the Legislative Analysis section. The analytical infrastructure is open source.
