News from Alignment Forum

Alignment Forum Communities 1 day ago

LLM-Driven Feature Discovery

We would often like to get a qualitative sense of a target model’s behaviors in important distributions (e.g. deployment, RL training, or evals). For example, we…

Alignment Forum Communities 3 days ago

How transparent is DiffusionGemma (and why it matters)

Authors: Joshua Engels*, Callum McDougall*, Bilal Chughtai*, Janos Kramar, Senthoran Rajamanoharan, Cindy Wu, Arthur Conmy, Asic Q Chen, Jean Tarbouriech, Min Ma, Brendan O'Donoghue+, João Gabriel…

Alignment Forum Communities 6 days ago

GDM AI Control Roadmap

GDM has published an AI Control Roadmap! From the executive summary: We present the GDM AI Control Roadmap (v0.1) – our plan for implementing and adopting internal…

Alignment Forum Communities June 16, 2026

Predicting LLM Safety Before Release by Simulating Deployment

Before releasing a new model, labs need to understand not just what it can do, but how it is likely to behave in real-world use,…

Alignment Forum Communities June 16, 2026

Synthetic document finetuning for instilling positive traits

This is the fifth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The fourth…

Alignment Forum Communities June 14, 2026

Why Do Naive SFT Filters For Safety Properties Fail?

This is the fourth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The third…

Alignment Forum Communities June 13, 2026

SFT Drives Gemini’s Safety Properties

This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The second…

Alignment Forum Communities June 12, 2026

Building and evaluating model diffing agents

This is the second in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The first…

Alignment Forum Communities June 12, 2026

Sympathy for both sides of the egregious misalignment debate

On one side of this debate are Yudkowsky & Soares, who think that (if AI progress continues) we’re on a direct path to egregiously-misaligned, scheming, out-of-control,…

Alignment Forum Communities June 11, 2026

Models May Behave Worse When Eval Aware

This is the first in a series of research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. TL;DR: It's often assumed that…
