Early Indicators of Reward Hacking via Reasoning Interpolation
Using importance sampling with fine-tuned donor prefills to predict reward hacking emergence during training
Interim report on ongoing work on reward hacking
Announcing Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs
Adding attention to linear probes
Research update on applying local volume measurement to downstream tasks
In this post, we will study inductive biases of the parameter-function map of random neural networks using star domain volume estimates. This builds on the ideas…
Announcing the Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
Using Product Key Memories to encode sparse coder features
In this post, we show that when two TopK SAEs are trained on the same data, with the same batch order but with different random initializations,…
Using interpretations of SAE latents to simulate activations.