A bird's eye view of ARC's research
Over the last few months, ARC has released a number of pieces of research. While some of these can be independently motivated, there is also a more unified research vision behind them. The purpose of this post is to try to convey some of that vision and how our individual pieces of research fit into it.
Thanks to Ryan Greenblatt, Victor Lecomte, Eric Neyman, Jeff Wu and Mark Xu for helpful comments.
A bird's eye view
To begin, we will take a "bird's eye" view of ARC's research using an interactive diagram.[1] As you "zoom in", more nodes will become visible and the explanation below will update to explain the new nodes.
[1] An arrow in the diagram expresses that solving one problem should help solve another, but it varies from case to case whether subproblems combine "conjunctively" (all subproblems need to be solved to solve the main problem) or "disjunctively" (a solution to any subproblem can be used to solve the main problem).
[2] The term "alignment robustness" comes from this summary of this post, and is synonymous with "objective robustness" in the terminology of this post. A slightly more formal variant is "high-stakes alignment", as defined in this post.
How ARC's research fits into this picture
We will now explain how some of ARC's research fits into the above diagram at the most zoomed-in level. For completeness, we will cover all of ARC's most significant pieces of published research to date, in chronological order. Each piece of work has been labeled with the most closely related node from the diagram, but often also covers nearby nodes and the relationships between them.
Eliciting latent knowledge: How to tell if your eyes deceive you defines ELK, explains its importance for scalable alignment, and covers a large number of possible approaches to ELK. Some of these approaches are somewhat related to heuristic explanations, but most are alternatives that we are no longer pursuing.
Formalizing the presumption of independence lays out the problem of devising a formal notion of heuristic explanations, and makes some early inroads into this problem. It also includes a brief discussion of the motivation for heuristic explanations and the application to alignment robustness and ELK.
Mechanistic anomaly detection and ELK and our other late 2022 blog posts (1, 2, 3) explain the approach to mechanism distinction that we currently find the most promising, mechanistic anomaly detection (MAD). They also cover how mechanism distinction could be used to address alignment robustness and ELK, how heuristic explanations could be used for mechanism distinction, and the feasibility of finding heuristic explanations.
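As a loose illustration of the flavor of MAD (our own toy sketch, not ARC's proposed method, which would rely on heuristic explanations rather than raw activation statistics): fit a simple model of a network's internal activations on trusted inputs, then flag deployment inputs whose activations that model finds surprising. The Gaussian fit, Mahalanobis score and threshold below are all hypothetical choices made for the sketch.

```python
# Toy caricature of mechanistic anomaly detection: flag inputs whose internal
# activations look unlike those collected on a trusted reference set.
import numpy as np

def fit_activation_model(trusted_acts: np.ndarray):
    """Fit a Gaussian over activations gathered on trusted inputs."""
    mean = trusted_acts.mean(axis=0)
    cov = np.cov(trusted_acts, rowvar=False) + 1e-3 * np.eye(trusted_acts.shape[1])
    return mean, np.linalg.inv(cov)

def anomaly_score(act: np.ndarray, mean: np.ndarray, cov_inv: np.ndarray) -> float:
    """Mahalanobis distance of an activation vector from the trusted distribution."""
    diff = act - mean
    return float(diff @ cov_inv @ diff)

rng = np.random.default_rng(0)
trusted = rng.normal(size=(1000, 16))              # activations on trusted inputs
mean, cov_inv = fit_activation_model(trusted)
threshold = np.quantile([anomaly_score(a, mean, cov_inv) for a in trusted], 0.999)

new_act = rng.normal(loc=3.0, size=16)             # input driven by a different "mechanism"
print(anomaly_score(new_act, mean, cov_inv) > threshold)  # expected: True
```

The hope behind MAD proper is to replace the crude Gaussian fit with a heuristic explanation of why the model behaves well on the trusted distribution, and to flag inputs whose behavior is not accounted for by that explanation.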
Formal verification, heuristic explanations and surprise accounting discusses the high-level motivation for heuristic explanations by comparing and contrasting them to formal verification for neural networks (as explored in this paper) and mechanistic interpretability. It also introduces surprise accounting, a framework for quantifying the quality of a heuristic explanation, and presents a draft of empirical work on heuristic explanations.
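In rough terms, surprise accounting scores an explanation by the total number of bits of surprise left over once the explanation is accepted: the surprise of the explanation itself plus the surprise of the behavior given the explanation. The numbers below are invented purely to illustrate the bookkeeping.

```latex
% Surprise accounting, schematically:
\text{total surprise}
  \;=\; \underbrace{S(\text{explanation})}_{\text{cost of the explanation's own claims}}
  \;+\; \underbrace{S(\text{behavior} \mid \text{explanation})}_{\text{surprise left unexplained}}

% Illustrative bookkeeping (made-up numbers): a binary behavior that holds on
% all 1,000 sampled inputs, each a priori a coin flip, costs about 1,000 bits
% if taken as a brute fact; an explanation costing 40 bits that renders the
% behavior unsurprising achieves a total of 40 + 0 = 40 bits and is preferred.
```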
Backdoors as an analogy for deceptive alignment and the associated paper Backdoor defense, learnability and obfuscation discuss a formal notion of backdoors in ML models and some theoretical results about it. This serves as an analogy for the subdiagram Heuristic explanations → Mechanism distinction → Alignment robustness. In this analogy, alignment robustness corresponds to a model being backdoor-free, mechanism distinction corresponds to the backdoor defense, and heuristic explanations correspond to so-called "mechanistic" defenses. The blog post covers this analogy in more depth.
Estimating Tail Risk in Neural Networks lays out the problem of low probability estimation, how it would help with alignment robustness, and possible approaches to LPE based on heuristic explanations. It also presents a draft describing an approach to heuristic explanations based on analytically learning variational autoencoders.
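To state the problem slightly more concretely (our paraphrase): given a model M, an input distribution D, and a formally specified catastrophic behavior C, the goal is to estimate the probability of that behavior even when it is far too small to observe by sampling. The bound sketched below just records why naive Monte Carlo cannot reach that regime.

```latex
p \;=\; \Pr_{x \sim \mathcal{D}}\bigl[\, C(M(x)) \,\bigr],
\qquad
\hat{p}_{\text{naive}} \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\bigl[\, C(M(x_i)) \,\bigr],
\quad x_i \sim \mathcal{D}.

% The naive estimator needs on the order of 1/p samples before it sees even one
% positive example, so for very small p it is useless; instead, estimates must
% be derived from the structure of M, e.g. via a heuristic explanation.
```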
Towards a Law of Iterated Expectations for Heuristic Estimators and the associated paper discuss a possible coherence property for heuristic explanations as part of the search for a formal notion of heuristic explanations. They also provide a semi-formal account of how heuristic explanations could be applied to low probability estimation and mechanism distinction.
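For intuition, the coherence property in question is modeled on the classical law of iterated expectations. Schematically, writing G(Y | π) for a heuristic estimator's estimate of a quantity Y given arguments π (notation only loosely following the paper, which should be consulted for the precise statement), the analogue might read:

```latex
% Classical law of iterated expectations (tower property), for \mathcal{H} \subseteq \mathcal{G}:
\mathbb{E}\bigl[\, \mathbb{E}[Y \mid \mathcal{G}] \,\bigm|\, \mathcal{H} \,\bigr] \;=\; \mathbb{E}[Y \mid \mathcal{H}]

% Schematic analogue for a heuristic estimator, where \pi' is a further argument
% (additional "evidence" presented to the estimator):
\mathbb{G}\bigl(\, \mathbb{G}(Y \mid \pi, \pi') \,\bigm|\, \pi \,\bigr) \;\approx\; \mathbb{G}(Y \mid \pi)
```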
Low Probability Estimation in Language Models and the associated paper Estimating the Probabilities of Rare Outputs in Language Models describe an empirical study of LPE in the context of small transformer language models. The method inspired by heuristic explanations outperforms naive sampling in this setting, but does not outperform methods based on red-teaming (searching for inputs giving rise to the rare behavior), although there remain theoretical cases where red-teaming fails.
Further subproblems
ARC's research can be subdivided further, and we have been putting significant effort into a number of subproblems not explicitly mentioned above. For instance, our work on heuristic explanations includes both formalizing heuristic explanations (devising a formal framework for them) and finding heuristic explanations (designing efficient search algorithms for them). Subproblems of these include:
- Measuring quality: "surprise accounting" offers a potential way to measure the quality of a heuristic explanation, which is important for being able to search for high-quality explanations. However, it is currently an informal framework with many missing details.
- Capacity allocation: it will probably be too challenging to find high-quality explanations for every aspect of a model's behavior. Instead, we can try to tailor explanations towards behaviors with potentially catastrophic consequences. A good loss function for heuristic explanations should push for quality only where it is relevant to the behavior at hand.
- Cherry-picking: if we use a heuristic explanation to estimate something (as in low probability estimation), we need to make sure that the way in which we find the explanation doesn't systematically bias the estimate.
- Form of representation: one form that a heuristic explanation could take is that of an "activation model", i.e. a probability distribution over a model's internal activations (see the sketch after this list). However, we may also need to represent explanations that do not correspond to any particular probability distribution.
- Formal desiderata: we can attempt to formalize heuristic explanations by considering properties that we think they should satisfy, and seeing if those properties can be satisfied.
- No-coincidence principle: in order for heuristic explanations to work in the worst case, we need every possible behavior to be amenable to explanation. We sometimes refer to this desideratum as the "no-coincidence principle" (a term taken from this paper). Counterexamples to this principle could present obstacles to our approach.
- Empirical regularities: some model weights may have no explanation beyond being tuned to match some empirical average, either because the input distribution is defined empirically, or because of an emergent regularity in a formally-defined system (such as the relative value of a queen and a pawn in chess). A good notion of heuristic explanations should be able to deal with these.
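To make the "activation model" idea concrete, here is a minimal sketch (our own illustration, not ARC's formalism): fit a Gaussian distribution over a network's hidden activations, and use it to estimate a tail probability of a linear readout in closed form rather than by sampling. The linear readout, the Gaussian fit and the threshold are all hypothetical stand-ins.

```python
# Toy "activation model": a probability distribution over a network's internal
# activations, used here to estimate a property of the output analytically.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
acts = rng.normal(size=(5000, 32))      # hidden activations collected on some inputs
w, b = rng.normal(size=32), 0.0         # a linear readout producing the model's output

# Fit the activation model (a Gaussian over hidden activations).
mu, cov = acts.mean(axis=0), np.cov(acts, rowvar=False)

# Under this activation model the output w @ acts + b is Gaussian, so the
# probability of exceeding a threshold has a closed form -- no sampling needed.
out_mean = w @ mu + b
out_std = np.sqrt(w @ cov @ w)
threshold = 25.0
print(f"estimated P(output > {threshold}) = {norm.sf(threshold, loc=out_mean, scale=out_std):.2e}")
```

The "form of representation" bullet above is pointing at the harder case: explanations whose content cannot be captured by any single such distribution.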
Conclusion
We have painted a high-level picture of ARC's research, explained how our published research fits into it, and briefly discussed some additional subproblems that we are working on. We hope this provides people with a clearer sense of what we are up to.
Cross-postings for comments: LessWrong, AlignmentForum