Alignment Research Center (Page 2)

Formal verification, heuristic explanations and surprise accounting

ARC's current research focus can be thought of as trying to combine mechanistic interpretability and formal verification. If we had a deep understanding of what was going on inside a neural … »

Matrix completion prize results

Earlier this year ARC posted a prize for two matrix completion problems. We received a number of submissions we considered useful, but not any complete solutions. We are closing the contest and awarding … »

ARC is hiring theoretical researchers

The Alignment Research Center’s Theory team is starting a new hiring round for researchers with a theoretical background. Please apply here. Update January 2024: we have paused hiring and expect to reopen … »

Prizes for matrix completion problems

Here are two self-contained algorithmic questions that have come up in our research. We're offering a bounty of $5k for a solution to either of them—either an algorithm, or a … »

Can we efficiently distinguish different mechanisms?

This post is an elaboration on “tractability of discrimination” as introduced in section III of "Can we efficiently explain model behaviors? For an overview of the general plan this fits into, see "Mechanistic anomaly detection" and "Finding gliders in the game of life". … »

Can we efficiently explain model behaviors?

Finding explanations is a relatively unambitious interpretability goal. If it is intractable then that’s an important obstacle to interpretability in general. If we formally define “explanations,” then finding them is a well-posed search problem and there is a plausible argument for tractability. … »

Finding gliders in the game of life

ARC’s current approach to ELK is to point to latent structure within a model by searching for the “reason” for particular correlations in the model’s output. In this post we’ll walk through a very simple example of using this approach to identify gliders in the game of life. … »