Imagine you’re an orphaned eight-year-old whose parents left you a $1 trillion company, with no trusted adult to guide you. You have to hire a smart adult to run that company, guide your life the way a parent would, and administer your vast wealth. You have to hire them based on a work trial or interview that you design. You don’t get to see any resumes or do reference checks. And because you’re so rich, tonnes of people apply — for all sorts of reasons.
Ajeya Cotra argues this peculiar setup resembles the situation humanity finds itself in as we train very general and very capable AI models using current deep learning methods. Ajeya was a senior research analyst at Coefficient Giving at the time of this interview, and she now works at METR (Model Evaluation & Threat Research).
As she explains, this eight-year-old faces a challenging problem. In the candidate pool there are likely some truly nice people, who sincerely want to help and make decisions that are in your interest. But there are probably other characters too — like people who will pretend to care while you’re monitoring them, but intend to exploit the job to enrich themselves as soon as they think they can get away with it.
Like a child trying to judge adults, at some point humans will need to judge the trustworthiness and reliability of machine learning models that are as goal-oriented as people, and greatly outclass them in knowledge, experience, breadth, and speed. Tricky!
Can’t we rely on models' performance during training tasks to guide us? Ajeya worries this won’t work. The trouble is that three different sorts of models will all produce the same output during training, but could behave very differently once deployed in a setting that allows their true colours to come through. She describes three such motivational archetypes:
Saints — models that care about doing what we really want
Sycophants — models that just want us to say they’ve done a good job, even if they get that praise by taking actions they know we wouldn’t want them to
Schemers — models that don’t care about us or our interests at all, who are just pleasing us so long as that serves their own agenda
In principle, a machine learning training process based on reinforcement learning could spit out any of these three attitudes, because all three would perform roughly equally well on the tests we give them, and ‘performs well on tests’ is how these models are selected.
But while that’s true in principle, maybe it’s not something that could plausibly happen in the real world. After all, if we train an agent based on positive reinforcement for accomplishing X, shouldn’t the training process produce a model that just does X and doesn’t have complex thoughts and goals beyond that?
According to Ajeya, this is one thing we don’t know, and should be trying to test empirically as these models get more capable. For reasons she explains in the interview, the Sycophant or Schemer models may in fact be simpler and easier for the learning algorithm to creep towards than their Saint counterparts.
But there are also ways we could end up actively selecting for motivations that we don’t want.
For example, let’s say you train an agentic AI model to run a small business, selecting for behaviours that make money and measuring success by the balance in its bank account. During training, a highly capable model may experiment with the strategy of tricking its trainers into thinking it has made money legitimately when it hasn’t. Maybe instead it steals some money and covers that up. This isn’t a hypothetical worry: models often come up with creative — sometimes undesirable — approaches during training that their developers didn’t anticipate.
If such deception isn’t caught, a model like this may be rated as particularly successful, and the training process will reinforce its tendency to engage in deceptive behaviour. A model that could deceive without being caught would, in effect, have a competitive advantage.
What if deception is picked up, but just some of the time? Would the model then learn that honesty is the best policy? Perhaps. But it might learn a different lesson instead: that deception does pay, as long as it’s done selectively and carefully enough to avoid detection. Would that actually happen? We don’t yet know, but it’s possible.
In this conversation, Ajeya and host Rob Wiblin discuss the above, as well as:
How to predict the motivations a neural network will develop through training
Whether AIs in training will functionally understand that they’re AIs being trained
Stories of AI misalignment that Ajeya doesn’t buy
Analogies for AI, from octopuses to aliens to can openers
Why it’s smarter to have separate ‘planning AIs’ and ‘doing AIs’
The benefits of only following through on AI-generated plans that make sense to human beings
Which approaches for fixing alignment problems Ajeya is most excited about, and which she thinks are overrated
How one might demo actually scary AI failure mechanisms
Learn more and read the full transcript on the 80,000 Hours website.
This episode was originally released in May 2023, but we still think it’s one of the best episodes we have at explaining core risks from power-seeking AI.
Chapters:
Rob’s intro (00:00:00)
The interview begins (00:02:38)
How Ajeya’s views have changed since 2020 (00:05:09)
Are neural networks more like a sped-up version of evolution, or a slower version of human learning? (00:17:42)
Situational awareness (00:26:10)
Misalignment stories Ajeya doesn't buy (00:42:03)
The orphan heir with a trillion-dollar fortune (00:59:14)
Saints, Sycophants, and Schemers (01:03:41)
Ways to train safer AI systems (01:23:20)
Aliens and other analogies (01:38:22)
Moral patienthood (01:53:21)
ARC Evaluations (01:55:35)
Interpretability research (02:09:25)
Rewarding models based on how good and sensible their plans seem to us (02:17:48)
Overrated approaches (02:25:49)
Demos of actually scary alignment failures (02:30:57)
Skills to develop for doing useful work (02:37:23)
Rob’s outro (02:47:24)
Producer: Keiran Harris
Audio mastering: Ryan Kessler and Ben Cordell
Transcriptions: Katy Moore