Dario Amodei (Anthropic CEO) - Scaling, Alignment, & AI Progress

August 08, 2023 1h 58m

Here is my conversation with Dario Amodei, CEO of Anthropic.

Dario is hilarious and has fascinating takes on what these models are doing, why they scale so well, and what it will take to align them.

Watch on YouTube. Listen on Apple Podcasts, Spotify, or any other podcast platform. Read the full transcript here. Follow me on Twitter for updates on future episodes.

Timestamps

(00:00:00) - Introduction

(00:01:00) - Scaling

(00:15:46) - Language

(00:22:58) - Economic Usefulness

(00:38:05) - Bioterrorism

(00:43:35) - Cybersecurity

(00:47:19) - Alignment & mechanistic interpretability

(00:57:43) - Does alignment research require scale?

(01:05:30) - Misuse vs misalignment

(01:09:06) - What if AI goes well?

(01:11:05) - China

(01:15:11) - How to think about alignment

(01:31:31) - Is modern security good enough?

(01:36:09) - Inefficiencies in training

(01:45:53) - Anthropic’s Long Term Benefit Trust

(01:51:18) - Is Claude conscious?

(01:56:14) - Keeping a low profile

Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe

Full Transcript

A generally well-educated human. That could happen in, you know, two or three years.
What does that imply for Anthropic when in two to three years, these Leviathans are doing like $10 billion training runs? The models, they just want to learn. And it was a bit like a Zen koan.
I listened to this and I became enlightened. The compute doesn't flow, like the spice doesn't flow.
It's like, you can't like, like the blob has to be unencumbered, right? The big acceleration that happened late last year and beginning of this year, we didn't cause that. And honestly, I think if you look at the reaction of Google, that that might be 10 times more important than anything else.
There was a running joke. The way building AGI would look like is, there would be a data center next to a nuclear power plant next to a bunker.
But now it's 2030. What happens next? What are we doing with a superhuman god? Okay.
Today, I have the pleasure of speaking with Dario Amodei, who is the CEO of Anthropic. And I'm really excited about this one.
Dario, thank you so much for coming on the podcast. Thanks for having me.
First question. You have been one of the very few people who have seen scaling coming for years, more than five years.
I don't know how long it's been. But as somebody who's seen it coming, what is fundamentally the explanation for why scaling works? Why is the universe organized such that if you throw big blobs of compute at a wide enough distribution of data, the thing becomes intelligent? I think the truth is that we still don't know.
I think it's almost entirely an empirical fact. You know, I think it's a fact that you could kind of sense from the data and from a bunch of different places.
But I think we don't still have a satisfying explanation for it. If I were to try to make one, but I'm just, I don't know, I'm just kind of waving my hands when I say this.
You know, there's this, there's these ideas in physics around like long tail or power law of like correlations or effects. And so like when a bunch of stuff happens, right, when you have a bunch of like features, you get a lot of the data in like kind of the early, you know, the fat part of the distribution before the tails, you know, for language, this would be things like, oh, I figured out there are parts of speech and nouns follow verbs.
And then there are these more and more and more and more subtle correlations. And so it kind of makes sense why there would be this, you know, every log or order of magnitude that you add, you kind of capture more of the distribution.
What's not clear at all is why does it scale so smoothly with parameters? Why does it scale so smoothly with the amount of data? You can think up some explanations of why it's linear. The parameters are like a bucket, and so the data is like water, and so size of the bucket is proportional to size of the water.
But why does it lead to this very smooth scaling? I think we still don't know. There's all these explanations.
Our chief scientist, Jared Kaplan, did some stuff on like fractal manifold dimension that like you can use to explain it. So there's all kinds of ideas, but I feel like we just don't really know for sure.
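The smooth scaling described above is commonly summarized as a power law, which shows up as a straight line on a log-log plot of loss against model size. Here is a minimal sketch of that idea in Python. The functional form `loss = a * size**(-b)`, the constants, and the data points are all made up for illustration; they are assumptions, not numbers from the conversation or from any real training run, and real fits often include extra terms such as an irreducible loss.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-b), done as a line in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    # log(loss) = log(a) - b * log(size), so a = exp(intercept) and b = -slope.
    return math.exp(intercept), -slope


# Hypothetical constants used only to generate synthetic data.
TRUE_A, TRUE_B = 10.0, 0.08

# Synthetic "model sizes" spanning several orders of magnitude.
sizes = [10 ** k for k in range(6, 11)]
losses = [TRUE_A * n ** (-TRUE_B) for n in sizes]

a, b = fit_power_law(sizes, losses)

# Extrapolate one more order of magnitude, which is the kind of
# prediction that smooth scaling makes possible.
predicted = a * (10 ** 11) ** (-b)
print(f"a={a:.3f}  b={b:.3f}  predicted loss at 1e11: {predicted:.4f}")
```

Each added order of magnitude multiplies the loss by the same factor, which is one way to read the remark that every log you add captures a bit more of the distribution.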
And by the way, for the audience who's trying to follow along, by scaling, we're referring to the fact that you can very predictably see how if you go from GPT-3 to GPT-4, or in this case, Claude 1 to Claude 2, that the loss in terms of whether it can predict the next token scales very smoothly. So, okay, we don't know why it's happening, but can you at least predict empirically, here is the loss at which this ability will emerge, here is the place where this circuit will emerge?
Is that at all predictable? Or are you just looking at the loss number? That is much less predictable. What's predictable is this statistical average, this loss, this entropy.
It's super predictable. It's like, you know, predictable to like, sometimes even to several significant figures, which you don't see outside of physics, right? You don't expect to see it in this messy empirical field.
But actually, specific abilities are very hard to predict. So, you know, back when I was working on GPT-2 and GPT-3, like, when does arithmetic come in place? When do models learn to code? Sometimes it's very abrupt.
You know, it's kind of like you can predict statistical averages of the weather, but the weather on one particular day is very hard to predict. So dumb it down for me.
I don't understand manifolds, but mechanistically, it doesn't know addition yet. Now it knows addition.
What has happened? This is another question that we don't know the answer to. I mean, we're trying to answer this with things like mechanistic interpretability.
But I'm not sure. I mean, you can think about these things about like circuits snapping into place.
Although there is some evidence that when you look at the models being able to add things that, you know, like if you look at its chance of getting the right answer, that shoots up all of a sudden. But if you look at, okay, what's the probability of the right answer? You'll see it climb from like one in a million to one in a hundred thousand to one in a thousand long before it actually gets the right answer.
And so there's some, in many of these cases, at least, I don't know if in all of them, there's some continuous process going on behind the scenes. I don't understand it at all.
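The gap between an abrupt jump in accuracy and a smooth rise in the probability of the right answer can be sketched with a toy softmax. Everything here is a hypothetical illustration: the logit values, the three distractor answers, and the assumption that the correct answer's logit grows steadily across checkpoints are all made up, not measurements from any model.

```python
import math


def correct_answer_probability(correct_logit, distractor_logits):
    """Softmax probability assigned to the correct answer."""
    logits = [correct_logit] + list(distractor_logits)
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    return exps[0] / sum(exps)


def greedy_is_correct(correct_logit, distractor_logits):
    """Greedy decoding picks the highest logit, so it is right only past the crossover."""
    return correct_logit > max(distractor_logits)


# Hypothetical wrong answers with fixed logits.
distractors = [0.0, -1.0, -2.0]

# Hypothetical checkpoints: the correct answer's logit rises steadily from -14 to +2.
checkpoints = [-14.0 + 2.0 * i for i in range(9)]

rows = []
for logit in checkpoints:
    p = correct_answer_probability(logit, distractors)
    accuracy = 1.0 if greedy_is_correct(logit, distractors) else 0.0
    rows.append((logit, p, accuracy))
    print(f"logit={logit:+6.1f}  p(correct)={p:.2e}  greedy accuracy={accuracy:.0f}")
```

In this toy setup the probability of the correct answer climbs by orders of magnitude while greedy accuracy stays at zero, and accuracy only flips at the last checkpoint. That matches the pattern described above, where a continuous process sits behind what looks like a sudden ability.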
Does that imply that the circuit or the process for doing addition was preexisting and it just got increased in salience? I don't know if like there's this circuit that's weak and getting stronger. I don't know if it's something that works, but not very well.
Like I think we don't know. And these are some of the questions we're trying to answer with mechanistic interpretability.
Are there abilities that won't emerge with scale? So I definitely think that again, like things like alignment and values are not guaranteed to emerge with scale, right? It's kind of like, you know, one way to think about it is you train the model and it is basically, it's like predicting the world. It's understanding the world.
Its job is facts, not values, right? It's trying to predict what comes next. But there's free variables here, where it's like, what should you do? What should you think? What should you value? There just aren't the bits for that. There's just like, well, if I started with this, I should finish with this. If I started with this other thing, I should finish with this other thing. And so I think that's not going to emerge.
I want to talk about alignment in a second, but on scaling, if it turns out that scaling plateaus before we reach human-level intelligence, looking back on it, what would be your explanation? What do you think is likely to be the case if that turns out to be the outcome? Yeah. So I guess I would distinguish some problem with the fundamental theory from some practical issue.
So one practical issue we could have is we could run out of data. For various reasons, I think that's not going to happen.
But if you look at it very naively, we're not that far from running out of data. And so it's like we just don't have the data to continue the scaling curves.
I think another way it could happen is like, oh, we just use up all of our compute that was available and that wasn't enough. And then progress is slow after that.
I wouldn't bet on either of those things happening, but they could. I think from a fundamental perspective, personally, I think it's very unlikely that the scaling laws will just stop.
If they do, another reason, again, this isn't fully fundamental, could just be we don't have quite the right architecture. Like if we tried to do it with an LSTM or an RNN, the slope would be different.
It still might be that we get there, but I think there are some things that are just very hard to represent when you don't have this ability to attend far in the past that transformers have. If somehow, and I don't know how we would know this, it kind of wasn't about the architecture and we just hit a wall, I think I'd be very surprised by that.
I think we're already at the point where the things the models can't do don't seem to me to be different in kind from the things they can do. And it just, you know, you could have made a case a few years ago that it was like, they can't reason, they can't program.
Like you could have drawn boundaries and said, well, maybe you'll hit a wall. I didn't think we would hit a wall. A few other people didn't think we would hit a wall. But it was a more plausible case then. I think it's a less plausible case now.
Now, it could happen. Like this stuff is crazy. It could happen tomorrow that it's just like, we hit a wall. I think if that happens, I'm trying to think of, it's unlikely, but what would really be my explanation? I think my explanation would be there's something wrong with the loss when you train on next word prediction.
Like some of the remaining like reasoning abilities or something like that. Like if you really want to learn, you know, to program at a really high level.
Like it means you care about some tokens much more than others. And they're rare enough that it's like the loss function over focuses on kind of the, the, the appearance, the things that are responsible for the most bits of entropy.
Uh, and instead, you know, they don't focus on this stuff that's really essential. And so you could kind of have the signal drowned out in the noise.
I don't think it's going to play out that way for a number of reasons. But if you told me, yep, you trained your 2024 model, it was much bigger and it just wasn't any better, and you tried every architecture and it didn't work, I think that's the explanation I would reach for.
Is there a candidate for another loss function if you had to abandon next token prediction? I think then you would have to go for some kind of RL. And again, there's, you know, there's many different kinds.
There's RL from human feedback. There's RL against an objective.
There's things like constitutional AI. There's things like amplification and debate, right? These are kind of both alignment methods and ways of training models.
You would have to try a bunch of things, but the focus would have to be on what do we actually care about the model doing, right? In a sense, we're a little bit lucky that it's like, predict the next word gets us all these other things we need, right? There's no guarantee.
It seems like from your worldview, there's a multitude of different loss functions, that it's just a matter of what can allow you to just throw a whole bunch of data at it. Like the next token prediction itself is not significant.
Yeah. Well, I mean, I guess the thing with RL is you get slowed down a bit because it's like, you know, you have to by some method kind of, you know, design how the loss function works. The nice thing with the next token prediction is it's there for you, right? It's just there. It's the easiest thing in the world. And so I think it would slow you down if you couldn't scale in just that very simplest way.
You mentioned that the data is likely not to be the constraint. Why do you think that is the case? There's various possibilities here.
And, you know, for a number of reasons, I shouldn't go into the details, but, you know, like there's many sources of data in the world and there's many ways that you can also generate data. My guess is that this will not be a blocker.
Maybe it'd be better if it was, but it won't be. Are you talking about multimodal or? There's just many different ways to do it.
How did you form your views on scaling? How far back can we go? And then you would be basically saying something similar to this. This view that I have probably formed gradually from, I would say like 2014 to 2017.
So I think my first experience with it was my first experience with AI. So I, you know, I saw some of the early stuff around AlexNet in 2012, always kind of had wanted to study intelligence.
But I, you know, before I was just like, this isn't really working. Like, it doesn't seem like it's actually working.
You know, all the way back to like, you know, 2005, I'd like, you know, I'd read Ray Kurzweil's work, you know, I'd read even some of like Eliezer's work on the early internet back then. And I was like, oh, this stuff kind of looks far away.
Like I look at the AI stuff of today and it's like not anywhere close. But with AlexNet I was like, oh, this stuff is actually starting to work.
So I joined Andrew Ng's group initially at Baidu. And the first task, you know, that I got set to do, right, it was my, you know, I'd been in a different field.
And so I first joined, you know, this was my first experience with AI. And it was a bit different from a lot of the kind of academic style research that was going on kind of elsewhere in the world, right? I think I kind of got lucky in that the task that was given to me and the other folks there was just make the best speech recognition system that you can.
And there was a lot of data available. There were a lot of GPUs available.
So it kind of, it posed the problem in a way that was amenable to discovering that kind of scaling was a solution, right? That's very different from like, you're a postdoc and it's your job to come up with, you know, what's an idea that seems clever and new and makes your mark as someone who's invented something. And so I just quickly discovered that, like, you know, I just tried the simplest experiments.
I was like, you know, just fiddling with some dials. I was like, okay, try adding more layers to the RNN. You know, try training it for longer. What happens? How long does it take to overfit? What if I add new data and repeat it less times? And like, I just saw these like very consistent patterns.
I didn't really know that this was unusual or that others weren't thinking in this way. This was just kind of like, almost like beginner's luck. It was my first experience with it.
And I didn't really think about it beyond speech recognition, right? You know, I was just kind of like, oh, this is, you know, I don't know anything about this field. There are zillions of things people do with machine learning.
But like, I'm like, weird, this seems to be true in the speech recognition field. And then I think it was just before OpenAI started that I met Ilya, who you interviewed. One of the first things he said to me was, look, the models, they just want to learn.
You have to understand this. The models, they just want to learn.
And it was a bit like a Zen koan. I listened to this and I became enlightened. And, you know, over the years after this, again, I would be kind of the one who would formalize a lot of these things and kind of put them together, but what that told me is that that phenomenon that I'd seen wasn't just some random thing that I'd seen.
It was like it was broad. It was more general, right? The models just want to learn.
You get the obstacles out of their way, right? You give them good data. You give them enough space to operate in.
You don't do something stupid like condition them badly numerically. And they want to learn.
They'll do it. They'll do it.
You know, what I find really interesting about what you said is there were many people who, back at that time, probably weren't working on it directly, but were aware that these things are really good at speech recognition or at playing these constrained games. Very few extrapolated from there like you and Ilya did to something that is generally intelligent.
What was different about the way you were thinking about it versus how others think that you went from like, it's getting better at speech in this consistent way, it will get better at everything in this consistent way? Yeah. So I genuinely don't know.
I mean, at first, when I saw it for speech, I assumed this was just true for speech or for this narrow class of models. I think it was just over the period between 2014 and 2017, I tried it for a lot of things and saw the same thing over and over again.
I watched the same being true with Dota. I watched the same being true with robotics, which many people thought of as a counterexample, but I just thought, well, it's hard to get data for robotics.
But if we operate within, if we look within the data that we have, we see the same patterns. And so I don't know.
I think people were very focused on solving the problem in front of them. Why one person thinks one way, another person thinks, it's very hard to explain.
I think people just see it through a different lens, you know, are looking like vertically instead of horizontally. They're not thinking about the scaling.
They're thinking about how do I solve my problem? And well, for robotics, there's not enough data. And so, you know, that can easily abstract to, well, scaling doesn't work because we don't have the data. And so I don't know.
I just, for some reason, and it may just, it may just have been random chance, was obsessed with that particular direction. When did it become obvious to you that language is the means to just feed a bunch of data into these things that, or was it just, you ran out of other things like robotics, there's not enough data.
This other thing, there's not enough data. Yeah.
I mean, I think this whole idea of like the next word prediction that you could do self-supervised learning, you know, that together with the idea that it's like, wow, for predicting the next word, there's so much richness and structure there, right? You know, it might say two plus two equals and you have to know the answer is four. And, you know, it might be telling the story about a character.
And then basically it's posing to the model, you know, the equivalent of these developmental tests that get posed to children. You know, Mary walks into the room and, you know, puts an item in there.
And then, you know, Chuck walks into the room and removes the item and Mary doesn't see it. What does Mary think happened? You know, so like, so the models are going to have to get this right in the service of predicting the next word, they're going to have to solve, you know, solve all these theory of mind problems, solve all these math problems.
And so I, you know, I, my thinking was just, well, you know, you scale it up as much as you can. You, you, you know, there's, there's kind of no limit to it.
And I think I kind of had abstractly that view, but the thing, of course, that like really solidified and convinced me was the work that Alec Radford did on GPT-1, which was not only could you get this language model that could predict things, but also you could fine tune it. You needed to fine tune it in those days to do all these other tasks.
And so I was like, wow, you know, this isn't just some narrow thing where you get the language model, right? It's sort of halfway to everywhere, right? It's like, you know, you get the language model, right? And then with a little move in this direction, it can, you know, it can solve this, this, you know, logical dereference test or whatever.

And, you know, with this other thing, you know, it can solve translation or something. And then you're like, wow, I think there's really something to it.
And of course we can really scale it.
Well, one thing that's confusing or that would have been hard to see, if you told me in 2018, we'll have models in 2023, like Claude 2, that can write theorems in the style of Shakespeare, whatever theory you want. They can ace standardized tests with open-ended questions, you know, just all kinds of really impressive things. At that time, I would have said, oh, you have AGI.
You clearly have something that is a human-level intelligence. While these things are impressive, it clearly seems we're not at human level, at least in the current generation and potentially for generations to come. What explains this discrepancy between super impressive performance in these benchmarks and in just like the things you could describe versus, yeah, generally.
So that was one area where actually I was not prescient and I was surprised as well. Yeah. So when I first looked at GPT-3 and, you know, more so the kind of things that we built in the early days at Anthropic, my general sense was, you know, I looked at these and I'm like, it seems like they really grasp the essence of language. I'm not sure how much we need to scale them up. Like maybe what's more needed from here is like RL and kind of all the other stuff. Like we might be kind of near the, you know, I thought in 2020, like we can scale this a bunch more, but I wonder if it's more efficient to scale it more or to start adding on these other objectives, like RL.
I thought maybe if you do as much RL as you've done pre-training for a 2020 style model, that that's the way to go, and scaling it up will keep working. But is that really the best path? And I think it, I don't know, it just keeps going.
I thought it had understood a lot of the essence of language, but then, you know, there's kind of further to go. And so I don't know, stepping back from it, like one of the reasons why I'm sort of very empiricist about AI, about safety, about organizations is that you often get surprised, right? I feel like I've been right about some things, but I've still, with these theoretical pictures in my head, been wrong about most things.
Being right about 10% of the stuff sets you head and shoulders above many people. If you look back to, I can't remember who it was, kind of made these diagrams that are like, here's the village idiot, here's Einstein, here's the scale of intelligence, right? And the village idiot and Einstein are like very close to each other.
Like that, maybe that's still true in some abstract sense or something, but it's not really what we're seeing, is it? We're seeing like that it seems like the human range is pretty broad and doesn't, we don't hit the human range in the same place or at the same time for different tasks, right? Like, you know, like, write a sonnet, you know, in the style of Cormac McCarthy or something, like, I don't know, I'm not very creative, so I couldn't do that. But like, you know, that's a pretty high level human skill, right? And even the model is starting to get good at stuff of, you know, like constrained writing, you know, there's like write a, you know, write a page without using the letter E or something, write a page about X without using the letter E.
Like, I think the models might be like superhuman or close to superhuman at that. But when it comes to, you know, I don't know, prove relatively simple mathematical theorems, like they're just starting to do the beginning of it.
They make really dumb mistakes sometimes. And they really lack any kind of broad, like, you know, correcting your errors or doing some extended task.
And so I don't know, it turns out that intelligence isn't a spectrum. There are a bunch of different areas of domain expertise.
There are a bunch of different kinds of skills, like memory is different. I mean, it's all formed in the blob.
It's all formed in the blob. It's not complicated, but to the extent it even is on the spectrum, the spectrum is also wide.
If you asked me 10 years ago, that's not what I would have expected at all. But I think that's very much the way it's turned out.
Oh, man, I have so many questions just as follow up on that. One is, do you expect that given the distribution of training that these models get from massive amounts of internet data versus what humans got from evolution, that the repertoire of skills that elicits will be just barely overlapping? Will it be like concentric circles? How do you think about, do those matter?
Clearly there's a large amount of overlap, right? Because a lot of the things, you know, like these models have business applications and many of their business applications are doing things that, you know, are helping humans to be more effective at things.
So the overlap is quite large. And, you know, if you think of all the activity that humans put on the internet in text, that covers a lot of it.
But it probably doesn't cover some things. Like the models, I think they do learn a physical model of the world to some extent, but they certainly don't learn how to actually move around in the world.
Again, maybe that's easy to fine tune. But I, you know, I think so. I think there are some things that the models don't learn that humans do.
And then I think, you know, the models learn, for example, to speak fluent Base64. I don't know about you, but I never learned that. Right.
How likely do you think it is that these models will be superhuman for many years at economically valuable tasks while they are still below humans in many other relevant tasks that prevents like an intelligence explosion or something? I think this kind of stuff is like really hard to know.
So I'll give that caveat that like, you know, again, like the basic scaling laws you can kind of predict. And then like this more granular stuff, which we really want to know to know how this all is going to go, is much harder to know.
But my guess would be the scaling laws are going to continue. You know, again, subject to, you know, do people slow down for safety or for regulatory reasons? But, you know, let's just put all that aside and say, like, we have the economic capability to keep scaling.
If we did that, what would happen? And I think my view is we're going to keep getting better across the board. And I don't see any area where the models are, like, super, super weak or not starting to make progress.
Like, that used to be true of, like, math and programming. But I think over the last six months, you know, the 2023 generation of models compared to the 2022 generation has started to learn that.
There may be more subtle things we don't know. And so I kind of suspect, even if it isn't quite even, that the rising tide will lift all the boats.
Does that include the thing you were mentioning earlier where if there's an extended task, it kind of loses its train of thought or its ability to just like execute a series of steps? So I think that that's going to depend on things like RL training to have the model do longer horizon tasks.
I don't expect that to require a substantial amount of additional compute. I think that that was probably an artifact of, yeah, kind of thinking about RL in the wrong way and underestimating how much the model had learned on its own.
In terms of, you know, are we going to be superhuman in some areas and not others? I think it's complicated. I could imagine that we won't be superhuman in some areas because, for example, they involve like embodiment in the physical world.
And then it's like, what happens? Like do the AIs help us train faster AIs and those faster AIs wrap around and solve that? Do you not need the physical world? It depends what you mean. Are we worried about an alignment disaster? Are we worried about misuse, like making weapons of mass destruction? Are we worried about the AI taking over research from humans? Are we worried about it reaching some threshold of economic productivity where it can do what the average...
These different thresholds, I think, have different answers. Although I suspect they will all come within a few years.
Let me ask about those thresholds. So if Claude was an employee at Anthropic, what salary would it be worth? Is it like meaningfully speeding up AI progress? It feels to me like an intern in most areas, but then some specific areas where it's better than that. Again, I think one thing that makes the comparison hard is like the form factor is kind of like not the same as a human, right? Like a human, like, you know, if you were to behave like one of these chatbots, like we wouldn't really, I mean, I guess we could have this conversation.
It's like, but you know, they're, they're not really, they're more designed to answer single or a few questions. Right.
And, and like, you know, they don't have a, the concept of having a long life of prior experience. Right.
We're talking here about, you know, things that, that I've experienced in the past. Right.
And chatbots don't have that. And so there's all kinds of stuff missing.
And so it's hard to make a comparison. But I don't know.
They feel like interns in some areas and kind of then they have areas where they spike and are really savants where they may be better than anyone here.
But does the overall picture of something like an intelligence explosion, you know, my former guest is Carl Shulman and he has this like very detailed model of intelligence. Does that, as somebody who would actually like see that happening, does that make sense to you as they go from interns to entry-level software engineers, those entry-level software engineers increase your productivity? I think the idea that the AI systems become more productive, and first they speed up the productivity of humans, then they kind of equal the productivity of humans.
And then they're, in some meaningful sense, the main contributor to scientific progress, that that happens at some point. I think that basic logic seems likely to me, although I have a suspicion that when we actually go into the details, it's going to be kind of like weird and different than we expect, that all the detailed models are kind of, you know, we're thinking about the wrong things or we're right about one thing and then are wrong about 10 other things. And so I don't know, I think we might end up in like a weirder world than we expect.
When you add all this together, like your estimate of when we get something kind of human level, what does that look like? I mean, again, it depends on the thresholds. You know, in terms of someone looks at the model, and, you know, even if you talk to it for an hour or so, it's basically like a generally well-educated human. Yeah.
That could be not very far away at all, I think. Like, that could happen in, you know, two or three years.
Like, you know, if I look if i look at again like i think the main thing that would stop it would be if if we hit certain certain you know and we have internal tests for you know safety thresholds and stuff like that so if a company or the industry decides to slow down or you know we're able to get the government institute restrictions that kind of uh you know that that moderate the rate of for safety reasons. That would be the main reason it wouldn't happen.
But if you just look at the logistical and economic ability to scale, I don't think we're very far at all from that. Now, that may not be the threshold where the models are existentially dangerous.
In fact, I suspect it's not quite there yet. It may not be the threshold where the models can take over most AI research. It may not be the threshold where the models seriously change how the economy works. I think it gets a little murky after that, and all those thresholds may happen at various times after that. But I think in terms of the base technical capability of it, it kind of sounds like a reasonably generally educated human across the board. I think that could be quite close.
Why would it be the case that it could, you know, pass a Turing test for an educated person, but not be able to contribute or substitute for human involvement in the economy?

A couple reasons. One is just, you know, that the threshold of skill isn't high enough, right? Comparative advantage.
It's like, it doesn't matter that, you know, I have someone who's better than the average human at every task. Like, what I really need for AI research is, you know, I need to basically find something that is strong enough to substantially accelerate the labor of the thousand experts who are best at it.
And so we might reach a point where, you know, the comparative advantage of these systems is not great. Another thing that could be the case is that I think there are these kind of mysterious frictions that, you know, kind of don't show up in naive economic models.
But you see it whenever, you know, you go to a customer or something, and you're like, hey, I have this cool chatbot. In principle, it can do everything that, you know, your customer service bot does, or that this part of your company does.
But there's the actual friction of like, how do we slot it in? How do we make it work? That includes both, you know, just the question of how it works in a human sense within the company, like, how things happen in the economy and overcome frictions, and also just like, what is the workflow? How do you actually interact with it? It's very different to say, here's a chatbot that kind of looks like it's doing this task, or helping the human to do some task, than it is to say like, okay, this thing is deployed and 100,000 people are using it.
Often, like right now, lots of folks are rushing to deploy these systems. But I think in many cases, they're not using them in anywhere close to the most efficient way that they could.
You know, not because they're not smart, but because it takes time to work these things out. And so I think when things are changing this fast, there are going to be all of these frictions.
Yeah. And I think, again, this is messy reality that doesn't quite get captured in the model.
I don't think it changes the basic picture. Like, I don't think it changes the idea that we're building up this snowball of like, the models help the models get better and, you know, can accelerate what the humans do.
And eventually, it's mostly the models doing the work. Like, you zoom out far enough, that's happening.
But I'm kind of skeptical of any kind of precise mathematical or exponential prediction of how it's going to be. I think it's all going to be a mess. But I think what we know is it's not a metaphorical exponential and it's going to happen fast.

How do those different exponentials net out, which we've been talking about? So one was the scaling laws themselves are power laws with decaying marginal, you know, loss per, you know, parameter or something. The other exponential you talked about is, well, these things can get involved in the process of AI research itself, speeding it up.
So those two are sort of opposing exponentials. Does it net out to be superlinear or sublinear? And also you mentioned, well, the distribution of intelligence might just be broader.
So should we expect after we get to this point in two to three years, it's like, whoom, whoom? Like, what does that look like?

I mean, I think it's very unclear, right? So we're already at the point where if you look at the loss, the scaling laws are starting to bend. I mean, we've seen that in published model cards offered by multiple companies.
So that's not a secret at all. But as they start to bend, each little bit of entropy, right, of accurate prediction becomes more important, right? Maybe these last little bits of entropy are like, well, you know, this is a physics paper as Einstein would have written it, as opposed to, you know, as some other physicist would have written it.
And so it's hard to assess significance from this. It certainly looks like in terms of practical performance, the metrics keep going up relatively linearly, although they're always unpredictable.
So it's hard to see that. And then, I mean, the thing that I think is driving the most acceleration is just more and more money is going into the field.
Like people are seeing that there's just a huge amount of, you know, economic value. And so I expect the amount of money spent on the largest models to go up by like a factor of a hundred or something.
And for that to then be concatenated with the chips are getting faster, the algorithms are getting better, because there's so many people working on this now.
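
As an aside on the "power laws with decaying marginal loss" point raised in the question above, here is a minimal sketch of what that shape looks like. It assumes a pure power law in parameter count, L(N) = (n_c / N) ** alpha; the function names and the constants `n_c` and `alpha` are made up for illustration and are not fitted to any real model or taken from the conversation.

```python
# Illustrative sketch only: the functional form and constants are assumptions,
# not measurements from any real model.

def power_law_loss(n_params: float, n_c: float = 1e13, alpha: float = 0.08) -> float:
    """Loss under an assumed pure power law L(N) = (n_c / N) ** alpha."""
    return (n_c / n_params) ** alpha


def gain_from_10x(n_params: float) -> float:
    """How much the loss drops when the parameter count is multiplied by 10."""
    return power_law_loss(n_params) - power_law_loss(10 * n_params)


# Each successive 10x in size buys a smaller absolute reduction in loss.
sizes = [1e8, 1e9, 1e10, 1e11]
gains = [gain_from_10x(n) for n in sizes]

for n, g in zip(sizes, gains):
    print(f"{n:.0e} -> {10 * n:.0e} params: loss falls by {g:.4f}")
```

Under these assumed numbers every tenfold increase still lowers the loss, but by less each time, which is the sense in which the marginal return per parameter decays.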

And so, again, I mean, you know, I'm not making a normative statement here, that this is what should happen.
I'm not even saying this necessarily will happen because I think there's important safety and government questions here, which we're very actively working on. I'm just saying, like, left to itself, this is what the economy is going to do.
We'll get to those questions in a second. But how do you think about the contribution of Anthropic to that increasing in the scope of this industry? I mean, there's an argument that, listen, with that investment, we can work on safety stuff at Anthropic.
Another that says you're raising the salience of this field in general.

Yeah, I mean, it's all costs and benefits, right? The costs are not zero, right? So I think a mature way to think about these things is not to deny that there are any costs, but to think about what the costs are and what the benefits are.
I think we've been relatively responsible in the sense that the big acceleration that happened late last year and beginning of this year, we didn't cause that. We weren't the ones who did that.
And honestly, I think if you look at the reaction of Google, that might be 10 times more important than anything else. And then once it had happened, once the ecosystem had changed, then we did a lot of things to kind of stay on the frontier.
And so, I don't know, I mean, it's like any other question, right? You're trying to do the things that have the lowest costs and the biggest benefits. And, you know, that causes you to have different strategies at different times.

One question I had for you while we were talking about the intelligence stuff was, listen, as a scientist yourself, what do you make of the fact that these things have basically the entire corpus of human knowledge memorized? And as far as I'm aware, they haven't been able to make like a single new connection that has led to a discovery. Whereas if even a moderately intelligent person had this much stuff memorized, they'd notice, oh, this thing causes this symptom.
This other thing also causes this symptom. You know, there's a medical cure right here, right? Shouldn't we be expecting that kind of stuff?

I'm not sure.
I mean, I think, you know, I don't know, these words, discovery, creativity, like, it's one of the lessons I've learned is that in, you know, kind of the big blob of compute, these ideas often end up being kind of fuzzy and elusive and hard to track down. But I think there is something here, which is, I think the models do display a kind of ordinary creativity. Again, you know, the kind of like, write a sonnet in the style of Cormac McCarthy or Barbie or something, you know, like there is some creativity to that.
And I think they do draw, you know, new connections of the kind that an ordinary person would draw. I agree with you that there haven't been any kind of like, I don't know, like I would say like big scientific discoveries.
I think that's a mix of like, just the model skill level is not high enough yet, right? Like I was on a podcast last week where the host said, I don't know, I play with these models. They're kind of mid, right? Like they get, you know, they get a B or a B minus or something.
And that I think is going to change with the scaling. I do think there's an interesting point about, well, the models have an advantage, which is they know a lot more than us.
You know, like should they have an advantage already, even if their skill level isn't quite high? Maybe that's kind of what you're getting at. I don't really have an answer to that.
I mean, it seems certainly like memorization and facts and drawing connections is an area where the models are ahead. And I do think maybe you need those connections and you need a fairly high level of skill.
I do think, particularly in the area of biology, for better and for worse, the complexity of biology is such that the current models know a lot of things right now. And that's what you need to make discoveries and draw connections.
It's not like physics where you need to, you know, think and come up with a formula. In biology, you need to know a lot of things.
And so I do think the models know a lot of things and they have a skill level that's not quite high enough to put them together. And I think they are just on the cusp of being able to put these things together.

On that point, last week in your Senate testimony, you said that these models are two to three years away from potentially enabling large-scale bioterrorism attacks or something like that. Can you make that more concrete without obviously giving the kind of information that would...
But is it like one-shotting how to weaponize something? Or do you have to fine-tune an open-source model? Like what would that actually look like?

I think it'd be good to clarify this because we did a blog post and the Senate testimony. And like, I think various people kind of didn't understand the point or didn't understand what we'd done.
So I think today, and of course in our models, we try and prevent this, but there's always jailbreaks. You can ask the models all kinds of things about biology and get them to say all kinds of scary things.
Yeah. But often those scary things are things that you could Google.
And I'm therefore not particularly worried about that. I think it's actually an impediment to seeing the real danger, where, you know, someone just says, oh, I asked this model to tell me some things about smallpox and it will. That is actually, you know, kind of not what I'm worried about.
So we spent about six months working with basically some of the folks who are the most expert in the world on how biological attacks happen. You know, what would you need to conduct such an attack, and how do we defend against such an attack? They worked very intensively on just the entire workflow of if I were trying to do a bad thing. It's not one shot. It's a long process. There are many steps to it. It's not just like I asked the model for this one page of information.

And again, without going into any detail, the thing I said in the Senate testimony is like there are some steps where you can just get information on Google. There are some steps that are what I'd call missing.
They're scattered across a bunch of textbooks, or they're not in any textbook. They're kind of implicit knowledge.
And they're not really like, they're not explicit knowledge. They're more like, I have to do this lab protocol.
And like, what if I get it wrong? Oh, if this happens, then my temperature was too low. If that happened, I needed to add more of this particular reagent.
What we found is that for the most part, those key missing pieces, the models can't do them yet. But we found that sometimes they can.
And when they can, sometimes they still hallucinate, which is the thing that's kind of keeping us safe. But we saw enough signs of the models doing those key things well.
And if we look at state-of-the-art models and go backwards to previous models, we look at the trend, it shows every sign of two or three years from now, we're going to have a real problem.

Yeah. Especially the thing you mentioned on the log scale, you go from like one in a hundred times it gets it right, to one in 10, to...

Exactly. So, you know, I've seen many of these like groks in my life, right? I was there, I watched when GPT-3 learned to do arithmetic, when GPT-2 learned to do regression a little bit above chance, when, you know, with Claude, we got better on like all these tests of helpful, honest, harmless.
I've seen a lot of groks. This is unfortunately not one that I'm excited about, but I believe it's happening.

So somebody might say, listen, you were a co-author on this post that OpenAI released about GPT-2 where they said, you know, we're not going to release the weights or the details here because we're worried that this model will be used for something, you know, bad. And looking back on it, now it's laughable to think that GPT-2 could have done anything bad.
Are we just like way too worried? This is a concern that doesn't make sense for...

It is interesting.
It might be worth looking back at the actual text of that post. So I don't remember it exactly, but, you know, it's still up on the internet.
It says something like, you know, we're choosing not to release the weights because of concerns about misuse. But it also said, this is an experiment.
We're not sure if this is necessary or the right thing to do at this time, but we'd like to establish a norm of thinking carefully about these things. You know, you could think of it a little like the, you know, the Asilomar conference in the 1970s, right? Where it's like, you know, they were just figuring out recombinant DNA.
You know, it was not necessarily the case that someone could do something really bad with recombinant DNA. It's just the possibilities were starting to become clear.
Those words, at least, were the right attitude. Now, I think there's a separate thing that like, you know, people don't just judge the post.
They judge the organization. Is this an organization that, you know, produces a lot of hype or that has credibility or something like that? And so I think that had some effect on it.
I guess you could also ask, like, is it inevitable that people would just interpret it as like, you know, you can't get across any message more complicated than this thing right here is dangerous. So you can argue about those.
But I think the basic thing that was in my head and the head of others who were involved in that, and I think what is evident in the post, is like, we actually don't know. We have pretty wide error bars on what's dangerous and what's not.
So, you know, we want to establish a norm of being careful. I think, by the way, we have enormously more evidence.
We've seen enormously more of these groks now. And so we're well calibrated, but there's still uncertainty, right? In all these statements, I've said like, in two or three years, we might be there, right? There's a substantial risk of it.
And we don't want to take that risk. But, you know, I wouldn't say it's 100%.
It could be 50-50.

Okay, let's talk about cybersecurity, which in addition to bio risk is another thing Anthropic has been emphasizing. How have you kept the Claude microarchitecture from leaking? Because as you know, your competitors have been less successful at this kind of security.

Can't comment on anyone else's security. Don't know what's going on in there. A thing that we have done is, you know, so there are these architectural innovations, right, that make training more efficient. We call them compute multipliers because they're the equivalent of, you know, they're like having more compute. Our compute multipliers, again, I don't want to say too much about it because it could allow an adversary to counteract our measures, but we limit the number of people who are aware of a given compute multiplier to those who need to know about it.
And so there's a very small number of people who could leak all of these secrets. There's a larger number of people who could leak one of them.
But this is the standard compartmentalization strategy that's used in the intelligence community or resistance cells or whatever. So over the last few months, we implemented these measures.
So, you know, I don't want to jinx anything by saying, oh, this could never happen to us. But I think it would be harder for it to happen.
I don't want to go into any more detail. And, you know, by the way, I'd encourage all the other companies to do this as well.
As much as like competitors' architectures leaking is narrowly helpful to Anthropic, it's not good for anyone in the long run, right?
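
The compartmentalization point above (few people could leak everything, more people could leak one thing) can be made concrete with a small sketch. Everything here is hypothetical: the secret names, the people, and the access lists are invented for illustration and say nothing about how any real organization is set up.

```python
# Hypothetical need-to-know lists: secret name -> people who know it.
# All names below are invented for illustration.
access = {
    "multiplier_a": {"alice", "bob"},
    "multiplier_b": {"alice", "carol"},
    "multiplier_c": {"alice", "dave", "erin"},
}


def knows_any(access_map: dict[str, set[str]]) -> set[str]:
    """People who could leak at least one secret (union of all lists)."""
    return set().union(*access_map.values())


def knows_all(access_map: dict[str, set[str]]) -> set[str]:
    """People who could leak every secret (intersection of all lists)."""
    return set.intersection(*access_map.values())


print("could leak at least one:", sorted(knows_any(access)))
print("could leak everything:  ", sorted(knows_all(access)))
```

In this toy setup the group that could leak everything is much smaller than the group that could leak something, which is the property the strategy is meant to buy.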

So security around this stuff is really important.

Even with all the security you have, could you, with your current security, prevent a dedicated state-level actor from getting the Claude 2 weights?

It depends how dedicated, is what I would say.
Our head of security, who used to work on security for Chrome, which, you know, is a very widely used and attacked application, he likes to think about it in terms of how much would it cost to attack Anthropic successfully. Again, I don't want to go into super detail of how much I think it will cost to attack, and it's kind of inviting people.
But like one of our goals is that it costs more to attack Anthropic than it costs to just train your own model, which doesn't guarantee things because, you know, of course you need the talent as well. So you might still, but, you know, attacks have risks, the diplomatic costs, you know, and they use up the very sparse resources that nation state actors might have in order to do the attacks.
So we're not there yet, by the way, but I think we're at a very high standard compared to the size of company that we are. Like, I think if you look at security for most 150 person companies, like I think there's just no comparison.
But could we resist if it was a state actor's top priority to steal our model weights? No, they would succeed.

How long does that stay true? Because at some point, the value keeps increasing and increasing.
And another part of this question is, what kind of a secret is how to train Claude 3 or Claude 2? Is it, with nuclear weapons, for example, we have lots of spies, you just take a blueprint across and that's the implosion device and that's what you need. Here, is it more tacit, like the thing you're talking about with biology, you need to know how these reagents work? Is it just like you got the blueprint, you got the microarchitecture and the hyperparameters?

There are some things that are like, you know, a one-line equation, and there are other things that are more complicated. And I think compartmentalization is the best way to do it. Just limit the number of people who know about something. If you're a thousand person company and everyone knows every secret, like, one, I guarantee you have a leaker, and two, I guarantee you have a spy, like a literal spy.

Okay, let's talk about alignment, and let's talk about mechanistic interpretability, which is the branch which you guys specialize in. While you're answering this question, you might want to explain what mechanistic interpretability is. But just the broader question is, mechanistically, what is alignment? Is it that you're locking in the model into a benevolent character? Are you disabling deceptive circuits and procedures? Like, what concretely is happening when you align a model?

I think as with most things, you know, when we actually train a model to be aligned, we don't know what happens inside the model, right? There are different ways of training it to be aligned, but I think we don't really know what happens. I mean, I think all the current methods that involve some kind of fine tuning, of course, have the property that the underlying knowledge and abilities that we might be worried about don't disappear. It's just, you know, the model is just taught not to output them. I don't know if that's a fatal flaw or if, you know, that's just the way things have to be. I don't know what's going on inside mechanistically. And I think that's the whole point of mechanistic interpretability, to really understand what's going on inside the models at the level of individual circuits.

Eventually when it's solved, what does the solution look like? What is it the case where, if it's Claude 4, you do the mechanistic interpretability thing and you're like, I'm satisfied.
It's aligned. What is it that you've seen?

Yeah.
So I think we don't know that yet. I think we don't know enough to know that yet.
I mean, I can give you a sketch for like what the process looks like as opposed to what the final result looks like. So I think verifiability is a lot of the challenge here, right? We have all these methods that purport to align AI systems and do succeed at doing so for today's tasks.
But then the question is always, if you had a more powerful model or if you had a model in a different situation, would it be aligned? And so I think this problem would be much easier if you had an Oracle that could just scan a model and say like, okay, I know this model is aligned. I know what it'll do in every situation.
Then the problem would be much easier. And I think the closest thing we have to that is something like mechanistic interpretability.
It's not anywhere near up to the task yet. But I guess I would say, I think of it as almost like an extended training set and an extended test set, right? Everything we're doing, all the alignment methods we're doing are the training set, right? You can run tests in them, but will it really work out of distribution? Will it really work in another situation? Mechanistic interpretability is the only thing that even in principle, and we're nowhere near there yet, but even in principle is the thing where it's more like an x-ray of the model than a modification of the model, right? It's more like an assessment than an intervention. And so somehow we need to get into a dynamic where we have an extended training set, which is all these alignment methods, and an extended test set, which is kind of like you x-ray the model and say like, okay, what worked and what didn't, in a way that goes beyond just the empirical tests that you've run, right? Where you're saying, what is the model going to do in these situations? What is it within its capabilities to do, instead of what did it do phenomenologically? And of course we have to be careful about that, right? One of the things I think is very important is we should never train for interpretability, because I think that's taking away that advantage, right? You even have the problem, you know, similar to like validation versus test set, where if you look at the x-ray too many times, you can interfere.
But I think that's a much weaker option. We should worry about that, but that's a much weaker process.
It's not automated optimization. We should just make sure, as with validation and test sets, that we don't look at the validation set too many times before running the test set.
But, you know, again, that's manual pressure rather than automated pressure. And so some solution where it's like, we have some dynamic between the training and test set where we're trying things out and we really figure out if they work via a way of testing them that the model isn't optimizing against, some orthogonal way. And I think we're never going to have a guarantee, but some process where we do those things together, again, not in a stupid way, there's lots of stupid ways to do this where you fool yourself, but like some way to put extended training for alignment ability with extended testing for alignment ability together in a way that actually works.

I still don't feel like I understand the intuition for why you think this is likely to work or this is promising to pursue. And let me ask the question in a sort of more specific way, and excuse the tortured analogy, but listen, if you're an economist and you want to understand the economy, so you send a whole bunch of microeconomists out there.
And one of them studies how the restaurant business works. One of them studies how the tourism business works.
You know, one of them studies how the baking works. And at the end, they all come together.
And you still don't know whether there's going to be a recession in five years or not. Why is this not like that, where you have an understanding of, we understand how induction heads work in a two-layer transformer.
We understand, you know, modular arithmetic. How does this add up to, does this model want to kill us? Like what does this model fundamentally want?

A few things on that.
I mean, I think that's like the right set of questions to ask. I think what we're hoping for in the end is not that we'll understand every detail, but again, I would give like the x-ray or the MRI analogy, that we can be in a position where we can look at the broad features of the model and say like, is this a model whose internal state and plans are very different from what it externally represents itself to do, right? Is this a model where we're uncomfortable that, you know, far too much of its computational power is devoted to doing what look like fairly destructive and manipulative things? Again, we don't know for sure whether that's possible, but I think there are some at least positive signs that it might be possible. Again, the model is not intentionally hiding from you, right? It might turn out that the training process hides it from you. And, you know, I can think of cases where the model is really superintelligent, it like thinks in a way so that it affects its own cognition. I suspect we should think about that, we should consider everything. I suspect that it may roughly work to think of the model as, you know, if it's trained in the normal way, just getting to just above human level, we should check, it may be a reasonable assumption that the internal structure of the model is not intentionally optimizing against us. And I give an analogy like to humans.
So it's actually possible to, you know, to look at an MRI of someone and predict above random chance whether they're a psychopath. There was actually a story a few years back about a neuroscientist who was studying this.
And he looked at his own scan and discovered that he was a psychopath. And then everyone in his life was like, no, no, no, that's just obvious.
Like, you're a complete asshole. Like you must be a psychopath.
And he was totally unaware of this. The basic idea is that, you know, there can be these macro features, like psychopath is probably a good analogy for it, right?
Like, you know, this is what we'd be afraid of: a model that's kind of like charming on the surface, very goal oriented, and, you know, very dark on the inside. You know, on the surface, their behavior might look like the behavior of someone else, but their goals are very different.

A question somebody might have is, listen, you know, you mentioned earlier the importance of being empirical. And in this case, you're trying to estimate, you know, listen, are these activations sus? But is this something we can afford to be empirical about, or do we need like a very good first-principles theoretical reason to think, no, it's not just that these MRIs of the model correlate with, you know, being bad.
We need just like some, just like, deep root math proof that this is aligned.

So it depends what you mean by empirical.
I mean, a better term would be phenomenological. Like, I don't think we should be purely phenomenological.
And like, you know, here are some brain scans of like, really dangerous models. And here are some brain scans.
I think the whole idea of mechanistic interpretability is to look at the underlying principles and circuits. But I guess the way I think about it is like, on one hand, I've actually always been a fan of studying these circuits at the lowest level of detail that we possibly can.
And the reason for that is kind of that's how you build up knowledge, even if you're ultimately aiming for there's too many of these features, it's too complicated. At the end of the day, we're trying to build something broad and we're trying to build some broad understanding.
I think the way you build that up is by trying to make a lot of these very specific discoveries. Like you have to, you have to understand the building blocks and then you have to figure out how to kind of use that to draw these broad conclusions, even if you're not going to figure out everything.
You know, I think you should probably talk to Chris Olah, who would have much more detail, right? This is my kind of high level thinking on it. Like Chris Olah controls the interpretability agenda.
Like, you know, he's the one who decides what to do on interpretability. This is my high level thinking about it, which is not going to be as good as his.

Does the bull case on Anthropic rely on the fact that mechanistic interpretability is helpful for capabilities?

I don't think so at all. Now, I do think, in principle, it's possible that mechanistic interpretability could be helpful with capabilities.
We might, for various reasons, not choose to talk about it if that were the case. That wasn't something that I thought of, or that any of us thought of, at the time of Anthropic's founding.
I mean, we thought of ourselves as like, you know, we're people who are like good at scaling models and good at doing safety on top of those models.
And like, you know, we think that we have a very high talent density of folks who are good at that. And, you know, my view has always been talent density beats talent mass.
And so, you know, that's more of our bull case. Talent density beats talent mass.
I don't think it depends on some particular thing. Like others are starting to do mechanistic interpretability now, and I'm very glad that they are.
You know, a part of our theory of change is, paradoxically, to make other organizations more like us.

Talent density, I'm sure, is important.

But another thing Anthropic has emphasized is that you need to have frontier models in order to do safety research. And of course, like actually be a company as well.
The current frontier models, somebody might guess, like GPT-4 or Claude 2, cost like $100 million or something like that. That general order of magnitude in very broad terms is not wrong.
But, you know, two to three years from now, with the kinds of things you're talking about, we're talking more and more orders of magnitude to keep up with that. And if it's the case that safety requires being on the frontier, I mean, what is the case in which Anthropic is competing with these Leviathans to stay on that same scale?
I mean, I think it's a situation with a lot of tradeoffs, right? I think it's not easy. I guess to go back, maybe I'll just answer the questions one by one, right? So to go back to, you know, why is safety so tied to scale? Right.
Some people don't think it is, but like, if I, if I just look at like, you know, where, where, where have been, where have been the areas that, you know, you know, I don't know, like safety methods have like been put into practice or like worked for something, for anything, even if we don't think they'll work in general. You know, I go back to thinking of all the ideas, you know, something like, you know, debate and amplification, right? You know, back in 2018, when we wrote papers about those at OpenAI, it was like, well, human feedback isn't quite going to work, but, you know, debate and amplification will take us beyond that.

But then if you actually look at it, and we've, you know, done attempts to do debates, we're really limited by the quality of the model, where it's like, you know, for two models to have a debate that is coherent enough that a human can judge it so that the training process can actually work, you need models that are at or maybe even beyond, on some topics, the current frontier. Now, you can come up with a method.
You can come up with the idea without being on the frontier. But for me, that's a very small fraction of what needs to be done.
It's very easy to come up with these methods. It's very easy to come up with like, oh, the problem is X, maybe a solution is Y.
But I really want to know whether things work in practice, even for the systems we have today. And I want to know what kinds of things go wrong with them.
I just feel like you discover 10 new ideas and 10 new ways that things are going to go wrong by trying these in practice. And that empirical learning, I think it's just not as widely understood as it should be.
I would say the same thing about methods like constitutional AI. And some people say, oh, it doesn't matter.
We know this method doesn't work. It won't work for pure alignment.
I neither agree nor disagree with that. I think that's just kind of overconfident.
The way we discover new things and understand the structure of what's going to work and what's not is by playing around with things. Not that we should just kind of blindly say, oh, this worked here and so it'll work there.
But you really start to understand the patterns, like with the scaling laws. Even mechanistic interpretability, which might be the one area I see where a lot of progress has been made without the frontier models.
We're seeing in the work that, say, OpenAI put out a couple months ago that using very powerful models to help you auto-interpret the weak models, again, that's not everything you can do in interpretability, but that's a big component of it. And we found it useful too.
And so you see this phenomenon over and over again, where it's like, you know, the scaling and the safety are these two snakes that are like coiled with each other, always even more than you think, right? You know, with interpretability, like I think three years ago, I didn't think that this would be as true of interpretability, but somehow it manages to be true. Why? Because intelligence is useful.
It's useful for a number of tasks. One of the tasks it's useful for is like figuring out how to judge and evaluate other intelligence.
And maybe someday even for doing the alignment research itself. Given all that's true, what does that imply for Anthropic when in two to three years, these Leviathans are doing like $10 billion training runs?
Choice one is if we can't or if it costs too much to stay on the frontier, then we shouldn't do it. And we won't work with the most advanced models.
We'll see what we can get with models that are not quite as advanced. I think you can get some value there, like non-zero value, but I'm kind of skeptical that the value is all that high or the learning can be fast enough to really be in favor of the task.
The second option is you just find a way. You just accept the trade-offs.
And I think the trade-offs are more positive than they appear because of a phenomenon that I've called race to the top. I could go into that later, but I'll just let me put that aside for now.
And then I think the third phenomenon is, you know, as things get to that scale, I think this may coincide with, you know, starting to get into some non-trivial probability of very serious danger. Again, I think it's going to come first from misuse, the kind of bio stuff that I talked about, but I don't think we have the level of autonomy yet to worry about some of the alignment stuff happening in like two years, but it might not be very far behind that at all.
That may lead to unilateral or multilateral or government-enforced decisions, which we support, not to scale as fast as we could. That may end up being the right thing to do.
So, you know, actually, I kind of hope things go in that direction. And then we don't have this hard trade-off between we're not on the frontier and we can't quite do the research as well as we want or influence other orgs as well as we want, versus we're kind of on the frontier and have to accept the trade-offs, which are net positive but have a lot in both directions.
Okay, on the misuse versus misalignment, those are both problems, as you mentioned, but in the long scheme of things, what are you more concerned about? Like 30 years down the line, which do you think will be considered a bigger problem?

I think it's much less than 30 years, but I'm worried about both. I don't know.
If you have a model that could, in theory, you know, take over the world on its own, if you were able to control that model, then it follows pretty simply that if a model was following the wishes of some small subset of people and not others, then those people could use it to take over the world on their behalf.
The very premise of misalignment means that we should be worried about misuse as well, with similar levels of consequences. But some people who might be more doomer-y than you would say, with misuse, you're already working towards the optimistic scenario there, because you've at least figured out how to align the model with the bad guys. Now you just need to make sure that it's aligned with the good guys instead. Why do you think that you could get to the point where it's aligned with the bad... you know, you haven't already solved it? I guess if you had the view that alignment is completely unsolvable, then, you know, you'd be like, well, we're dead anyway.
So I don't want to worry about misuse. That's not my position at all.
But also, like, you should think in terms of what's a plan that would actually succeed, that would make things good. Any plan that actually succeeds, regardless of how hard misalignment is to solve, is going to need to solve misuse as well as misalignment.
It's gonna need to solve the fact that like, as the AI models get better, you know, faster and faster, they're gonna create a big problem around the balance of power between countries. They're gonna create a big problem around, is it possible for a single individual to do something bad that it's hard for everyone else to stop? Any actual solution that leads to a good future needs to solve those problems as well.
If your perspective is we're screwed because we can't solve the first problem, so don't worry about problems two and three, that's not really a statement that you shouldn't worry about problems two and three. They're in our path no matter what.
Yeah, in this scenario, we succeed. We have to solve all of this.
Yeah, we might as well operate. We should be planning for success, not for failure.
If misuse doesn't happen and the right people have the superhuman models, what does that look like? Like who are the right people? Who is actually controlling the model five years from now? Yeah, I mean, my view is that these things are powerful enough that I think, you know, it's going to involve, you know, a substantial role or at least involvement of, you know, some kind of government or assembly of government bodies. Again, like, you know, there are kind of very naive versions of this.
Like, you know, I don't think we should just, you know, I don't know, like hand the model over to the UN or whoever happens to be in office at a given time. Like I could see that go poorly, but there it's, it's too powerful.
There needs to be some kind of legitimate process for managing this technology, which, you know, includes the role of the people building it, includes the role of democratically elected authorities, includes the role of, you know, all the individuals who will be affected by it. So at the end of the day, there needs to be some politically legitimate process.
But what does that look like, if it's not the case that you just hand it to whoever the president is at the time? What does the body look like? I mean, these are things it's really hard to know ahead of time. Like, I think, you know, people love to kind of propose these broad plans and say, like, oh, this is the way we should do it.
I think the honest fact is that we're figuring this out as we go along, and that, you know, anyone who says, you know, this is the body, we should create this kind of body modeled after this thing...
Like I think we should try things and experiment with them with less powerful versions of the technology. We need to figure this out in time, but also it's not really the kind of thing you can know in advance. The Long-Term Benefit Trust that you have, how would that interface with this body? Is that the body itself? If not, is it... just for context, you might want to explain what it is for the audience. I don't know, I think that the Long-Term Benefit Trust is a much narrower thing. This is something that makes decisions for Anthropic. So this is basically a body, described in a recent Vox article, and we'll be saying more about it, you know, later this year. But it's basically a body that over time gains the ability to appoint the majority of the board seats of Anthropic.
And this is so, you know, it's a mixture of experts in, I'd say, like AI alignment, national security, and philanthropy in general. But if control is handed to them of Anthropic, that doesn't imply that control of, if Anthropic has AGI, the control of AGI itself is handed to them.
That doesn't imply that Anthropic or any other entity should be the entity that makes decisions about AGI on behalf of humanity. I would think of those as different.
I mean, there's lots of, you know, like if Anthropic does play a broad role, then you'd want to widen that body to be, you know, a whole bunch of different people from around the world.
Or maybe you construe this as very narrow. And then, you know, there's some broad committee somewhere that manages all the AGIs of all the companies on behalf of anyone.
I don't know. Like, I think my view is you shouldn't be sort of overly constructive and utopian.
Like, we're dealing with a new problem here. We need to start thinking now about, you know, what are the governmental bodies and structures that could deal with it?
Okay. So let's forget about governance.
Let's just talk about what this going well looks like. Obviously, there's the things we can all agree on, you know, cure all the diseases, you know, solve all the problems.
Things all humans would say, I'm down for that. But now it's 2030.
You've solved all the real problems that everybody can agree on. What happens next? What are we doing with a superhuman God? I think I actually want to like, I don't know, like disagree with the framing or something like this.
I actually get nervous when someone says, like, what are you going to do with the superhuman AI? We've learned a lot of things over the last 150 years about markets and democracy, and that each person can kind of define for themselves what the best way for them to have the human experience is, and that, you know, societies work out norms and what they value in this very complex and decentralized way. Now again, if you have these safety problems, that can be a reason why, you know, and especially from the government, there needs to be, maybe until we've solved these problems, a certain amount of centralized control. But as a matter of, like, we've solved all the problems, now how do we make things good? I think that most people, most groups, most ideologies that started with, like, let's sit down and think over what the definition of a good life is,
I think most of those have led to disaster. But so this vision you have of a sort of tolerant, liberal democracy, market-oriented system with AGI, like what is, each person has their own AGI? Like, what does that mean? I don't know.
I don't know what it looks like, right? Like, I guess what I'm saying is like, we need to solve the kind of important safety problems and the important externalities. And then subject to that, you know, which again, those could be just narrowly about alignment.
There could be a bunch of economic issues that are super complicated and that we can't solve. Subject to that, we should think about what's worked in the past.
And I think in general, unitary visions for what it means to live a good life have not worked out well at all. On the opposite end of things going well or good actors having control of AI, we might want to touch on China as a potential actor in the space.
First of all, being at Baidu and seeing progress in AI happening generally, why do you think the Chinese have underperformed? Baidu had a scaling laws group many years back. Or is the premise wrong? And I'm just not aware of the progress that's happening there.
Well, for the scaling laws group, I mean, that was an offshoot of the stuff we did with speech. So, you know, there were still some people there, but that was a mostly Americanized lab.
I mean, I was there for a year. That was, you know, my first foray into deep learning.
It was led by Andrew Ng. I never went to China.
Mostly, you know, it was like a US lab. So I think that was somewhat disconnected, although it was an attempt by, you know, a Chinese entity to kind of get into the game.
But I don't know. I think since then, you know, I couldn't speculate, but I think they've been maybe very commercially focused and not as focused on this kind of fundamental research side of things around scaling laws.
Now, I do think because of all the excitement with the release of ChatGPT in November or so, that's been a starting gun for them as well. And they're trying very aggressively to catch up now.
I think the US is quite substantially ahead, but I think they're trying very hard to catch up now. How do you think China thinks about AGI? Are they thinking about safety and misuse or not? I don't really have a sense.
You know, one concern I would have are people say things like, well, China isn't going to develop an AI because, you know, they like stability or, you know, they're going to have all these restrictions to make sure things are in line with what the CCP wants. You know, that might be true in the short term and for consumer products.
My worry is that if the basic incentives are about national security and power, that's going to become clear sooner or later. And so, you know, I think they're going to, if they see this as, you know, a source of national power, they're going to at least try to do what's most effective.

And that, you know, that could lead them in the direction of AGI.
At what point, like, is it possible for them, they just get your blueprints or your code base or something, that they can just spin up their own lab that is competitive at the frontier with the leading American companies?
Well, I don't know about fast, but I'm like, I'm concerned about this.

So this is one reason why we're focusing so hard on cybersecurity. You know, we've worked with our cloud providers.
We really, you know, like, you know, we had this blog post out about security where we said, you know, we have a two key system for access to the model weights. We have other measures that we put in place or thinking of putting in place that, you know, we haven't announced.
We don't want an adversary to know about them, but we're happy to talk about them broadly. All this stuff we're doing is, by the way, not sufficient yet for a super determined state-level actor at all.
I think it will defend against most attacks and against a state-level actor who's not, you know, who's less determined. But there's a lot more we need to do, and some of it may require new research on how to do security.
Okay, so let's talk about what it would take at that point. You know, we're at Anthropic's offices, and, you know, it's got good security.
We had to get badges and everything to come in here. But the eventual version of this building or bunker or whatever, where the AGI is built, I mean, what does that look like? Are we, is it a building in the middle of San Francisco or is it you're out in the middle of Nevada or Arizona? Like, what is the point in which you're like Los Alamos-ing it? At one point, there was a running joke somewhere that, you know, the way building AGI would look like is, you know, there would be a data center next to a nuclear power plant next to a bunker.
Yeah. And, you know, that we'd all kind of live in the bunker and everything would be local so it wouldn't get on the internet.
You know, again, if we, you know, if we take seriously the rate at which the, you know, rate at which all this is going to happen, which I don't know, I can't be sure of it. But if we take that seriously, then, you know, it does make me think that maybe not something quite as cartoonish as that, but that something like that might happen.
What is the timescale on which you think alignment is solvable? If like these models are getting to human level in some things in two to three years, what is the point at which they're aligned? I think this is a really difficult question, because I actually think often people are thinking about kind of alignment in the wrong way. I think there's a general feeling that it's like models are misaligned or like there's like an alignment problem to solve, kind of like the Riemann hypothesis or something.
Like someday we'll crack the Riemann hypothesis. I don't quite think it's like that.
Not in a way that's worse or better. It might be just as bad or just as unpredictable.
When I think of like, you know, why am I scared? A few things I think of. One is, look, like, I think the thing that's really hard to argue with is like, there will be powerful models.
They will be agentic. We're getting towards them.
If such a model wanted to wreak havoc and destroy humanity or whatever, I think we have basically no ability to stop it. Like that's, I think just, if that's not true at some point, it'll continue to be true as we, you know, it will reach the point where it's true as we scale the models.
So that definitely seems the case. And I think a second thing that seems the case is that we seem to be bad at controlling the models, not in any particular way, but just that they're statistical systems.
And you can ask them a million things and they can say a million things in reply. And, you know, you might not have thought of a millionth of one thing that does something crazy.
Or when you train them, you train them in this very abstract way and you might not understand all the consequences of what they do in response to that. I mean, I think the best example we've seen of that is like Bing and Sydney, right? Where it's like, I don't know how they trained that model.
I don't know what they did to make it do all this weird stuff, like, you know, threaten people and, you know, have this kind of weird obsessive personality. But what it shows is that we can get something very different from and maybe opposite to what we intended.
And so I actually think facts number one and fact number two are like enough to be really worried. Like you don't need all this detailed stuff about, you know, converging instrumental goals or, you know, analogies to evolution.
Like actually one and two for me are pretty motivating. I'm like, okay, this thing's gonna be powerful, it could destroy us, and all the ones built so far, you know, are at pretty decent risk of doing some random shit we don't understand. Yeah, if I agree with that and I'm like, okay, I'm concerned about this: the research agenda you have of mechanistic interpretability plus, you know, constitutional AI and the other RLHF stuff, if you say that we're going to get something with like bioweapons or something that could be dangerous in two to three years...
Yes. Do these things culminate within two to three years of actually meaningfully contributing to preventing? Yes.
So I think, I think where I was going to go with this is like, you know, people talk about like doom by default or alignment by default. Like I think it might be kind of statistical.
Like, you know, like you might get, you know, with the current models, you might get Bing or Sydney or you might get Claude. And it doesn't really matter because Bing or Sydney, like if we take our current understanding and, you know, move that to very powerful models, you might just be in this world where it's like, okay, you make something and depending on the details, maybe it's totally fine.
You know, not really alignment by default, but just kind of like it depends on a lot of the details. And like, if you're very careful about all those details, you know what you're doing, you're getting it right.
But we have a high susceptibility to you mess something up in a way that you didn't really understand was connected to, actually, instead of making all the humans happy, it wants to, you know, turn them into pumpkins. Yeah, I just some weird shit, right? Because the models are so powerful, you know, they're like these kind of giants that are, you know, they're, they're like, you know, they're standing in a landscape.
And if they start to move their arms around randomly, they could just break everything. I guess I'm starting it with that with that kind of framing, because it's not like, I don't think we're aligned by default.
I don't think we're doomed by default and have some problem we need to solve. It has some kind of different character.
Now, what I do think is that hopefully within a timescale of two to three years, we get better at diagnosing when the models are good and when they're bad. We get better at training, increasing our repertoire of methods to train the model that they're less likely to do bad things and more likely to do good things in a way that isn't just relevant to the current models, but scales.
And we can help develop that with interpretability as the test set. I don't think of it as, oh man, we tried RLHF.
It didn't work. We tried constitutional.
It didn't work. Like we tried this other thing.
It didn't work. We tried mechanistic interpretability.
Now we're going to try mechanistic. I think this frame of like, man, we haven't cracked the problem yet.
We haven't solved the Riemann hypothesis isn't quite right. I think of it more as already with today's systems, we are not very good at controlling them.
And the consequences of that could be very bad. We just need to get more ways of increasing the likelihood that we can control our models and understand what's going on in them.
And we have some of them so far. They aren't that good yet.
But I don't think of this as a binary of works and not works. We're going to develop more, and I do think that over the next two to three years we're going to start eating that probability mass of ways things can go wrong. You know, it's kind of like in the core safety views paper, there's probability mass of how hard the problem is. I feel like that way of saying it isn't really even quite right, right? Because I don't feel like it's the Riemann hypothesis to solve.
I just feel like, you know, it's almost like right now, if I try and, you know, juggle five balls or something, I can juggle three balls, right? I actually can, but I can't juggle five balls at all, right? You have to practice a lot to do that. If I were to do that, I would almost certainly drop them.
And then just over time, you just get better at the task of controlling the balls. On that post in particular, what is your personal probability distribution over...
So for the audience, the three possibilities are: it is trivial to align these models with RLHF++; it is a difficult problem, but one that a big company could solve; and it is something that is basically impossible for human civilization currently to solve. If I'm capturing those three, what is your probability distribution over those three, personally? Yeah, I mean, I'm not super into, like, what's your probability distribution over X. I think all of those have enough likelihood that, you know, they should be considered seriously. The question I'm much more interested in is what could we learn that shifts probability mass between them. What is the answer to that? I think that one of the things mechanistic interpretability is going to do, more than necessarily solve problems, is it's going to tell us what's going on when we try to align models.
I think it's basically going to teach us about this. One way I could imagine concluding that things are very difficult is if mechanistic interpretability sort of shows us that, I don't know, problems tend to get moved around instead of being stamped out.
Or that you get rid of one problem, you create another one. Or it might inspire us or give us insight into why problems are kind of persistent or hard to eradicate or crop up.
Like for me to really believe some of these stories about like, you know, oh, something will always, you know, there's always this convergent goal in this particular direction. I think the abstract story is, it's not uncompelling, but I don't find it really compelling either, nor do I find it necessary to motivate all the safety work.
But like the kind of thing that would really be like, oh man, we can't solve this, is like we see it happening inside the x-ray. Yeah, because I think right now there's way too many assumptions. There's way too much overconfidence about how all this is going to go.
I have a substantial probability mass on this all goes wrong. It's a complete disaster, but in a completely different way than anyone had anticipated.

It would be beside the point to ask, like, how could it go different than anyone anticipated? So on this in particular, what information would be relevant? How much would the difficulty of aligning Claude 3 and the next generation of models basically be, like, is that a big piece of information? Is that not going to be big?
So I think the people who are most worried are predicting that all the subhuman AI models are going to be alignable, right? They're going to seem aligned.
They're going to deceive us in some way. I think it certainly gives us some information, but I am more interested in what mechanistic interpretability can tell us.
Because, again, like you see this x-ray, it would be too strong to say it doesn't lie. But at least in the current systems, it doesn't feel like it's optimizing against us.
There are exotic ways that it could. You know, I don't think anything is a safe bet here.
But I think it's the closest we're going to get to something that isn't actively optimizing against us. Let's talk about the specific methods other than mechanistic interpretability that you guys are researching.
When we talk about RLHF or Constitutional AI, whatever, RLHF++, if you had to put it in terms of human psychology, what is the change that is happening? Are we creating new drives, new goals, new thoughts? How is the model changing in terms of psychology? I think all those terms are kind of inadequate for, you know, describing what's... it's not clear how useful they are as abstractions for humans either. I think we don't have the language to describe what's going on.
And again, I'd love to have the x-ray. I'd love to look inside and kind of actually know what we're talking about instead of, you know, basically making up words, which is what I do, and what you're doing in asking this question, where, you know, we should just be honest.
We really have very little idea what we're talking about. So, you know, it would be great to say, well, what we actually mean by that is, you know, this circuit within here turns, you know, turns on and, you know, and, you know, after we've trained the model, then, you know, this circuit is no longer operative or weaker.
And, you know, we'd love to be able to say that. Again, it's going to take a lot of work to be able to do that. Model organisms, which you hinted at before when you said we're doing these evaluations to see if they're capable of, you know, doing dangerous things now, and currently not.
How worried are you about a lab leak scenario where, in fine-tuning it or in trying to get these models to elicit dangerous behaviors, you know, make bioweapons or something, it leaks somehow and actually makes the bioweapon instead of telling you it can make the bioweapon?
With today's passive models, you know, chatbots, it's not so much of a concern, right? Because it's like, you know, if we were to fine-tune a model to do that, we'd do it privately and work with the experts.
And so, you know, the leak would be like, you know, suppose the model got open sourced or something and, you know, and then someone. So I think for now, it's mostly a security issue.
In terms of models truly being dangerous, I mean, you know, I think we do have to worry that it's like, you know, if we make a truly powerful model and we're trying to like see what makes it dangerous or safe, then there could be more of a one-shot thing where it's like, you know, some risk that the model takes over. I think the main way to control that is to make sure that the capabilities of the model that we test are not such that they're capable of doing this.
At what point would the capabilities be so high where you say, I don't even want to test this? Oh, well, there's different things. I mean, there's capability testing.
But that itself could lead to, if you're testing whether it can replicate, what if it actually does? I think what you want to do is you want to extrapolate. So we've talked with ARC about this, right? You know, you have like factors of two of compute or something where, you know, you're like, okay, can the model do something like, you know, open up an account on AWS and make some money for itself.
Like some of the things that are like obvious prerequisites to like complete survival in the wild.

And so just set those thresholds very well, you know, kind of very well below.

And then as you proceed upward from there, do kind of more and more rigorous tests and be more and more careful about what it is you're doing. On Constitutional AI, and feel free to explain what this is for the audience, but who decides what the constitution for the next generation of models or a potentially superhuman model is? How is that actually written? I think initially, you know, to make the constitution, we just took some stuff that was like broadly agreed on, like the UN charter of, you know, UN declaration on human rights and, you know, some of the stuff from Apple's terms of service, right? Stuff that's like, you know, consensus on like what's acceptable to say or like, you know, what basic things are able to be included.
So one, I think for future constitutions, we're looking into like more participatory processes for making these. But I think beyond that, I don't think there should be like one constitution for like a model that everyone uses.
Like probably a model's constitution should be very, very simple, right? It should only have very basic facts that everyone would agree on. And then there should be a lot of ways that you can customize, including appending, you know, constitutions.
And, you know, I think beyond that, we're developing new methods, right? This is, you know, I'm not imagining that this or this alone is the method that we'll use to train superhuman AI, right? Many of the parts of capability training may be different. And so it could look very different.
And again, I'd go there like there are levels above this. Like I'm pretty uncomfortable with like, here's the AI's constitution.
It's going to run the world. Like that, you know, again, like just normal lessons from like how societies work and how politics works like that.

That just kind of, yeah, that strikes me as fanciful.

Like, you know, I think we should try to hook these things into, you know, even when they're very powerful. Again, after we've mitigated the safety issues, like any good future, even if it has all these security issues that we need to solve, it somehow needs to end with something that's more decentralized and less like a godlike super.
I just don't think that ends well. What scientists from the Manhattan Project do you respect most in terms of they acted most ethically under the constraints they were given? Is there one that comes to mind? I don't know.
I mean, you know, I think there's a lot of answers you could give. I mean, I'm definitely a fan of Szilard for having kind of figured it out.
He was then, you know, against the actual dropping of the bomb. I don't actually know the history well enough to have an opinion on whether, you know, demonstration of the bomb could have ended the war.
I mean, that involves a bunch of facts about Imperial Japan that are, you know, that are complicated and that I'm not an expert on. But, you know, Szilard seemed to, you know, he discovered this stuff early.
He kept it secret, you know, patented some of it and put it in the hands of the British Admiralty. So, you know, he seemed to display the right kind of awareness as well as discovering stuff.
I mean, it was when I read that book that I kind of, you know, when I wrote this big blob of compute doc, I only showed it to a few people, and there were other docs that I showed to almost no one. So, you know, yeah, I was a bit inspired by this. Again, I mean, you know, we can all get self-aggrandizing here. Like, we don't know how it's going to turn out or if it's actually going to be something on par with the Manhattan Project. I mean, you know, this could all be just Silicon Valley people building technology and, you know, just kind of having delusions of grandeur. So I don't know how it's going to turn out.
I mean, if the scaling stuff is true, then it's much bigger than the Manhattan Project. Yeah, it certainly could be bigger.
I just, you know, we should always kind of, I don't know, maintain this attitude that it's really easy to fool yourself. If you were asked by the government, if you're a physicist during World War II and you were asked by the government to contribute non-replaceable research to the Manhattan Project, well, what do you think you would have said? Yeah, I mean, I think given you're in a war with the Nazis, at least during the period when you thought that the Nazis were, I don't, yeah, I don't really see much choice, but to do it, if it's possible, you know, you have to figure it's going to be done within 10 years or so by someone.
Regarding cybersecurity, what should we make of the fact that there's a whole bunch of tech companies which have ordinary tech company security policies where, publicly at least, it's not obvious that they've been hacked? Like Coinbase still has its Bitcoin.
You know, Google, as far as I know, my Gmail hasn't been leaked. Should we take from that, that current status quo tech company security practices are good enough for AGI or just simply that nobody has tried hard enough? It would be hard for me to speak to, you know, current tech company practices.
And of course, there may be many attacks that we don't know about where things are stolen and then silently used. You know, I mean, I think an indication of it is when someone really cares, basically cares about attacking someone, then often the attacks happen.
So, you know, recently we saw that some fairly high officials of the U.S. government had their email accounts hacked via Microsoft.
Microsoft was providing the email accounts. So, you know, presumably that related to information that was, you know, of great interest to, you know, to foreign adversaries.
And so it sounds, it seems to me at least, you know, that the evidence is more consistent with, you know, when something is really high enough value, then, you know, then, you know, someone acts and it's stolen. And my worry is that, of course, with AGI, we'll get to a world where, you know, the value is seen as incredibly high, right? That, you know, it'll be like stealing nuclear missiles or something.
You can't be too careful on this stuff. And, you know, at every place that I've worked, I push for the cybersecurity to be better.
One of my concerns about cybersecurity is, you know, it's not kind of something you can trumpet. I think a good dynamic with safety research is like, you know, you can get companies into a dynamic, and I think we have, where, you know, you can get them to compete to do the best safety research and, you know, kind of use it as a, I don't know, like a recruiting point of competition or something.
We used to do this all the time with interpretability, you know, and then sooner or later, other orgs started recognizing the defect and started working on interpretability, whether or not that was a priority to them before. But I think it's harder to do that with cybersecurity because a bunch of this stuff you have to do in quiet.
And so we did try to put out one post about it, but I think mostly you just see the results. You know, I think a good norm would be that people see the cybersecurity leaks from companies, or leaks of the model parameters or something, and say, they screwed up.
That's bad. If I'm a safety person, I might not want to work there.
Of course, as soon as I say that, we'll probably have a security breach tomorrow. But that's part of the game here, right? I think that's part of, you know, trying to make things safe. I want to go back to the thing we were talking about earlier, where the ultimate level of cybersecurity required for two to three years from now and whether it requires a bunker. Like, are you actually expecting to be in a physical bunker in two to three years, or is that just a metaphor? Yeah, I mean, I think that's a metaphor. You know, we're still figuring it out.
Like something I would think about is like, I think security of the data center, which may not be in the same physical location as us, but, you know, we've worked very hard to make sure it's in the United States. But securing the physical data centers and the GPUs, I think some of the really expensive attacks, if someone was really determined, just involve going into the data center and just, you know, trying to steal the data directly or as it's flowing from the data center to, you know, to us.
I think these data centers are going to have to be built in a very special way. I mean, given the way things are scaling up, you know, we're probably anyway heading to a world where, you know, the, you know, networks of data centers, you know, cost as much as aircraft carriers or something.
And so, you know, they're already going to be pretty unusual objects. But I think in addition to being unusual in terms of their ability, you know, to link together and train gigantic, gigantic models, they're also going to have to be very secure.
Speaking of which, how, you know, there's been sorts of rumors on the difficulty of procuring the power and the GPUs for the next generation of models. What has the process been like to secure the necessary components to do the next generation? That's something I can't go into great detail about.
You know, I will say, look, like, you know, people think of even industrial scale data centers, right? People are not thinking at the scale that I think these models are going to go to very soon. And so whenever you do something at a scale where it's never been done before, you know, every single component, every single thing has to be done in a new way than it was before.
And so, you know, you may run into problems with, you know, surprisingly simple components. Power is one that you mentioned.
And is this something that Anthropic has to handle, or can you just outsource it? You know, I mean, for data centers, we work with cloud providers, for instance. What should we make about the fact that these models require so much training and the entire corpus of internet data in order to be subhuman? Whereas, you know, for GPT-4, there's been estimates that, you know, it was like 10 to the 25 flops or something, whereas, I mean, you can take these numbers with a grain of salt, but there's reports that, you know, a human brain from the time it is born to the time a human being is 20 years old, that's like on the order of 10 to the 20 flops to simulate all those interactions.
We don't have to go into the particulars of those numbers, but should we be worried about how sample inefficient these models seem to be? Yeah. So I think that's one of the remaining mysteries.
One way you could phrase it is that the models are maybe two to three orders of magnitude smaller than the human brain, if you compare it to the number of synapses, while at the same time being trained on, you know, three to four more orders of magnitude of data. If you compare to, you know, number of words a human sees as they're developing to age 18, it's, I don't remember exactly, but I think it's in the hundreds of millions.
Whereas for the models, we're talking about the hundreds of billions to the trillions. So what explains this? There are these offsetting things where the models are smaller, they need a lot more data, and they're still below human level.
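To make the arithmetic in that comparison concrete, here is a minimal sketch in Python. The specific values are illustrative assumptions picked from inside the spoken ranges ("hundreds of millions" of words for a human, "hundreds of billions to the trillions" for models); they are not measured figures, and the sketch treats human words and model training tokens as comparable units, which is itself a simplification.

```python
import math

# Illustrative picks within the ranges mentioned above; assumptions, not measurements.
HUMAN_WORDS = 3e8          # "hundreds of millions" of words by age 18
MODEL_TOKENS_LOW = 3e11    # "hundreds of billions"
MODEL_TOKENS_HIGH = 3e12   # "the trillions"

def orders_of_magnitude(larger: float, smaller: float) -> float:
    """Return log10 of the ratio, i.e. the gap in orders of magnitude."""
    return math.log10(larger / smaller)

data_gap_low = orders_of_magnitude(MODEL_TOKENS_LOW, HUMAN_WORDS)
data_gap_high = orders_of_magnitude(MODEL_TOKENS_HIGH, HUMAN_WORDS)
print(f"data gap: {data_gap_low:.1f} to {data_gap_high:.1f} orders of magnitude")
```

With these picks the data gap lands at roughly three to four orders of magnitude, which is the range quoted in the conversation. Different picks inside the same spoken ranges would shift the result by a fraction of an order of magnitude.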
But so, you know, there's some way in which, you know, the analogy to the brain is not quite right, or is breaking down, or there's some missing factor. You know, this is just kind of like in physics, where it's like, you know, we can't explain the Michelson-Morley experiment or, like, I'm forgetting one of the other 19th century physics paradoxes.

But like, I think it's one thing we don't quite understand. Right.
Humans see so little data and they still do fine. One theory on it, it could be that, you know, it's like our other modalities.
You know, how do we get, you know, 10 to the 14th bits into the human brain? Well, most of it is kind of these images. And maybe a lot of what's going on inside the human brain is like, you know, our mental workspace involves all these, you know, these simulated images or something like that.
But honestly, I think intellectually, we have to admit that that's a weird thing that doesn't match up. And, you know, it's one reason I'm a bit, you know, skeptical of kind of biological analogies.
I thought in terms of them like five or six years ago, but now that we actually have these models in front of us as artifacts, it feels like almost all the evidence from that has been screened off by what we've seen. And what we've seen are models that are much smaller than the human brain and yet can do a lot of the things that humans can do and yet paradoxically require a lot more data.
So maybe we'll discover something that makes it all efficient or maybe we'll understand why the discrepancy is present. But at the end of the day, I don't think it matters, right? If we keep scaling the way we are, I think what's more relevant at this point is just measuring the abilities of the model and seeing how far they are from humans.
And they don't seem terribly far to me. Does this scaling picture and the big blob of compute more generally, does that underemphasize the role that algorithmic progress has played when you compose the big blob of compute? So, you know, you're talking about LSTMs presumably at that point.
Presumably the scaling on that would not have you at Claude 2 at this point. So are you underemphasizing the role that an improvement on the scale of the Transformer could be having here when you put it up behind the label of scaling? This big blob of compute document, which I still have not made public, I probably should for like historical reasons.
I don't think it would tell anyone anything they don't know now. But when I wrote it, I actually said, look, there are seven factors that and you know, I wasn't, I wasn't like, these are the factors, but I was just like, let me give some sense of the kinds of things that matter and what don't.
And so I wasn't thinking like, these are the, you know, there could be nine, there could be five, but like the things I said were, I said, number of parameters, scale of the model, like, you know, the compute and compute matters. Quantity of data matters.
Quality of data matters. Loss function matters.
So like, you know, are you doing RL or are you doing next word prediction? If your loss function isn't rich or doesn't incentivize the right thing, you won't get anything. So those were the key four ones, which I think are the core of the hypothesis.
But then I said three more things. One was symmetries, which is basically like if your architecture doesn't take into account the right kinds of symmetries, it doesn't work or it's very inefficient.
So, for example, convolutional neural networks take into account translational symmetry. LSTMs take into account time symmetry.
But a weakness of LSTMs is that they can't attend over the whole context. There's kind of this structural weakness. Like, if a model isn't structurally capable of absorbing and managing things that happened in a far enough distant past, then it's kind of like, you know, the compute doesn't flow, like the spice doesn't flow. It's like, the blob has to be unencumbered, right? It's not going to work if you artificially close things off.
And I think RNNs and LSTMs artificially close things off because they, they close you off to the distant past. And so again, things need to flow freely.
If they don't, it doesn't work. And then, you know, I added a couple of things.
One of them was like conditioning, which is like, you know, if you're, if the thing you're optimizing with is just really numerically bad, like you're going to have trouble. And so this is why like Adam works better than, you know, than normal SGD.
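The conditioning point can be illustrated with a toy problem. The sketch below is not from the conversation: it assumes a two-dimensional quadratic with curvatures 1000 and 1 (a badly conditioned bowl), uses noiseless gradient descent as a stand-in for "normal SGD", and writes out the standard Adam update by hand. All names, step sizes, and step counts are hypothetical choices for illustration.

```python
import math

# Assumed toy objective: f(x) = 0.5 * (1000 * x0**2 + 1 * x1**2).
CURVATURE = (1000.0, 1.0)  # one steep direction, one shallow direction

def grad(x):
    """Gradient of the toy quadratic at point x."""
    return [c * xi for c, xi in zip(CURVATURE, x)]

def run_gd(steps=200, lr=1e-3):
    """Plain gradient descent; lr must stay below 2/1000 or the steep axis diverges."""
    x = [1.0, 1.0]
    for _ in range(steps):
        g = grad(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

def run_adam(steps=200, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """Adam with bias correction; each coordinate is rescaled by its own gradient history."""
    x = [1.0, 1.0]
    m = [0.0, 0.0]  # running mean of gradients
    v = [0.0, 0.0]  # running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad(x)
        m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
        v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - b1 ** t) for mi in m]
        v_hat = [vi / (1 - b2 ** t) for vi in v]
        x = [xi - lr * mh / (math.sqrt(vh) + eps)
             for xi, mh, vh in zip(x, m_hat, v_hat)]
    return x

gd_x = run_gd()
adam_x = run_adam()
print("gradient descent:", gd_x)
print("adam            :", adam_x)
```

In this setup the steep direction caps the step size plain gradient descent can use, so the shallow coordinate barely moves in 200 steps, while Adam's per-coordinate scaling lets both coordinates make progress at a similar rate. This is only one simple picture of why conditioning matters, not a claim about what the original document argued in detail.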
And I think I'm forgetting what the seventh condition was, but it was, it was similar to things like this, where it's like, you know, if you, if you, if you set things up in kind of a way that's set up to fail or that doesn't allow the compute to work in an uninhibited way, then it won't work. And so Transformers were kind of within that even though I can't remember if the Transformer paper had been published.
It was around the same time as I wrote that document. It might have been just before.
It might have been just after. It from that view that the the the way to think about these algorithmic progresses is not as increasing the power of the blob of compute but simply getting rid of the artificial hindrances that older architectures have is that is that a fair that's that's a little that yeah that's that's a little how i think about it you know again if you go back to like ilia's like the models want to learn yeah yeah like like the compute wants to be free yeah and like you know it's being blocked in various ways where you like don't understand that it's being blocked and so you need to like free it up right right i love the the radians changing that to spice okay um on that point though so do you think that another thing on the scale of a transformer is coming down the pike to enable the next great iteration? I think it's possible.
I mean, people have worked on things like trying to model very long time dependencies. There's various different ideas where I could see that we're kind of missing an efficient way of representing or dealing with something.
So I think those inventions are possible. I guess my perspective would be, even if they don't happen, we're already on this very, very steep trajectory.
And so I'm less, I mean, we're constantly trying to discover them as are others. But things are already on such a fast trajectory.
All that would do is speed up the trajectory even more. And probably not by that much because it's already going so fast.
Is something embodied or having an embodied version of a model, is that at all important in terms of getting either data or progress? I think of that less in terms of the, you know, like a new architecture and more in terms of like a loss function. Like the data, the environments you're exposing yourself to end up being very different.
And so I think that could be important for learning some skills, although data acquisition is hard. And so things have gone through the language route.
And I would guess we'll continue to go through the language route, even as, you know, more is possible in terms of embodiment. And then the other possibility I mentioned, RL, you can see it as, yeah, I mean, we kind of already do RL with RLHF, right? People are like, is this alignment? Is this capabilities? I always think in terms of the two snakes, right? They're kind of often hard to distinguish.
So we already kind of use RL in these language models, but I think we've used RL less in terms of getting them to take actions and, you know, do things in the world. But, you know, when you take actions over a long period of time and understand the consequences of those actions only later, then, you know, RL is a typical tool we have for that. So I would guess that in terms of models taking action in the world, that RL will, you know, will become a thing with all the power and all the safety issues that come with it.
When you project out in the future, do you see the way in which these things will be integrated into productive supply chains? Do you see them talking with each other and criticizing each other and contributing to each other's output? Or is it just that one model one-shots the answer or the work? Models will undertake extended tasks. That will have to be the case.
I mean, we may want to limit that to some extent because it may make some of the safety problems easier. But some of that I think will be required.
In terms of, are models talking to models or are they talking to humans? Again, this goes kind of out of the technical realm and into the socio-cultural economic realm where my heuristic is always that it's very, very difficult to predict things. And so I feel like these scaling laws have been very predictable.
But then when you say like, well, you know, when is there going to be a commercial explosion in these models? Or what's the form it's going to be? Or are the models going to do things instead of humans or pairing with humans?

I feel like certainly my track record on predicting these things is terrible. But also, looking around, I don't really see anyone whose track record is great.
You mentioned how fast progress is happening, but also the difficulties of integrating within the existing economy into the way things work. Do you think there will be enough time to actually have large revenues from AI products before the next model is just so much better or we're in like a different landscape entirely? It depends what you mean by large, right? You know, I think multiple companies are already in the, you know, hundred million to billion per year range.
Will it get to the hundred billion or trillion range, you know, before... That stuff is just so hard to predict, right? And it's not even super well defined. Like, you know, I think right now there are companies that are throwing a lot of money at generative AI, you know, as customers.
And, you know, I think that's the right thing for them to do. And they'll, you know, they'll find uses for it, but it doesn't mean they're finding uses or the best uses from day one.
So even money changing hands is not quite the same thing as economic value being created. But surely you've thought about this from the perspective of Anthropic, where if these things are happening so fast, then it should be an insane valuation, right? Even us who have not been super focused on commercialization and more on safety.
I mean, you know, the graph goes up and it goes up, it goes up relatively quickly. Yeah.
So, you know, I can only imagine what's happening at, you know, the orgs where this is their singular focus. So it's certainly happening fast, but, you know, again, it's like, it's the exponential from the small base while the technology itself is moving fast.
So it's, it's kind of a race between how fast the technology is getting better and how fast it's integrated into the economy. And that, I think that's just a very unstable and turbulent process.
Both things are going to happen fast. But if you ask me exactly how it's going to play out, exactly what order things are going to happen, I don't know.
And I'm kind of skeptical of the ability to predict. I'm kind of curious with regards to Anthropic specifically.
Yes. You're a public benefit corporation.
Yes. And rightfully so, you want to make sure that this is an important technology.
Obviously, the only thing you want to care about is not shareholder value. But how do you talk to investors who are putting in like hundreds of millions, billions of dollars of money? Like, how do you get them to put in this amount of money without the shareholder value being the main concern? So I think the LTBT is, you know, the right thing on this, right? You know, I mean, we're going to talk more about the LTBT, but like some version of that has been in development since the beginning of Anthropic, even formally, right? And so, you know, from the beginning, you know, even as the body has changed in some ways, it's like from the beginning, it was like this body is going to exist.
And it's, you know, it's unusual. Like every traditional investor who invests in Anthropic, you know, has to, you know, looks at this.
Some of them are just like, whatever, you run your company how you want. Some of them are like, you know, oh my God, like this, this, you know, this body of random people or to them, random people could like, you know, could, could move Anthropic in a direction that's, you know, that's totally contrary to our, and now there are, there are legal limits on that, of course, but, you know, we have to have this conversation with every investor.
And then it gets into a conversation of, well, what are the kinds of things that, you know, that we might do that would be contrary to the, you know, to the interests of traditional investors. And just having those conversations has helped get everyone on the same page.
I want to talk about the physics and the fact that so many of the founders and the employees at Anthropic are physicists. What is the, I mean, we talked in the beginning about the scaling laws and how the power laws from physics are something you see here.
But, you know, what are the actual like approaches and ways of thinking from physics that seem to have carried over so well? Is it that the notion of effective theory is super useful? You know, what is going on here? I mean, I think part of it is just physicists learn things really fast. We have generally found that, you know, if we hire, you know, someone who is a, you know, physics PhD or something, that they can learn ML and contribute just very, very quickly in most cases.
And, you know, because several of our founders, myself, Jared Kaplan, Sam McCandlish were physicists, we knew a lot of other physicists. And so we were able to hire them.
And now there's, I don't know how many, it might be 30 or 40 of them here. ML is still not yet a field that has an enormous amount of depth.
And so they've been able to get up to speed very quickly. Are you concerned that there's, like, a lot of people who would have been doing physics or something, or would have gone into finance instead, and since Anthropic exists, they have now been recruited to go into AI? And, you know, you obviously care about AI safety, but, you know, maybe in the future they leave and they get funded to do their own thing. Is that a concern, that you're bringing more people into the ecosystem here? Yeah, I mean, you know, I think there's, like, a set of actions, you know, like we're causing GPUs to exist.
You know, there's a lot of kind of side effects that you can't currently control or that you just incur if you buy into the idea that you need to build frontier models. And that's one of them.
A lot of them would have happened anyway. I mean, finance was a hot thing 20 years ago.
So physicists were doing it. Now ML is a hot thing.
And, you know, it's not like we've caused them to do it when they had no interest previously. But, you know, again, you know, at the margin, you're kind of you're kind of bidding things up.
And, you know, a lot of that would have happened anyway. Some of it some of it wouldn't, but it's all part of the calculus.
Do you think that Claude has conscious experience? How likely is that? This is another of these questions that just seems very unsettled and uncertain. One thing I'll tell you is I used to think that we didn't have to worry about this at all until models were kind of like operating in rich environments, like not necessarily embodied, but like that, you know, they needed to like have a reward function and like have kind of long lived experience.
So I still think that might be the case, but the more we've looked at kind of these language models and particularly looked inside them to see things like induction heads, a lot of the cognitive machinery that you would need for active agents seems kind of already present in the base language models. So I'm not quite as sure as I was before that we're missing enough of the things that you would need. I think today's models just probably aren't smart enough that we should worry about this too much, but I'm not 100% sure about this, and I do think the models will get... In a year or two, like, this might be a very real concern. What would change if you found out that they are conscious? Are you worried that you're pushing the negative gradient to suffering? Like, what is... Conscious is, again, one of these words that I suspect will, like, not end up having a well-defined... But it's like something to be... Yeah. Well, I suspect that's a spectrum, right? So I don't know, if we discover that I should care about Claude's experience as much as I should care about, like, a dog or a monkey or something.
Yeah, I would be, I would be kind of, kind of worried. I don't know if their experience is positive or negative.
Unsettlingly, I also don't know, like, I wouldn't know if any intervention that we made was more likely to make Claude, you know, have a positive versus negative experience versus not having one. If there's an area that is helpful with this, it's maybe mechanistic interpretability because I think of it as neuroscience for models.
And so it's possible that we could shed some light on this. Although it's not a straightforward factual question, right? It kind of depends what we mean and what we value.
We talked about this initially, but I want to get more specific. We talked initially about, you know, now that you're seeing these capabilities ramp up within the human spectrum, you think that the human spectrum is wider than we thought.
But yeah, more specifically, how is the way you think about human intelligence different now that you're seeing these marginally useful abilities emerge? How does that change your picture of what intelligence is? I think for me, the big realization on what intelligence is came with the, like, blob of compute thing, right? Like it's not, you know, there might be all these separate modules. There might be all this complexity.
You know, Rich Sutton called it the bitter lesson, right? It has many names. It's been called the scaling hypothesis.
Like the first few people who figured it out was around 2017. I mean, you could go further back to, I think Shane Legg was maybe the first person who really knew it.
Maybe Ray Kurzweil, although in a very vague way. But, you know, I think the number of people who understood it went up a lot around 2014 to 2017.
But I think, I think that was. It's like, well, how did intelligence evolve? Well, if you don't need very specific conditions to create it, if you can create it just from the right kind of gradient loss signal, then of course it's not so mysterious how it all happened.
It had this click of scientific understanding. In terms of like watching what the models can do, how has it changed my view of human intelligence? I wish I had something more intelligent to say on that.
I feel like, I don't know, one thing that's been surprising is like, I thought things might click into place a little more than they do. Like, you know, I thought like different cognitive abilities might all be connected and there was more of one secret behind them.
But it's like, the model just learns various things at different times, you know, and it can be like very good at coding, but like, you know, it can't quite, you know, prove the prime number theorem yet. And I mean, I guess it's a little bit the same for humans, although it's weird, the juxtaposition of things it can do and not.
I guess the main lesson is like having theories of intelligence or how intelligence works. Like again, a lot of these words just kind of like dissolve into a continuum, right? They just kind of like dematerialize.
I think less in terms of intelligence and more in terms of what we see in front of us. Yeah, no, it's really surprising to me.
Two things. One is how discrete these different parts of intelligence, things that contribute to loss, are, rather than just being like one reasoning circuit or one general intelligence.
And the other thing talking with you that is surprising or interesting is many years from now, it'll be one of those things that looking back, it'll be, why wasn't this obvious to you? If you're seeing these smooth scaling curves, why at the time were you not completely convinced? So you've been less public than the CEOs of other AI companies. You know, you're not posting on Twitter.
You're not doing a lot of podcasts except for this one. What gives? Like, why are you off the radar? Yeah, I aspire to this and I'm proud of this.
If people think of me as kind of boring and low profile, this is actually kind of what I want. So I don't know.
I've just seen a number of cases, a number of people I've worked with that I think you could say Twitter, although I think I mean a broader thing, like just kind of like attaching your incentives very strongly to like the approval or cheering of a crowd. I think that can destroy your mind.
And in some cases it can destroy your soul. And so I think I kind of deliberately tried to be a little bit low profile because I want to, I don't know, kind of like defend my ability to think about things intellectually in a way that's different from other people and isn't kind of tinged by the approval of other people.
So, you know, I've seen cases of folks who are deep learning skeptics and they become known as deep learning skeptics on Twitter. And then even as it starts to become clear to me that they've kind of sort of changed their mind, this is their thing on Twitter and they can't change their Twitter persona and so forth and so on.
I don't really like the trend of kind of like personalizing companies, like the whole, you know, like cage match between CEOs approach.
Like I think it distracts people from the actual merits and concerns of, you know, the company in question. Like I kind of want people to like judge the like nameless bureaucratic institution.
You know, I want people to think in terms of the nameless bureaucratic institution and its incentives more than they think in terms of me. Everyone wants a friendly face, but actually I think friendly faces can be misleading.
Okay. Well, in this case, this will be a misleading interview because this has been a lot of fun.