Dario Amodei (Anthropic CEO) - Scaling, Alignment, & AI Progress

1h 58m

Here is my conversation with Dario Amodei, CEO of Anthropic.

Dario is hilarious and has fascinating takes on what these models are doing, why they scale so well, and what it will take to align them.

Watch on YouTube. Listen on Apple Podcasts, Spotify, or any other podcast platform. Read the full transcript here. Follow me on Twitter for updates on future episodes.

Timestamps

(00:00:00) - Introduction

(00:01:00) - Scaling

(00:15:46) - Language

(00:22:58) - Economic Usefulness

(00:38:05) - Bioterrorism

(00:43:35) - Cybersecurity

(00:47:19) - Alignment & mechanistic interpretability

(00:57:43) - Does alignment research require scale?

(01:05:30) - Misuse vs misalignment

(01:09:06) - What if AI goes well?

(01:11:05) - China

(01:15:11) - How to think about alignment

(01:31:31) - Is modern security good enough?

(01:36:09) - Inefficiencies in training

(01:45:53) - Anthropic’s Long Term Benefit Trust

(01:51:18) - Is Claude conscious?

(01:56:14) - Keeping a low profile



Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe


Transcript

Speaker 1 A generally well-educated human, that could happen in, you know, two or three years. What does that imply for Anthropic when, in two to three years, these leviathans are doing, like, ten-billion-dollar training runs? The models, they just want to learn. And it was a bit like a Zen koan. I listened to this and I became enlightened.

Speaker 1 The compute doesn't flow, the spice doesn't flow.

Speaker 1 It's like, the blob has to be unencumbered, right? The big acceleration that happened late last year and the beginning of this year, we didn't cause that. And honestly, I think if you look at the reaction of Google, that might be 10 times more important than anything else. There was a running joke that the way building AGI would look is, you know, there would be a data center next to a nuclear power plant next to a bunker. But now it's 2030. What happens next? What are we doing with a superhuman god?

Speaker 2 Okay, today I have the pleasure of speaking with Dario Amodei, who is the CEO of Anthropic. And I'm really excited about this one. Dario, thank you so much for coming on the podcast.

Speaker 1 Thanks for having me.

Speaker 2 First question, you have been one of the very few people who have seen scaling coming for years, more than five years.

Speaker 2 I don't know how long it's been, but as somebody who's seen it coming, what is fundamentally the explanation for why scaling works?

Speaker 2 Why is the universe organized such that if you throw big blobs of compute at a wide enough distribution of data, the thing becomes intelligent?

Speaker 1 I think the truth is that we still don't know. I think it's almost entirely an empirical fact.

Speaker 1 You know, I think it's a fact that you could kind of sense from the data and from a bunch of different places.

Speaker 1 But I think we don't still have a satisfying explanation for it. If I were to try to make one, but I'm just, I don't know, I'm just kind of waving my hands when I say this.
You know,

Speaker 1 there's this, there's these ideas in physics around like long tail or power law of like correlations or effects.

Speaker 1 And so like when a bunch of stuff happens, right, when you have a bunch of like features, you get a lot of the data in like kind of the early, you know, the

Speaker 1 fat part of the distribution before the tails. You know, for language, this would be things like, oh, I figured out there are parts of speech, and nouns follow verbs.

Speaker 1 And then there are these more and more and more and more subtle correlations.

Speaker 1 And so it kind of makes sense why there would be this, you know, every log or order of magnitude that you add, you kind of capture more of the distribution.

Speaker 1 What's not clear at all is why does it scale so smoothly with parameters? Why does it scale so smoothly with the amount of data?

Speaker 1 You can think up some explanations of why it's linear. Like the parameters are like a bucket, and so the data is like water.
And so the size of the bucket is proportional to the size of the water.

Speaker 1 But like, why does it lead to all these, this very smooth scaling? I think we still don't know. There's all these explanations.

Speaker 1 Our chief scientist, Jared Kaplan, did some stuff on like fractal manifold dimension that like you can use to explain it.

Speaker 1 So there's all kinds of ideas, but I feel like we just don't really know for sure.
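The smooth scaling being discussed here is usually summarized as a power law: loss falls as a power of parameter count toward an irreducible floor. Here is a minimal sketch of that functional form; all the constants below are invented for illustration and are not measured values from any real model.

```python
# Toy power-law scaling of loss with parameter count:
#   L(N) = (N_c / N) ** alpha + L_inf
# All constants are assumed, illustrative values only.
ALPHA = 0.076   # hypothetical scaling exponent
N_C = 8.8e13    # hypothetical scale constant
L_INF = 1.69    # hypothetical irreducible entropy of text

def predicted_loss(n_params: float) -> float:
    """Loss predicted by the toy power law at n_params parameters."""
    return (N_C / n_params) ** ALPHA + L_INF

# Every 10x in parameters removes a roughly constant fraction of the
# reducible loss, which is why the curve looks so smooth on a log axis.
for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```

The point is the shape, not the numbers: the aggregate loss (the "climate") follows the curve predictably, while specific downstream abilities (the "weather") do not.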

Speaker 2 And by the way, for the audience who's trying to follow along, by scaling, we're referring to the fact that you can very predictably see how, if you go from GPT-3 to GPT-4, or in this case, Claude 1 to Claude 2, the loss, in terms of whether it can predict the next token, scales very smoothly.

Speaker 2 So, okay, we don't know why it's happening, but can you at least empirically predict: here is the loss at which this ability will emerge, here is the place where this circuit will emerge? Is that at all predictable, or are you just looking at the loss number?

Speaker 1 That is much less predictable. What's predictable is this statistical average, this loss, this entropy, and it's super predictable.

Speaker 1 It's like, you know, predictable to like sometimes even to several significant figures, which you don't see outside of physics, right? You don't expect to see it in this messy empirical field.

Speaker 1 But actually, specific abilities are very hard to predict. So, you know, back when I was working on GPT-2 and GPT-3, like, when does arithmetic fall into place? When do models learn to code? Sometimes

Speaker 1 it's very abrupt. You know, it's kind of like you can predict statistical averages of the weather, but the weather on one particular day is very, you know, very, very hard to predict.

Speaker 2 So

Speaker 2 dumb it down for me. I don't understand manifolds, but mechanistically, it doesn't know addition yet.
Now it knows addition. What has happened?

Speaker 1 This is another question that we don't know the answer to. I mean, we're trying to answer this with things like mechanistic interpretability, but

Speaker 1 I'm not sure.

Speaker 1 I mean, you can think about these things in terms of circuits snapping into place, and there is some evidence for that: when you look at the models being able to add things, if you look at the chance of getting the right answer, that shoots up all of a sudden.

Speaker 1 But if you look at, okay, what's the probability of the right answer?

Speaker 1 You'll see it climb from like one in a million to one in a hundred thousand to one in a thousand long before it actually gets the right answer.

Speaker 1 And so there's some continuous, in many of these cases, at least, I don't know if in all of them, there's some continuous process going on behind the scenes. I don't understand it at all.
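The pattern Dario describes, where exact-match accuracy jumps suddenly while the probability of the right answer climbs steadily underneath, can be sketched by comparing two metrics on invented numbers (mirroring the one-in-a-million to one-in-a-thousand climb mentioned above).

```python
import math

# Hypothetical probabilities a model assigns to the correct answer
# at successive scales (invented, echoing the climb described above).
p_correct = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 0.3, 0.9]

for p in p_correct:
    log_prob = math.log(p)             # continuous metric: rises smoothly
    exact_match = 1 if p > 0.5 else 0  # thresholded metric: snaps on at the end
    print(f"p={p:<8g} log-prob={log_prob:8.2f} exact-match={exact_match}")
```

The continuous metric improves at every scale, while the thresholded one looks like a sudden snap into place; whether an ability appears abrupt can depend on which metric you plot.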

Speaker 2 Does that imply that the circuit or the process for doing addition was pre-existing and it just got increased in salience?

Speaker 1 I don't know if like there's this circuit that's weak and getting stronger. I don't know if it's something that works but not very well.
Like

Speaker 1 I think we don't know. And these are some of the questions we're trying to answer with mechanistic interpretability.

Speaker 2 Are there abilities that won't emerge with scale?

Speaker 1 So I definitely think that, again, like things like alignment and values are not guaranteed to emerge with scale, right?

Speaker 1 It's kind of like, you know, one way to think about it is you train the model and it is

Speaker 1 basically it's like predicting the world. It's understanding the world.
Its job is facts, not values, right? It's trying to predict what comes next.

Speaker 1 But there's just, there's free variables here where it's like,

Speaker 1 What should you do? What should you think? What should you value? There just aren't the bits for that.

Speaker 1 There's just like, well, if I started with this, I should finish with this. If I started with this other thing, I should finish with this other thing.

Speaker 1 And so I think that's not going to emerge.

Speaker 2 I want to talk about alignment in a second, but on scaling: if it turns out that scaling plateaus before we reach human-level intelligence, looking back on it, what would be your explanation?

Speaker 2 What do you think is likely to be the case if that turns out to be the outcome?

Speaker 1 Yeah, so I guess I would distinguish some problem with the fundamental theory with some practical issue. So one practical issue we could have is we could run out of data.

Speaker 1 For various reasons, I think that's not going to happen. But, you know, if you look at it very, very naively, we're not that far from running out of data.

Speaker 1 And so it's like we just don't have the data to

Speaker 1 continue the scaling curves. I think another way it could happen is like, oh, we just use up all of our compute that was available and that wasn't enough.
And then progress is slow after that.

Speaker 1 I wouldn't bet on either of those things happening, but they could.

Speaker 1 I think from a fundamental perspective,

Speaker 1 personally, I think it's very unlikely that the scaling laws will just stop. If they do, another reason, again, this isn't fully fundamental, could just be we don't have quite the right architecture.

Speaker 1 Like, if we tried to do it with an LSTM or an RNN, the slope would be different.

Speaker 1 It still might be that we get there, but I think there are some things that are just very hard to represent when you don't have this ability to attend far in the past that transformers have.

Speaker 1 If somehow, and I don't know how we would know this, it kind of wasn't about the architecture and we just hit a wall, I think I'd be very surprised by that.

Speaker 1 I think we're already at the point where the things the models can't do don't seem to me to be different in kind from the things they can do.

Speaker 1 And it just,

Speaker 1 you know, you could have made a case a few years ago that it was like they can't reason, they can't program. Like you could have,

Speaker 1 you could have drawn boundaries and said, well, maybe you'll hit a wall. I didn't think that.
I didn't think we would hit a wall.

Speaker 1 A few other people didn't think we would hit a wall, but it was a more plausible case then. I think it's a less plausible case now.
Now, it could happen. Like, this stuff is crazy.
Like,

Speaker 1 it could happen tomorrow that it's just like we hit a wall.

Speaker 1 I think if that happens, I'm trying to think of like, what's my, what would really be my, it's unlikely, but what would really be my explanation?

Speaker 1 I think my explanation would be there's something wrong with the loss when you train on next word prediction.

Speaker 1 Like some of the remaining reasoning abilities, or something like that. If you really want to learn to, you know, program at a really high level, it means you care about some tokens much more than others.

Speaker 1 And they're rare enough that the loss function over-focuses on the appearance, the things that are responsible for the most bits of entropy.

Speaker 1 And instead, you know, they don't focus on this stuff that's really essential. And so you could kind of have the signal drowned out in the noise.

Speaker 1 I don't think it's going to play out that way for a number of reasons.

Speaker 1 But if you told me, yep, you trained your 2024 model, it was much bigger, and it just wasn't any better, and you tried every architecture, and it didn't work,

Speaker 1 I think that's the explanation I would reach for.

Speaker 2 Is there a candidate for another loss function if you had to abandon next token prediction?

Speaker 1 I think then you would have to go for some kind of RL. And again, there's many different kinds.
There's RL from human feedback. There's RL against an objective.
There's things like constitutional AI.

Speaker 1 There's things like amplification and debate, right? These are kind of both alignment methods and ways of training models.

Speaker 1 You would have to try a bunch of things, but the focus would have to be on what do we actually care about the model doing, right?

Speaker 1 And in a sense, we're a little bit lucky that it's like predict the next word gets us all these other other things we need. Right.
There's no guarantee.

Speaker 2 It seems like, from your worldview, there's a multitude of different loss functions, and it's just a matter of which one allows you to throw a whole bunch of data at it.

Speaker 2 Like the next token prediction itself is not significant.

Speaker 1 Yeah. Well, I mean, I guess the thing with RL is you get slowed down a bit because it's like, you know, you have to, by some method, kind of, you know, design how the loss function works.

Speaker 1 Nice thing with the next token prediction is it's there for you, right? It's just there. It's the easiest thing in the world.

Speaker 1 And so I think it would slow you down if you couldn't scale in just that very simplest way.
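The "it's just there" quality of next-token prediction is that raw text supplies its own labels: the target at each position is simply the word that actually came next, so no reward has to be designed. A toy bigram model makes this concrete; this is a hedged sketch of the objective only, not how a transformer is actually trained.

```python
from collections import Counter, defaultdict
import math

# Raw text is both the input and the labels: the "label" at each
# position is just whatever word actually came next.
corpus = "the cat sat on the mat the cat ate".split()

bigram = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram[prev][nxt] += 1

def next_token_prob(prev: str, nxt: str) -> float:
    """Probability the toy model assigns to nxt following prev."""
    return bigram[prev][nxt] / sum(bigram[prev].values())

# Average cross-entropy (in nats) over the text: the training signal
# came for free from the corpus, with no human-designed reward.
pairs = list(zip(corpus, corpus[1:]))
loss = -sum(math.log(next_token_prob(p, n)) for p, n in pairs) / len(pairs)
print(f"avg next-token loss: {loss:.3f} nats")
```

An RL objective, by contrast, requires someone to specify how the loss works, which is the slowdown Dario mentions.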

Speaker 2 You mentioned that data is likely not to be the constraint.

Speaker 2 Why do you think that is the case?

Speaker 1 There's various possibilities here. And, you know, for a number of reasons, I shouldn't go into the details, but

Speaker 1 there's many sources of data in the world, and there's many ways that you can also generate data.

Speaker 1 My guess is that this will not be a blocker. Maybe it'd be better if it was, but it won't be.

Speaker 2 Are you talking about multimodal?

Speaker 1 There's just many different ways to do it.

Speaker 2 How did you form your views on scaling? How far back can we go where you would be basically saying something similar to this?

Speaker 1 This view that I have probably formed gradually from, I would say, like 2014 to 2017.

Speaker 1 So I think my first experience with it was my first experience with AI.

Speaker 1 So I saw some of the early stuff around AlexNet in 2012. I had always kind of wanted to study intelligence, but before, I was just like, this isn't really working.

Speaker 1 Like, it doesn't seem like it's actually working.

Speaker 1 You know, all the way back to, like, 2005, I'd read Ray Kurzweil's work.

Speaker 1 You know, I'd read even some of like Eliezer's work on the early, on the early internet back then. And I was like, ah, this, this stuff kind of looks far away.

Speaker 1 Like, I looked at the AI stuff of the day, and it was not anywhere close. But with AlexNet, I was like, oh, this stuff is actually starting to work.

Speaker 1 So I joined Andrew Ng's group at Baidu. You know, I'd been in a different field, and this was my first experience with AI.

Speaker 1 And it was a bit different from a lot of the kind of academic style research that was going on kind of elsewhere in the world, right?

Speaker 1 I think I kind of got lucky in that the task that was given to me and the other folks there was just make the best speech recognition

Speaker 1 system that you can. And there was a lot of data available.
There were a lot of GPUs available.

Speaker 1 So it kind of, it posed the problem in a way that was amenable to discovering that that kind of scaling was a solution, right?

Speaker 1 That's very different from like, you're a postdoc and it's your job to come up with, you know, what's the, what's the best, like, you know, what's, what's an idea that seems clever and new and makes your mark as someone who's invented something.

Speaker 1 And so I just quickly discovered that like, you know, I was just tried the simplest experiments. I was like, you know, just fiddling with some dials.
I was like, okay, try

Speaker 1 you know, try adding more layers to the RNN. Literally, add more layers.

Speaker 1 You know, try training it for longer. What happens? How long does it take to overfit? What if I add new data and repeat it fewer times? And I just saw these very consistent patterns.

Speaker 1 I didn't really

Speaker 1 know that this was unusual or that others weren't thinking in this way. This was just kind of like almost like beginner's luck.
It was my first experience with it.

Speaker 1 And I didn't really think about it beyond speech recognition, right?

Speaker 1 You know, I was just kind of like, oh, this is, you know, I don't know anything about this field.

Speaker 1 There's zillions of things people do with machine learning, but like, I'm like, weird, this seems to be true in the speech recognition field.

Speaker 1 And then I think it was, you know, just before OpenAI started that I met Ilya, whom you interviewed.

Speaker 1 One of the first things he said to me was, Look, the models, they just want to learn. You have to understand this.
The models, they just want to learn. And it was a bit like a Zen koan.

Speaker 1 Like, I kind of like, I listened to this and I became enlightened.

Speaker 1 And, you know, over the years after this, again, I would kind of be the one who would formalize a lot of these things and put them together.

Speaker 1 But what that told me is that the phenomenon I'd seen wasn't just some random thing. It was broad. It was more general, right?

Speaker 1 The models, the models just want to learn. You get the obstacles out of their way, right? You give them, you give them good data.
You, you give them enough space to operate in.

Speaker 1 You don't do something stupid like condition them badly numerically.

Speaker 1 And they want to learn. They'll do it.
They'll do it.

Speaker 2 You know what?

Speaker 2 What I find really interesting about what you said is there were many people who were aware back at that time, probably weren't working on it directly, but were aware that these things are really good at speech recognition or at playing these constrained games.

Speaker 2 Very few extrapolated from there, like you and Ilya did, to something that is generally intelligent.

Speaker 2 What was different about the way you were thinking about it, versus how others were thinking, that you went from, like, it's getting better at speech in this consistent way, to, it will get better at everything in this consistent way?

Speaker 1 Yeah, so I genuinely don't know. I mean, at first when I saw it for speech, I assumed this was just true for speech or for this narrow class of models.

Speaker 1 I think it was just over the period between 2014 and 2017, I tried it for a lot of things and saw the same thing over and over again. I watched the same being true with Dota.

Speaker 1 I watched the same being true with robotics, which many people thought of as a counterexample, but I just thought, well, it's hard to get data for robotics, but if we operate within, if we look within the data that we have, we see the same patterns.

Speaker 1 And so

Speaker 1 I don't know. I think people were very focused on solving the problem in front of them.
Why one person thinks one way and another person thinks another, it's very hard to explain.

Speaker 1 I think people just see it through a different lens, you know, are looking like vertically instead of horizontally. They're not thinking about the scaling.

Speaker 1 They're thinking about how do I solve my problem? And, well, for robotics, there's not enough data.

Speaker 1 And so, you know, that can easily abstract to, well, scaling doesn't work because we don't have the data. And so

Speaker 1 I don't know. I just, for some reason, and it may just, it may just have been random chance, was obsessed with that particular direction.

Speaker 2 When did it become obvious to you that language is the means to just feed a bunch of data into these things?

Speaker 2 Or was it just you ran out of other things? Like robotics, there's not enough data. This other thing, there's not enough data.

Speaker 1 Yeah, I mean, I think this whole idea of like the next word prediction that you could do self-supervised learning, you know, that together with the idea that it's like, wow, for predicting the next word, there's so much richness and structure there, right?

Speaker 1 You know, it might say two plus two equals and you have to know the answer is four.

Speaker 1 And, you know, it might be telling the story about a character and then basically it's it's posing to the model, you know, the, the equivalent of these developmental tests that get posed to children.

Speaker 1 You know, Mary walks into the room and, you know, puts an item in there. And then, you know, Chuck walks into the room and removes the item, and Mary doesn't see it.
What does Mary think happened?

Speaker 1 So the models are going to have to, to get this right in the service of predicting the next word, they're going to have to solve, you know, solve all these theory of mind problems, solve all these math problems.

Speaker 1 And so, you know, my thinking was just, well, you scale it up as much as you can. There's kind of no limit to it.

Speaker 1 And I think I kind of had abstractly that view. But the thing, of course, that like really

Speaker 1 solidified and convinced me was the work that Alec Radford did on GPT-1,

Speaker 1 which was not only could you get this language model that could predict things very well, but also you could fine-tune it. You needed to fine-tune it in those days to do all these other tasks.

Speaker 1 And so I was like, wow, you know, this isn't just some narrow thing where you get the language model right. It's sort of halfway to everywhere, right?

Speaker 1 It's like, you know, you get the language model right. And then with a little move in this direction, it can, you know, it can solve this, this, you know, logical dereference test or whatever.

Speaker 1 And, you know, with with this other thing,

Speaker 1 it can solve translation or something. And then you're like, wow, I think there's really something to do.
And of course, we can really scale it.

Speaker 2 One thing that's confusing, or that would have been hard to see: if you told me in 2018, we'll have models in 2023, like Claude 2, that can write theorems in the style of Shakespeare, or whatever you want.

Speaker 2 They can ace standardized tests with open-ended questions,

Speaker 2 just all kinds of really impressive things. I would have said at that time, oh, you have AGI. You clearly have something that is a human-level intelligence.
You clearly have something that is a human-level intelligence.

Speaker 2 Whereas, while these things are impressive, it clearly seems we're not at human level, at least in the current generation and potentially for generations to come.

Speaker 2 What explains this discrepancy between super impressive performance on these benchmarks, and just, like, the things you described, versus general intelligence?

Speaker 1 So that was one area where actually I was not prescient, and I was surprised as well. Yeah.

Speaker 1 So when I first looked at GPT-3 and, you know, more so the kind of things that we built in the early days at Anthropic,

Speaker 1 my general sense was: I, you know, I looked at these and I'm like, it seems like they've really grasped the essence of language. I'm not sure how much we need to scale them up.

Speaker 1 Like, maybe we, maybe what's more needed from here is like RL and

Speaker 1 kind of all the other stuff.

Speaker 1 Like, we might be kind of near the... you know, I thought in 2020, we can scale this a bunch more, but I wonder if it's more efficient to scale it more, or to start adding on these other objectives like RL. I thought maybe if you do as much RL

Speaker 1 as you've done pre-training for a, you know, 2020-style model, that's the way to go, and scaling it up will keep working. But is that really the best path? And I don't know, it just keeps going. I thought it had understood a lot of the essence of language, but then there's kind of further to go. And so, I don't know, stepping back from it, one of the reasons why I'm sort of very empiricist about

Speaker 1 AI, about safety, about organizations, is that you often get surprised, right?

Speaker 1 You know, I feel like I've been right about some things, but I've still, you know, with these theoretical pictures ahead, been wrong about most things.

Speaker 1 Being right about 10% of the stuff sets you head and shoulders

Speaker 1 above many people.

Speaker 1 If you look back to, I can't remember who it was, kind of, you know, made these diagrams that are like, you know, here's, here's the village idiot, here's Einstein, here's the scale of intelligence, right?

Speaker 1 And the village idiot and Einstein are like very close to each other. Like that, maybe that's still true in some abstract sense or something, but it's not really what we're seeing, is it?

Speaker 1 We're seeing like that it seems like the human range is pretty broad and doesn't, we don't hit the human range in the same place or at the same time for different tasks, right?

Speaker 1 Like, you know, write a sonnet in the style of Cormac McCarthy or something. Like, I don't know, I'm not very creative, so I couldn't do that.

Speaker 1 But, like, you know, that's that's a pretty high-level human skill, right? Um, and even the model is starting to get good at stuff of, you know, like constrained writing.

Speaker 1 You know, there's this, like, write a page about X without using the letter E.

Speaker 1 Like, I think the models might be like superhuman or close to superhuman at that.

Speaker 1 But when it comes to, you know,

Speaker 1 yeah, I don't know, prove relatively simple mathematical theorems, like they're, they're just starting to do the beginning of it.

Speaker 1 They make really dumb mistakes sometimes, and they really lack any kind of broad, like, you know,

Speaker 1 correcting your errors or doing some extended task. And so, I don't know, it turns out that intelligence isn't a spectrum.
There are a bunch of different areas of domain expertise.

Speaker 1 There are a bunch of different kinds of skills. Like, memory is different. I mean,

Speaker 1 it's all formed in the blob. It's not complicated, but to the extent it even is a spectrum, the spectrum is also wide.

Speaker 1 If you asked me 10 years ago, that's not what I would have expected at all, but I think that's very much the way it's turned out.

Speaker 2 Oh, man. I have so many questions.

Speaker 2 Just as a follow-up on that, one is: do you expect that, given the distribution of training that these models get from massive amounts of internet data, versus what humans got from evolution, the repertoire of skills it elicits will be just barely overlapping, or will it be like concentric circles? How do you think about that? Do those differences matter?

Speaker 1 Clearly,

Speaker 1 there's certainly a large amount of overlap, right?

Speaker 1 Because a lot of the things, you know, like these models have business applications, and many of their business applications are doing things that, you know, or helping humans to be more effective at things.

Speaker 1 So the overlap is quite large. And, you know, if you think of all the activity that humans put on the internet in text, that covers a lot of it.
But it probably doesn't cover some things.

Speaker 1 Like the models, I think they do learn a physical model of the world to some extent, but they certainly don't learn how to actually move around in the world.

Speaker 1 Again, maybe that's easy to fine-tune. But

Speaker 1 I think there are some things that the models don't learn that humans do. And then I think

Speaker 1 the models learn, for example, to speak fluent base 64. I don't know about you, but I never learned that.

Speaker 2 How likely do you think it is that these models will be superhuman for many years at economically valuable tasks while they are still below humans in many other relevant tasks that prevents like an intelligence explosion or something?

Speaker 1 I think this kind of stuff is like really hard to know.

Speaker 1 So I'll give that caveat that like, you know, again, like the basic scaling laws you can kind of predict.

Speaker 1 And then like this more granular stuff, which we really want to know to know how this all is going to go, is much harder to know.

Speaker 1 But my guess would be the scaling laws are going to continue, you know, subject to whether people slow down for safety or regulatory reasons. But let's just put all that aside and say we have the economic capability to keep scaling. If we did that, what would happen? I think my view is we're going to keep getting better across the board, and I don't see any area where the models are, like, super weak or not starting to make progress. That used to be true of, like, math and programming, but I think over the last six months, you know, the 2023 generation of models, compared to the 2022 generation, has started to learn that.

Speaker 1 There may be more subtle things we don't know. And so I kind of suspect, even if it isn't quite even, that the rising tide will lift all the boats.

Speaker 2 Does that include the thing you were mentioning earlier, where if there's an extended task, it kind of loses its train of thought

Speaker 2 or its ability to just like execute a series of tasks?

Speaker 1 So I think that that's going to depend on things like RL training to have the model do longer horizon tasks. I don't expect that to require a substantial amount of additional compute.

Speaker 1 I think that

Speaker 1 that was probably an artifact of, yeah, kind of thinking about RL in the wrong way and underestimating how much the model had learned on its own.

Speaker 1 In terms of, you know, are we going to be superhuman in some areas and not others? I think it's complicated.

Speaker 1 I could imagine that we won't be superhuman in some areas because, for example, they involve like embodiment in the physical world. And then it's like, what happens?

Speaker 1 Like, do the AIs help us train faster AIs and those faster AIs wrap around and solve that? Do you not need the physical world? It depends what you mean. Are we worried about an alignment disaster?

Speaker 1 Are we worried about misuse, like making weapons of mass destruction? Are we worried about the AI,

Speaker 1 or, you know, the AI taking over research from humans? Are we worried about it reaching some threshold of economic productivity where it can do what the average human can?

Speaker 1 These different thresholds, I think, have different answers. Although I suspect they will all come within a few years.

Speaker 2 Let me ask about those thresholds. So if Claude was an employee at Anthropic, what salary would it be worth? Is it like meaningfully speeding up AI progress?

Speaker 1 It feels to me like an intern in most areas,

Speaker 1 but then some specific areas where it's better than that. Again, I think one thing that makes the comparison hard is like the form factor is kind of like not the same as a human, right?

Speaker 1 Like a human, like, you know, if you were to behave like one of these chatbots, like we wouldn't really, I mean, I guess we could have this conversation.

Speaker 1 It's like, but, you know, they're, they're not really, they're more designed to answer single or a few questions, right?

Speaker 1 And like, you know, they don't have the concept of having a long life of prior experience, right? We're talking here about, you know, things that

Speaker 1 I've experienced in the past, right? And chatbots don't don't have that. And so there's, there's all kinds of stuff missing.

Speaker 2 And so it's hard to make a comparison but i don't know it it they feel like interns in some areas and kind of then they have areas where they spike and are really savants where they may be better than they may be better than anyone here but does the overall picture of something like an intelligence explosion you know my my former guest is carl schoman and he has this like very detailed model of an intelligence does that as somebody who would actually like see that happening does that make sense to you as they go from interns to entry-level software engineers those entry-level software engineers increase your productivity.

Speaker 1 I think the idea that

Speaker 1 the AI systems become more productive and first they speed up the productivity of humans, then they kind of equal the productivity of humans,

Speaker 1 and then they're in some meaningful sense the main contributor to scientific progress, that that happens at some point.

Speaker 1 I think that that basic logic seems likely to me, although I have a suspicion that when we actually go into the details, it's going to be kind of like weird and different than we expect.

Speaker 1 That all the detailed models are kind of,

Speaker 1 you know, we're thinking about the wrong things, or we're right about one thing, and then are wrong about 10 other things. And so, I don't know.

Speaker 1 I think we might end up in like a weirder world than we expect.

Speaker 2 When you add all this together, like your estimate of when we get something kind of human level, what does that look like?

Speaker 1 I mean, again, it depends on the thresholds.

Speaker 1 You know, in terms of someone looks at these, the model, and, you know, even if you talk to it for, you know, for an hour or so,

Speaker 1 you know, it's basically like a generally well-educated human. Yeah.

Speaker 1 That could be not very far away at all, I think.

Speaker 1 Like that, that could happen in two or three years. Like,

Speaker 1 you know, if I look at, again, like, I think the main thing that would stop it would be if if we hit certain, certain, you know, and we have internal tests for, you know, safety thresholds and stuff like that.

Speaker 1 So if a company or the industry decides to slow down or, you know, we're able to get the government to institute restrictions that kind of, you know, that moderate the rate of progress for safety reasons, that would be the main reason it wouldn't happen.

Speaker 1 But if you, if you just look at the logistical and economic ability to scale, I don't think we're very far at all from that.

Speaker 1 Now, that, that may not be the threshold where the models are existentially dangerous. In fact, I suspect it's not quite there yet.

Speaker 1 It may not be the threshold where the models can take over most AI research. It may not be the threshold where the models

Speaker 1 seriously change how the economy works.

Speaker 1 I think it gets a little murky after that. And all those thresholds may happen at various times after that.
But I think, in terms of the base technical capability,

Speaker 1 it kind of sounds like a reasonably generally educated human across the board. I think that could be quite close.

Speaker 2 Why would it be the case that it could be sound, you know, pass a Turing test for an educated person, but not be able to contribute or substitute for human involvement in the economy?

Speaker 1 A couple reasons. One is just

Speaker 1 that the threshold of skill isn't high enough, right? Comparative advantage. It's like it doesn't matter that I have someone who's better than the average human at every task.

Speaker 1 Like what I really need is for AI research, like, you know, I need what, you know, I need to basically find something that is strong enough to substantially accelerate, you know, the like labor of the thousand experts who are best at it.

Speaker 1 And so we might reach a point where we, you know, the comparative advantage of these systems is not, is not great.

Speaker 1 Another thing that could be the case is that I think there are these kind of mysterious frictions that like, you know, kind of don't show up in naive economic models, but you see it whenever whenever you're like, you know, when you go to a customer or something and you're like, hey, I have this cool chat bot.

Speaker 1 In principle, it can do everything that, you know, your customer service bot does or that this part of your company does. But like the actual friction of like, how do we slot it in?

Speaker 1 How do we make it work?

Speaker 1 That, that includes both kind of like, you know, just the question of how it works in a human sense within the company, like, you know, how, how, how things happen in the economy and overcome frictions.

Speaker 1 And also just like, what is the workflow? How do you actually interact with it? It's very different to say, here's a chat bot that kind of looks like it's doing this task that

Speaker 1 or, you know, or helping the human to do, to do some task, as it is to say, like, okay, this thing is, this thing is deployed and 100,000 people are using it.

Speaker 1 Often, like right now, lots of folks are rushing to deploy these systems, but I think in many cases, they're not using them in anywhere close to the most efficient way that they could.

Speaker 1 You know, not because they're not smart, but because it takes time to work these things out. And so I think when things are changing this fast, there are going to be all of these frictions.
Yeah.

Speaker 1 And I, and I think, again, these are messy reality that doesn't quite get captured in the model. I don't think it changes the basic picture.
Like, I don't think it changes the idea that we're...

Speaker 1 we're building up this snowball of like the models help the models get better and you know do what the humans and and you know can can accelerate what the humans do and eventually it's mostly the models doing the work like you zoom out far enough that's happening But I'm kind of skeptical of kind of any kind of precise mathematical or exponential prediction of how it's going to be.

Speaker 1 I think it's all going to be a mess, but I think what we know is it's on a metaphorical exponential and it's going to happen fast.

Speaker 2 How do those different exponentials net out, which we've been talking about? So, one was

Speaker 2 the scaling laws themselves are power laws with decaying marginal

Speaker 2 loss per parameter or something. The other exponential you talked about is, well, these things can get involved in the process of AI research itself, speeding it up.

Speaker 2 So those two are sort of opposing exponentials. Does it net out to be super linear or sublinear? And also, you mentioned, well, the distribution of intelligence might just be broader.

Speaker 2 So should we expect

Speaker 2 after we get to this point in two to three years, it's like, vom, vom, like, what does that look like?

Speaker 1 It's, I mean, I think it's very unclear, right? So we're already at the point where, if you look at the loss, the scaling laws are starting to bend.

Speaker 1 I mean, we've seen that in published model cards offered by multiple companies. So that's not a secret at all.

Speaker 1 But as they start to bend, each little bit of entropy, right, of accurate prediction becomes more important, right?

Speaker 1 Maybe these last little bits of entropy are like, well, you know, this is a physics paper as Einstein would have written it, as opposed to, you know, as some other physicist

Speaker 1 would have written it. And so it's hard to assess significance from this.

Speaker 1 It certainly looks like in terms of practical performance, the metrics keep going up relatively linearly, although they're always unpredictable.

Speaker 1 So it's hard to see that. And then,

Speaker 1 I mean, the thing that I think is driving the most acceleration is just more and more money is going into the field. Like people are seeing that there's just a huge amount of

Speaker 1 economic value. And so I expect the price, the amount of money spent on the largest models to go up by like a factor of 100 or something.

Speaker 1 And for that then to be concatenated with the chips are getting faster, the algorithms are getting better because there's so many people working on this now. And so, and so, again, I mean,

Speaker 1 I'm not making a normative statement here, this is what should happen.

Speaker 1 I'm not even saying this necessarily will happen because I think there's important safety and government questions here, which we're very actively working on.

Speaker 1 I'm just saying, like, left to itself, this is what the economy is going to do.

Speaker 2 We'll get to those questions in a second, but how do you think about the contribution of Anthropic to that increasing in the scope of this industry?

Speaker 2 Where, I mean, there's an argument you make that listen, with that investment, we can work on safety stuff at Anthropic. Another that says you're raising the salience of this field in general.

Speaker 1 Yeah, I mean, it's all costs and benefits, right? The costs are not zero, right?

Speaker 1 So I think a mature way to think about these things is, you know, not to deny that there are any costs, but to think about what the costs are and what the benefits are.

Speaker 1 You know, I think we've been relatively responsible in the sense that, you know, the big acceleration that happened late last year and beginning of this year,

Speaker 1 we didn't cause that.

Speaker 1 We weren't the ones who did that. And honestly, I think if you look at the reaction of Google, that that might be 10 times more important than anything else.

Speaker 1 And then kind of once it had happened, once the ecosystem had changed, then we did a lot of things to kind of stay on the frontier.

Speaker 1 And so, I don't know,

Speaker 1 it's like any other question, right? It's like

Speaker 1 you're trying to do the things that have the biggest costs and that have the lowest costs and the biggest benefits.

Speaker 1 And

Speaker 1 that causes you to have different strategies at different times.

Speaker 2 One question I had for you while we were talking about the intelligence stuff was, listen, as a scientist yourself,

Speaker 2 what do you make of the fact that these things have basically the entire corpus of human knowledge memorized?

Speaker 2 And as far as I'm aware, they haven't been able to make like a single new connection that has led to a discovery.

Speaker 2 Whereas if even a moderately intelligent person had this much stuff memorized, they'd notice, oh, this thing causes this symptom. This other thing also causes this symptom.

Speaker 2 There's a medical medical cure right here, right?

Speaker 2 Shouldn't we be expecting that kind of stuff?

Speaker 1 I'm not sure. I mean, I think,

Speaker 1 you know, I don't know, these words, discovery, creativity, like it's one of the lessons I've learned is that

Speaker 1 in kind of the big blob of compute, often these ideas often end up being kind of fuzzy and elusive and hard to track down. But I think there is something here, which is,

Speaker 1 I think the models do display a kind of ordinary creativity. Again, again, you know, the kind of like, you know,

Speaker 1 write a sonnet, you know, in the style of Cormac McCarthy or Barbie or something, you know, like there is some creativity to that.

Speaker 1 And I think they do draw, you know, new connections of the kind that an ordinary person would draw.

Speaker 1 I agree with you that there haven't been any kind of like, I don't know, like, I would say like big scientific discoveries.

Speaker 1 I think that's a mix of like, just the model's skill level is not, is not high enough yet, right? Like I was on a podcast last week where

Speaker 1 the host said, I don't know, I played with these models. They're kind of mid, right? Like they get, you know, they get a B or a B minus or something.

Speaker 1 And that, I think, is going to change with the scaling. I do think there's an interesting point about, well, the models have an advantage, which is they know a lot more than us.

Speaker 1 You know, like, should they have an advantage already, even if, even if their skill level isn't quite high. Maybe that's kind of what you're getting at.
I don't really have an answer to that.

Speaker 1 I mean, it seems certainly like memorization and facts and drawing connections is an area where the models are ahead.

Speaker 1 And I do think maybe you need those connections and you need a fairly high level of skill.

Speaker 1 I do think, particularly in the area of biology, for better and for worse, the complexity of biology is such that the current models know a lot of things right now.

Speaker 1 And that's what you need to make discoveries and draw. It's not like physics where you need to, you know, you need to think and come up with a formula.
In biology, you need to know a lot of things.

Speaker 1 And so I do think the models know a lot of things and they have a skill level that's not quite high enough to put them together.

Speaker 1 And I think they are, they are just on the cusp of being able to put these things together.

Speaker 2 On that point, last week in your Senate testimony, you said that these models are two to three years away from potentially enabling large-scale bioterrorism attacks or something like that.

Speaker 2 Can you make that more concrete without obviously giving the kind of information that would

Speaker 2 but is it like one-shotting how to weaponize something? Is it or do you had to fine-tune an open source model? Like, what would that actually look like?

Speaker 1 I think it'd be good to clarify this because we did a blog post in the Senate testimony and like, I think various people kind of didn't understand the point or didn't understand what we'd done.

Speaker 1 So I think today, and you know, of course, in our models, we try and prevent this, but there's always jail breaks.

Speaker 1 You can ask the models all kinds of things about biology and get them to say all kinds of scary things.

Speaker 1 But often those scary things are things that you could Google. And I'm therefore not particularly worried about that.

Speaker 1 I think it's actually an impediment to seeing the real danger where, you know, someone just says, oh, I asked this model, like, you know, for the smallpox, you know, for to tell me some things about smallpox and it will.

Speaker 1 That is actually, you know, kind of not what I'm worried about. So we spent about six months working with some of basically some of the folks who are the most expert in the world on

Speaker 1 how do biological attacks happen?

Speaker 1 You know, what would you need to conduct such an attack and how do we defend against such an attack?

Speaker 1 They worked very intensively on just the entire workflow of if if I were trying to do a bad thing, it's not one shot. It's a long process.
There are many steps to it.

Speaker 1 It's not just like I asked the model for this one page of information.

Speaker 1 And again, without going into any detail, the thing I said in the Senate testimony is like, there's some steps where you can just get information on Google.

Speaker 1 There are some steps that are what I'd call missing: they're scattered across a bunch of textbooks, or they're not in any textbook. They're kind of implicit knowledge, not explicit knowledge.

Speaker 1 They're more like, I have to do this lab protocol. And like, what if I get it wrong? Oh, if this happened, then my temperature was too low.

Speaker 1 If that happened, I needed to add more of this particular reagent. What we found is that for the most part,

Speaker 1 those key missing pieces, the models can't do them yet. But we found that sometimes they can.

Speaker 1 And when they can, sometimes they still hallucinate, which is the thing that's kind of keeping us safe. But we saw enough signs of the models doing those key things well.

Speaker 1 And if we look at, you know, state-of-the-art models and go backwards to previous models, we look at the trend, it shows every sign of two or three years from now,

Speaker 1 we're going to have a real problem.

Speaker 2 Yeah, especially the thing you mentioned on the log scale, you go from like one in 100 times it gets a right to one in 10 to.

Speaker 1 Exactly. So, you know, I've seen many of these like Groks in my life, right?

Speaker 1 I was there when I watched when GPT-3 learned to do arithmetic, when GPT-2 learned to do regression a little bit above chance, when, you know, when we got, you know, with Claw and we got better on like, you know,

Speaker 1 all these tests of helpful, honest, harmless. I've seen a lot of Groks.
This is, this is unfortunately not one that I'm excited about, but I believe it's happening.

Speaker 2 So somebody might say, listen, you were a co-author on this post that OpenAI released about GPT-2, where they said, you know, we're not going to release the weights or the details here because we're worried that this model will be used for something, you know, bad.

Speaker 2 And looking back on it, now it's laughable to think that GPT-2 could have done anything bad. Are we just like way too worried? This is a concern that doesn't make sense for it is interesting.

Speaker 1 It might be worth looking back at the actual text of that post.

Speaker 1 So I don't remember it exactly, but it should, you know, it's still up on the internet.

Speaker 1 It says something like, you know, we're choosing not to release the weights because of concerns about misuse, but it also said, this is an experiment.

Speaker 1 We're not sure if this is necessary or the right thing to do at this time, but we'd like to establish a norm of thinking carefully about these things.

Speaker 1 You know, you could think of it a little like the, you know, the Sillimer conference in the 1970s, right? Where it's like, you know, they were just figuring out recombinant DNA.

Speaker 1 You know, it was not necessarily the case that someone could do something really bad with recombinant DNA. It's just the possibilities were starting to become clear.

Speaker 1 Those words, at least, were the the right attitude. Now, I think there's a separate thing that, like,

Speaker 1 you know, people don't just judge the post, they judge the organization. Is this an organization that, you know, is produces a lot of hype or that has credibility or something like that?

Speaker 1 And so I think that had some effect on it. I guess you could also ask, like, is it inevitable that people would just interpret it as like,

Speaker 1 you know, you can't get across any message more complicated than this thing right here is dangerous.

Speaker 1 So you can argue about those, but I think the the basic thing that was in my head and the head, the head of others

Speaker 1 who were involved in that, and I think

Speaker 1 what is evident in the post is like, we actually don't know. We have pretty wide error bars on what's dangerous and what's not.

Speaker 1 So we should, you know, like, we want to establish a norm of being careful. I think, by the way, we have enormously more evidence.
We've seen enormously more of these Groks now.

Speaker 1 And so we're well calibrated, but there's still uncertainty, right? In all these statements, I said, like, in two or three years, we might be there, right?

Speaker 1 There's a substantial risk of it, and we don't want to take that risk. But, you know, I wouldn't say it's, it's 100%.
It could be 50-50.

Speaker 2 Okay, let's talk about cybersecurity, which, in addition to bio-risk, is another thing Anthropic has been emphasizing. How have you avoided the cloud micro architecture from leaking?

Speaker 2 Because as you know, your competitors have been less successful at this kind of security.

Speaker 1 Can't comment on anyone else's security. Don't know what's going on in there.
A thing that we have done is, you know, this: there are these architectural innovations, right, that make training more efficient. We call them compute multipliers, because they're the equivalent of having more compute. Again, I don't want to say too much about our compute multipliers, because it could allow an adversary to counteract

Speaker 1 our measures, but we limit the number of people who are aware of

Speaker 1 a given compute multiplier to those who need to know about it.

Speaker 1 And so there's a very small number of people who could leak all of these secrets. There's a larger number of people who could leak one of them.

Speaker 1 But, you know, this is the standard compartmentalization strategy that's used in the intelligence community or

Speaker 1 resistant cells or whatever.

Speaker 1 So

Speaker 1 we've over the last

Speaker 1 few months, we've implemented these measures. So I don't want to jinx anything by saying, oh, this could never happen to us.
But I think it would be harder for it to happen.

Speaker 1 I don't want to go into any more detail. And, you know, by the way, I'd encourage all the other companies to do this as well.

Speaker 1 As much as

Speaker 1 competitors' architectures leaking is narrowly helpful to Anthropic, it's not good for anyone in the long run, right?

Speaker 1 So security around this stuff is really important.

Speaker 2 Even with all the security you have, could you, with your current security, prevent a dedicated state-level actor from getting the Claw 2 weights?

Speaker 1 It depends how dedicated is what I would say.

Speaker 1 Our head of security, who

Speaker 1 used to work on security for Chrome, which

Speaker 1 very widely used and attacked application. He likes to think about it in terms of how much would it cost to attack Anthropic successfully.

Speaker 1 Again, I don't want to go into super detail of how much I think it will cost to attack, and it's just kind of inviting people, but like one of our goals is that it costs more to attack Anthropic than it costs to just train your own model,

Speaker 1 which doesn't guarantee things because, you know, of course, you need the talent as well. So you might still, but, you know, but attacks have

Speaker 1 risks, diplomatic costs. And, you know,

Speaker 1 and they use up the very sparse resources that nation state actors might have in order

Speaker 1 to do the attacks.

Speaker 1 So we're not there yet, by the way,

Speaker 1 but

Speaker 1 I think we're to a very high standard compared to the size of company that we are. Like, I think if you look at security for most 150-person companies, like, I think there's just no comparison.

Speaker 1 But, you know,

Speaker 1 could we resist if it was a state actor's top priority to steal our model weights? No, they would succeed.

Speaker 2 How long does that stay true? Because at some point, the value keeps increasing and increasing.

Speaker 2 And another part of this question is that what kind of a secret is how to train cloud three or cloud two? Is it, you know, with nuclear weapons, for example, we have lots of spies.

Speaker 2 You just take a blueprint across and that's you, the implosion device, and that's what you need. Here, is it just more tacit, like the thing you were talking about with biology?

Speaker 2 You need to know how these reagents work. Is it just like you got the blueprint, you got the microarchitecture and the hyper parameters?

Speaker 1 There are some things that are like, you know, a one-line equation, and there are other things that are more complicated. Yeah.
And I think compartmentalization is the best way to do it: just limit the number of people who know about something. If you're a thousand-person company and everyone knows every secret, one, I guarantee you have a leaker, and two, I guarantee you have a spy, like a literal spy.
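The compartmentalization argument above can be made concrete with a toy calculation: if everyone knows everything, one leaker exposes every secret, while under need-to-know a single leaker exposes only their own compartment. All numbers and names here are invented for illustration, not a description of Anthropic's actual structure.

```python
# Toy model of need-to-know compartmentalization: expected number of
# secrets exposed if exactly one randomly chosen employee leaks.
# Every quantity here is invented for illustration.

def expected_secrets_exposed(num_people, num_secrets, people_per_secret):
    """Average number of secrets a single random leaker can expose.

    Each secret is known by `people_per_secret` people, so a random
    employee knows (people_per_secret / num_people) of the secrets
    in expectation.
    """
    prob_knows_secret = people_per_secret / num_people
    return num_secrets * prob_knows_secret

# Everyone knows everything: one leaker exposes all 20 secrets.
print(expected_secrets_exposed(1000, 20, 1000))  # → 20.0

# Need-to-know, 5 people per secret: one leaker exposes ~0.1 on average.
print(expected_secrets_exposed(1000, 20, 5))
```

The same asymmetry Dario describes falls out of the arithmetic: a small circle per secret means a single leak, or a single spy, compromises very little.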

Speaker 2 Okay, let's talk about alignment. And let's talk about mechanistic interpretability, which is the branch of which you guys specialize in.

Speaker 2 While you're answering this question, you might want to explain what mechanistic interpretability is. But just the broader question is, mechanistically, what is alignment?

Speaker 2 Is it that you're locking in the model into a benevolent character? Are you disabling deceptive circuits and procedures? Like what concretely is happening when you align a model?

Speaker 1 I think as with most things, you know, when we actually train a model to be aligned, we don't know what happens inside the model.

Speaker 1 right there are different ways of training it to be aligned but i think we don't really know what happens i mean i think for some of the current methods, I think all the current methods that involve some kind of fine-tuning, of course, have the property that the underlying knowledge and abilities that we might be worried about don't disappear.

Speaker 1 It's just, you know, the model is just taught not to output them. I don't know if that's a fatal flaw or

Speaker 1 if that's just the way things have to be.

Speaker 1 I don't know what's going on inside mechanistically, and I think that's the whole point of mechanistic interpretability, to really understand what's going on inside the models at the level of individual circuits.

Speaker 2 Eventually, when it's solved, what does a solution look like? What is it the case where if you're Cloud4, you do the mechanistic attributally thing and you're like, I'm satisfied, it's aligned.

Speaker 2 What is it that you've seen?

Speaker 1 Yeah, so

Speaker 1 I think we don't know that yet. I think we don't know enough to know that yet.
I mean, I can give you a sketch for like what the process looks like as opposed to what the final result looks like.

Speaker 1 So, I think verifiability is a lot of the challenge here, right? We have all these methods that purport to align AI systems and do succeed at doing so for today's tasks.

Speaker 1 But then the question is always, if you had a more powerful model or if you had a model in a different situation,

Speaker 1 would it be aligned? And so I think this problem would be much easier. If you had an Oracle that could just scan a model and say, like, okay, I know this model is aligned.

Speaker 1 I know what it'll do in every situation.

Speaker 1 Then the problem would be much easier. And I think the closest thing we have to that is something like mechanistic interpretability.
It's not anywhere near up to the task yet.

Speaker 1 But I guess I would say I think of it as almost like an extended training set and an extended test set, right?

Speaker 1 Everything we're doing, all the alignment methods we're doing are the training set, right?

Speaker 1 You can run tests in them, but will it really work out a distribution? Will it really work in another situation?

Speaker 1 Mechanistic interpretability is the only thing that even in principle, and we're nowhere near there yet, but even in principle is the thing where it's like, it's more like an x-ray of the model than a modification of of the model, right?

Speaker 1 It's more like an assessment than an intervention.

Speaker 1 And so somehow we need to get into a dynamic where we have an extended test set, an extended training set, which is all these alignment methods, and an extended test set, which is kind of like

Speaker 1 you x-ray the model and say,

Speaker 1 okay, what works and what didn't, in a way that goes beyond just the empirical tests that

Speaker 1 you've run, right?

Speaker 1 Where you're saying,

Speaker 1 what is the model going to do in these situations? What is it within its capabilities to do instead of what did it do phenomenologically? And of course, we have to be careful about that, right?

Speaker 1 One of the things I think is very important is we should never train for interpretability because I think

Speaker 1 that's taking away that advantage, right? You even have the problem, you know, similar to like validation versus test set, where like, if you look at the x-ray too many times, you can interfere.

Speaker 1 But I think that's a much weaker optimum. We should worry about that, but that's a much weaker process.

Speaker 1 It's not automated optimization we should just make sure as with validation and test sets that we don't look at the validation set too many times before running the test set but you know that's again that's that's more of a that's that's manual pressure rather than automated pressure and so some solution where it's like we have some dynamic between the training and test set where it's like we're we're trying things out and we we we really figure out if they work via way of testing them that the model isn't optimizing against some some orthogonal way.

Speaker 1 Like,

Speaker 1 if I think of, and I think we're never going to have a guarantee, but some process where we do those things together, again, not in a stupid way.

Speaker 1 There's lots of stupid ways to do this where you fool yourself, but like

Speaker 1 some way to put extended training for alignment ability with extended testing for alignment ability together in a way that actually works.

Speaker 2 I still don't feel like I understand the intuition that

Speaker 1 why you think this is likely to work or this is a promising thing to pursue.

Speaker 2 And let me ask the question in a sort of more specific way and excuse the tortured analogy. But listen, if you're an economist and you want to understand the economy,

Speaker 2 so you send a whole bunch of microeconomists out there and one of them studies how the restaurant business works, one of them studies how the tourism business works,

Speaker 2 one of them studies how the baking works. And at the end, they all come together and you still don't know whether there's going to be a recession in five years or not.

Speaker 2 Why is this not like that, where you have an understanding of we understand how induction heads work in a two-layer transformer, we understand, you know, modular arithmetic.

Speaker 2 How does this add up to does this model want to kill us? Like, what does this model fundamentally want?

Speaker 1 A few things on that. I mean, I think that's like the right set of questions to ask.

Speaker 1 I think what we're hoping for in the end is not that we'll understand every detail, but again, I would give like the x-ray or the MRI analogy that like we can be in a position where we can look at the broad features of the model and say,

Speaker 1 is this a model whose internal state and plans are very different from what it externally represents itself to do, right?

Speaker 1 Is this a model where we're uncomfortable that, you know, far too much of its computational power is, you know, is devoted to doing what look like fairly destructive and manipulative things?

Speaker 1 Again, we don't know for sure whether that's possible, but I think some at least positive signs that it might be possible. Again, the model is not intentionally hiding from you, right?

Speaker 1 It might turn out that the training process hides it from you.

Speaker 1 And, you know, I can think of cases where if the model is really super intelligent, it like thinks in a way so that it like affects its own cognition. I suspect we should think about that.

Speaker 1 We should consider everything.

Speaker 1 I suspect that it may roughly work to think of the model as, you know, if it's trained

Speaker 1 in the normal way, just at the point of getting to just above human level, it may be a reasonable assumption (you should check) that the internal structure of the model is not intentionally optimizing against us.

Speaker 1 And I give an analogy like to humans. So it's actually possible

Speaker 1 to, you know, to look at an MRI of someone

Speaker 1 and predict above random chance whether they're a psychopath.

Speaker 1 There was actually a story a few years back about a neuroscientist who was studying this, and then he looked at his own scan and discovered that he was a psychopath. And then

Speaker 1 everyone in his life was like, no, no, no, that's just obvious. Like, you're a complete asshole.
Like, you must be a psychopath.

Speaker 1 And he was totally unaware of this. The basic idea that

Speaker 1 there can be these macro features that, like, psychopath is probably a good analogy for it, right?

Speaker 1 They're like, you know, this is what we'd be afraid of, a model that's kind of like charming on the surface, very goal-oriented, and, you know, very dark on the inside.

Speaker 1 You know, and, you know, on the surface, their behavior might look like the behavior of someone else, but their goals are very different.

Speaker 2 A question somebody might have is, listen, you know, you mentioned earlier the importance of being empirical. Yeah.

Speaker 2 And in this case, you're trying to estimate, you know, listen, are these activations sus?

Speaker 2 But is this something we can afford to be empirical about,

Speaker 2 or do we need a very good first-principles theoretical reason to think, no, it's not just that these MRIs of the model correlate with, you know, being bad. We need, like, some

Speaker 2 deep mathematical proof that this is aligned.

Speaker 1 So it depends what you mean by empirical. I mean, a better term would be phenomenological.

Speaker 1 Like, I don't think we should be purely phenomenological, in the sense of, here are some brain scans of really dangerous models, and here are some brain scans of safe ones.

Speaker 1 I think the whole idea of mechanistic interpretability is to look at the underlying principles and circuits.

Speaker 1 But I guess the way I'd think about it is like, on one hand, I've actually always been a fan of studying these circuits at the lowest level of detail that we possibly can.

Speaker 1 And the reason for that is kind of that's how you build up knowledge. Even if you're ultimately aiming for

Speaker 1 there's too many of these features, it's too complicated. At the end of the day, we're trying to build something broad, we're trying to build some broad understanding.

Speaker 1 I think the way you build that up is by trying to make a lot of these very specific discoveries.

Speaker 1 Like, you have to understand the building blocks, and then you have to figure out how to use that to draw these broad conclusions, even if you're not going to figure out everything.

Speaker 1 You know, I think you should probably talk to Chris Olah, who would have much more detail, right?

Speaker 1 This is my kind of high-level thinking on it. Chris Olah controls the interpretability agenda; you know, he's the one who decides what to do on interpretability. This is my high-level thinking about it, which is not going to be as good as his.

Speaker 2 Does the bull case on Anthropic rely on the fact that mechanistic interpretability is helpful for capabilities?

Speaker 1 I don't think so at all. Now, I do think in principle it's possible that mechanistic interpretability could be helpful with capabilities.

Speaker 1 We might, for various reasons, not choose to talk about it if that were the case.

Speaker 1 That wasn't something that I thought of, or that any of us thought of, at the time of Anthropic's founding.

Speaker 1 I mean, we thought of ourselves as like, you know, we're people who are like good at scaling models and good at doing safety on top of those models.

Speaker 1 And like, you know, we think that we have a very high talent density of folks who are good at that. And, you know, my view has always been talent density beats talent mass.

Speaker 1 And so, you know,

Speaker 1 that's more of our bull case. Talent density beats talent mass.
I don't think it depends on some particular thing. Like, others are starting to do mechanistic interpretability now.

Speaker 1 And I'm very glad that they are.

Speaker 1 You know, that is

Speaker 1 a part of our theory of change is paradoxically to make other organizations more like us.

Speaker 2 Talent density, I'm sure, is important. But another thing Anthropic has emphasized is that you need to have frontier models in order to do safety research.

Speaker 2 And of course, actually be a company as well. The current frontier models, somebody might guess, like GPT-4, cost like $100 million, something like that.

Speaker 1 That general order of magnitude in very broad terms is not wrong.

Speaker 2 But, you know, two to three years from now, with the kinds of things you're talking about, we're talking more and more orders of magnitude to keep up with that.

Speaker 2 And if it's the case that safety requires you to be on the frontier, I mean, what does it look like for Anthropic to be competing with these Leviathans to stay at that same scale?

Speaker 1 I mean,

Speaker 1 I think it's a situation with a lot of trade-offs, right?

Speaker 1 I think it's not easy.

Speaker 1 I guess to go back, maybe I'll just like answer the questions one by one, right? So to go back to like, you know, why, why is safety so tied to scale, right?

Speaker 1 Some people don't think it is, but if I just look at, you know, where have been the areas where safety methods have been put into practice, or worked for something, for anything, even if we don't think they'll work in general, I go back to thinking of all the ideas, something like, you know, debate and amplification, right?

Speaker 1 You know, back in 2018, when we wrote papers about those at OpenAI, it was like, well, human feedback isn't quite going to work, but debate and amplification will take us beyond that. But then

Speaker 1 if you actually look at, and we've done attempts to do debates, we're really limited by the quality of the model, where it's like,

Speaker 1 for two models to have a debate that is coherent enough that a human can judge it so that the training process can actually work,

Speaker 1 you need models that are at or maybe even beyond on some topics, the current frontier. Now, you can come up with a method.
You can come up with the idea without being on the frontier.
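The debate setup described here (two models argue over rounds, a human or weak model judges, and the judgment would drive training) can be sketched roughly as follows. The debater and judge stubs are hypothetical illustrations, not actual training code from any lab; in practice the debaters would be frontier LLMs, which is exactly why the method is gated on model quality:

```python
# Minimal sketch of debate: two models argue alternating rounds, and a
# weaker judge picks a winner. The winner signal is what would feed back
# into training as a reward.

def debater_a(question, transcript):
    # Stub: a real implementation would call a frontier model here.
    return f"A argues: the answer to '{question}' is 42 (round {len(transcript)})"

def debater_b(question, transcript):
    return f"B argues: the answer to '{question}' is not 42 (round {len(transcript)})"

def judge(transcript):
    # Stub judge: a human (or weak model) reads the transcript and picks
    # the more convincing side. Here we just pick A for illustration.
    return "A"

def run_debate(question, rounds=3):
    transcript = []
    for _ in range(rounds):
        transcript.append(debater_a(question, transcript))
        transcript.append(debater_b(question, transcript))
    winner = judge(transcript)
    # The winner's moves would receive positive reward during training.
    # This is the step that requires debaters coherent enough to judge.
    return winner, transcript

winner, transcript = run_debate("What is 6 * 7?")
```

The point in the surrounding discussion is that the loop only produces a usable training signal once the debaters are near or beyond the current frontier.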

Speaker 1 But, you know, for me, that's a very small fraction of what needs to be done, right? It's very easy to come up with these methods.

Speaker 1 It's very easy to come up with like, oh, the problem is X, maybe a solution is Y. But, you know, I really want to know.
you know, whether things work in practice, even for the systems we have today.

Speaker 1 And I want to know what kinds of things go wrong with them. I just feel like you discover 10 new ideas and 10 new ways that things are going to go wrong by trying these in practice.
And

Speaker 1 that empirical learning, I think it's just not as widely understood as it should be.

Speaker 1 I would say the same thing about methods like constitutional AI. And some people say, oh, it doesn't matter.
Like we know this method doesn't work. It won't work for pure alignment.

Speaker 1 I neither agree nor disagree with that. I think that's just kind of overconfident.

Speaker 1 The way we discover new things and understand the structure of what's going to work and what's not is by playing around with things.

Speaker 1 Not that we should just kind of blindly say, oh, this worked here and so it'll work there, but you really start to understand the patterns, like with the scaling laws. Even mechanistic interpretability, which might be the one area I see where a lot of progress has been made without the frontier models: we're seeing, in the work that, say, OpenAI put out a couple of months ago, that you can use very powerful models to help you auto-interpret the weak models. Again, that's not everything you can do in interpretability, but

Speaker 1 that's a big component of it. And

Speaker 1 we found it useful too. And so you see

Speaker 1 this phenomenon over and over again, where it's like

Speaker 1 the scaling and the safety are these two snakes that are like coiled with each other, always even more than you think, right?

Speaker 1 With interpretability, like I think three years ago, I didn't think that this would be as true of interpretability, but somehow it manages to be true. Why? Because intelligence is useful.

Speaker 1 It's useful for a number of tasks. One of the tasks it's useful for is like figuring out how to judge and evaluate other intelligence and maybe someday even for doing the alignment research itself.
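The auto-interpretation idea mentioned above (a strong model proposes a natural-language explanation of a weak model's neuron, which is then scored by how well it predicts the neuron's activations) can be sketched minimally. All the model calls below are stubs with hypothetical outputs; real versions would query an LLM:

```python
# Sketch of the auto-interpretation loop, in the spirit of using powerful
# models to explain neurons in weaker models. The "strong model" is a stub.

def top_activating_examples(neuron_activations, texts, k=3):
    # Rank the texts a small model's neuron fired on, highest activation first.
    ranked = sorted(zip(neuron_activations, texts), reverse=True)
    return [text for _, text in ranked[:k]]

def explain_neuron(examples):
    # Stub for the strong model: propose a natural-language explanation
    # from the top-activating examples.
    return "fires on text mentioning dollar amounts"  # hypothetical output

def score_explanation(explanation, texts, true_activations, predict):
    # Score: how well do activations predicted from the explanation alone
    # agree with the neuron's true activations?
    predicted = [predict(explanation, t) for t in texts]
    agree = sum(1 for p, a in zip(predicted, true_activations)
                if (p > 0) == (a > 0))
    return agree / len(texts)

texts = ["it cost $5", "the cat sat", "$100 fine", "hello world"]
acts = [0.9, 0.0, 0.8, 0.1]  # toy activations for one neuron

examples = top_activating_examples(acts, texts)
explanation = explain_neuron(examples)
# Trivial predictor standing in for simulating with the strong model:
score = score_explanation(explanation, texts, acts,
                          predict=lambda e, t: 1.0 if "$" in t else 0.0)
```

The scoring step is where model quality matters: a stronger explainer and simulator give explanations that track the true activations more faithfully.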

Speaker 2 Given all that's true, what does that imply for Anthropic when in two to three years, these Leviathans are doing like $10 billion training runs?

Speaker 1 Choice one is if we can't or if it costs too much to stay on the frontier, then

Speaker 1 we shouldn't do it. And we won't work with the most advanced models.
We'll see what we can get with models that are not quite as advanced.

Speaker 1 I think you can get some value there, like non-zero value, but I'm kind of skeptical that the value is all that high, or that the learning can be fast enough, to really be in favor of that approach.

Speaker 1 The second option is you just find a way.

Speaker 1 You just accept the trade-offs. And I think the trade-offs are more positive than they appear because of a phenomenon that I've called race to the top.

Speaker 1 I could go into that later, but let me just put that aside for now.

Speaker 1 And then I think the third phenomenon is, as things get to that scale, I think this may coincide with, you know, starting to get into some non-trivial probability of very serious danger.

Speaker 1 Again, I think it's going to come first from misuse, the kind of bio stuff that I talked about, but I don't think we have the level of autonomy yet to worry about some of the, you know, alignment stuff happening in like two years, but it might not be very far behind that at all.

Speaker 1 You know, that may lead to unilateral or multilateral or government-enforced, which we support, decisions not to scale as fast as we could. That may end up being the right thing to do.

Speaker 1 So

Speaker 1 actually, that's kind of like, I kind of hope things go in that direction.

Speaker 1 And then we don't have this hard trade-off between, we're not on the frontier and we can't quite do the research as well as we want or influence other orgs as well as we want, versus, we're kind of on the frontier and have to accept the trade-offs, which are net positive, but have a lot in both directions.

Speaker 2 Okay, on the misuse versus misalignment, those are both problems, as you mentioned. But in the long scheme of things,

Speaker 2 what are you more concerned about?

Speaker 2 Like 30 years down the line, which do you think will be considered a bigger problem?

Speaker 1 I think it's much less than 30 years, but I'm worried about both. I don't know.

Speaker 1 If you have a model that could, in theory, you know, like take over the world on its own, if you were able to control that model, then you know, it follows pretty simply that you know, if a model was following the wishes of some small subset of people and not others, then those people could use it to take over the world on their behalf.

Speaker 1 The very premise of misalignment means that we should be worried about misuse as well, with similar levels of consequences.

Speaker 2 But some people who might be more doomery than you would say, misuse is

Speaker 2 you're already working towards the optimistic scenario there, because you've at least figured out how to align the model with the bad guys.

Speaker 2 Now you just need to make sure it's aligned with the good guys instead. Why do you think that you could get to the point where it's aligned with the bad guys,

Speaker 2 you know, if you haven't already solved this?

Speaker 1 I guess if you had the view that like alignment is completely unsolvable, then

Speaker 1 you'd be like, well, I don't, you know, we're dead anyway, so I don't want to worry about misuse. That's not my position at all.

Speaker 1 But also, like, you should think in terms of like, what's a plan that would actually succeed, that would make things good.

Speaker 1 Any plan that actually succeeds, regardless of how hard misalignment is to solve, is going to need to solve misuse as well as misalignment.

Speaker 1 It's going to need to solve the fact that like as the AI models get better, you know, faster and faster, they're going to create a big problem around the balance of power between countries.

Speaker 1 They're going to create a big problem around is it possible for a single individual to do something bad that it's hard for everyone else to stop.

Speaker 1 Any actual solution that leads to a good future needs to solve those problems as well.

Speaker 1 If your perspective is, we're screwed because we can't solve the first problem, so don't worry about problems two and three, that's not really an argument that

Speaker 1 you shouldn't worry about problems two and three, right? Like, they're in our path no matter what.

Speaker 2 Yeah, in this scenario, we succeed. We have to solve all of it.
So, yeah, you might as well operate on that assumption.

Speaker 1 We should be planning for success, not for failure.

Speaker 2 If misuse doesn't happen and the right people have the superhuman models, what does that look like? Like, who are the right people? Who is actually controlling the model five years from now?

Speaker 1 Yeah, I mean,

Speaker 1 my view is that these things are powerful enough that I think it's going to involve, you know, a substantial role, or at least involvement, of some kind of government or assembly of government bodies.

Speaker 1 Again, like, you know, there are kind of very naive versions of this. Like, you know, I don't think we should just...

Speaker 1 you know, I don't know, like hand the model over to the UN or whoever happens to be in office at a given time. Like, I could see that go poorly, but

Speaker 1 it's too powerful. There needs to be some kind of legitimate process for managing this technology, which

Speaker 1 includes the role of the people building it, includes the role of democratically elected authorities, includes the role of

Speaker 1 all the individuals who will be affected by it. So

Speaker 1 at the end of the day, there needs to be some politically legitimate process.

Speaker 2 But what does that look like? If it's not the case that you just hand it to whoever the president is at the time,

Speaker 2 what does the body look like?

Speaker 1 I mean, these are things it's really hard to know ahead of time.

Speaker 1 Like, I think, you know, people love to kind of propose these broad plans and say, like, oh, this is the way we should do it. This is the way we should do it.

Speaker 1 I think the honest fact is that we're figuring this out as we go along. And that, you know,

Speaker 1 anyone who says, you know, this is the body, we should create this kind of body modeled after this thing... Like, I think.

Speaker 1 I think we should try things and experiment with them with less powerful versions of the technology.

Speaker 1 We need to figure this out in time, but also it's not really the kind of thing you can know in advance.

Speaker 2 The long-term benefit trust that you have,

Speaker 2 how would that interface with this body? Is that the body itself? If not, what is it? Just for context, you might want to explain what it is for the audience.

Speaker 1 I think of the long-term benefit trust as a much narrower thing. Like, this is something that makes decisions for Anthropic.

Speaker 1 So, this is basically a body; it was described in a recent Vox article. We'll be saying more about it, you know, later this year.

Speaker 1 But it's basically a body that, over time, gains the ability to appoint the majority of the board seats of Anthropic. It's a mixture of experts in, I'd say, AI, alignment, national security, and philanthropy in general. But if control of Anthropic is handed to them, that doesn't imply that, if Anthropic has AGI, control of AGI itself is handed to them. That doesn't imply that Anthropic or any other entity should be the entity that makes decisions about AGI on behalf of humanity.

Speaker 1 I would think of those as different.

Speaker 1 I mean, there's lots of, you know, like if Anthropic does play a broad role, then you'd want to like widen that body to be, you know, like a whole bunch of different people from around the world.

Speaker 1 Or maybe you construe this as very narrow, and then, you know, there's some broad committee somewhere that manages all the AGIs of all the companies on behalf of everyone.

Speaker 1 I don't know. Like, I think my view is you shouldn't be sort of overly prescriptive and utopian.
Like, we're dealing with a new problem here.
Like, we're dealing with a new problem here.

Speaker 1 We need to start thinking now about, you know,

Speaker 1 what are the governmental bodies and structures that could deal with it?

Speaker 2 Okay, so let's forget about governance. Let's just talk about what this going well looks like.

Speaker 2 Obviously, there's the things we can all agree on, you know, cure all the diseases, you know, solve all the problems, things all humans would say, I'm down for that.

Speaker 2 But now it's 2030. You've solved all the real problems that everybody can agree on.

Speaker 2 What happens next? What are we doing with a superhuman god?

Speaker 1 I think I actually want to like, I don't know, like disagree with the framing or something like this.

Speaker 1 I actually get nervous when someone says, like, what are you going to do with a superhuman AI?

Speaker 1 Like, we've learned a lot of things over the last 150 years about like markets and democracy and each person can kind of define for themselves like what the best way for them to have the human experience is and that, you know, societies work out norms and what they value in this, just in this very like complex and decentralized way.

Speaker 1 Now, again, if you have these safety problems, that can be a reason why, you know, especially from the government, there needs to be, maybe until we've solved these problems, a certain amount of centralized control.

Speaker 1 But as a matter of like, we've solved all the problems. Now, how do we make things good? I think that most

Speaker 1 people, most groups, most ideologies that started with like, let's sit down and

Speaker 1 think over what the definition of a good life is.

Speaker 1 I think most of those have led to disaster.

Speaker 2 But so this vision you have of a sort of tolerant, liberal-democratic, market-oriented system with AGI. Like, what, each person has their own AGI? What is that? What does that mean?

Speaker 1 I don't know. I don't know what it looks like, right? Like, I guess what I'm saying is, like, we need to solve the kind of important safety problems and the important externalities.

Speaker 1 And then, and then, subject to that, you know, which again, you know, those could be just narrowly about alignment.

Speaker 1 There could be a bunch of economic issues that are super complicated and that we can't solve. You know, subject to that, like, we should think about what's worked in the past.

Speaker 1 And I think in general, like

Speaker 1 unitary visions for what it means to live a good life have not worked out well at all.

Speaker 2 On the opposite end of things going well or good actors having control of AI,

Speaker 2 we might want to touch on China as a potential actor in the space.

Speaker 2 So first of all, I mean, you were at Baidu and, like, saw progress in AI happening generally.

Speaker 2 why do you think the Chinese have underperformed? You know, Baidu had a scaling laws group many years back, or is the premise wrong? And I'm just not aware of the progress that's happening there.

Speaker 1 Well, for the Scaling Laws group, I mean, that was an offshoot of the stuff we did with speech.

Speaker 1 So, you know, there were still some people there, but that was a mostly Americanized lab. I mean, I was there for a year.
That was, you know, my first foray into deep learning.

Speaker 1 It was led by Andrew Ng. I never went to China.
Most, you know, it was like a U.S. lab.

Speaker 1 So I think that was somewhat disconnected, although it was an attempt by, you know, a Chinese entity to kind of get it, get into the game.

Speaker 1 But I don't know.

Speaker 1 I think since then, you know, I couldn't speculate, but I think they've been maybe very commercially focused and not as focused on the kind of fundamental research side of things around scaling laws.

Speaker 1 Now, I do think because of all the, you know, excitement with the release of ChatGPT in, you know, November or so, you know, that's been a starting gun for them as well.

Speaker 1 And they're trying very aggressively to catch up now.

Speaker 1 I think the U.S. is quite substantially ahead, but I think they're trying very hard to catch up now.

Speaker 2 How do you think China thinks about AGI? Are they thinking about safety and misuse or not?

Speaker 1 I don't really have a sense.

Speaker 1 You know, one concern I would have is when people say things like, well, China isn't going to develop an AI because they like stability, or

Speaker 1 they're going to have all these restrictions to make sure things are in line with what the CCP wants.

Speaker 1 That might be true in the short term and for consumer products. My worry is that, if the basic incentives are about national security and power, that's going to become clear sooner or later.

Speaker 1 And so, you know,

Speaker 1 I think they're going to, if they see this as, you know, a source of national power, they're going to at least try to do what's most effective.

Speaker 1 And that, you know, that could lead them in the direction of AGI.

Speaker 2 At what point, like, is it possible for them to just get your blueprints or your code base or something that they can just spin up their own lab that is competitive at the frontier with the leading American companies?

Speaker 1 Well, I don't know about that, but I'm, like, concerned about this.

Speaker 1 So this is one reason why we're focusing so hard on cybersecurity. You know, we've worked with our cloud providers.

Speaker 1 We really, you know... we had this blog post out about security where we said, you know, we have a two-key system for access to the model weights.

Speaker 1 We have other measures that we put in place, or are thinking of putting in place, that, you know, we haven't announced.

Speaker 1 We don't want an adversary to know about them, but we're happy to talk about them broadly. All this stuff we're doing is, by the way, not sufficient yet for a super determined

Speaker 1 state-level actor at all.

Speaker 1 I think it will defend against most attacks, and against a state-level actor who's less determined.

Speaker 1 But there's a lot more we need to do, and some of it may require new research on how to do security.
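The "two-key" access idea mentioned above is a dual-authorization pattern: no single person can release the model weights on their own. The sketch below illustrates the generic pattern only; the names and the check itself are hypothetical, not Anthropic's actual system:

```python
# Generic sketch of dual authorization ("two-key") for weights access:
# two distinct, pre-authorized approvers must both sign off.
# This is an illustration of the pattern, not any lab's real implementation.

AUTHORIZED = {"alice", "bob", "carol"}  # hypothetical key holders

def request_weights_access(approver_1, approver_2):
    if approver_1 == approver_2:
        raise PermissionError("approvals must come from two distinct people")
    if not {approver_1, approver_2} <= AUTHORIZED:
        raise PermissionError("both approvers must be pre-authorized")
    return "access granted"  # a real system would unseal a key share here

result = request_weights_access("alice", "bob")
```

The design value is that compromising any single account (or insider) is no longer sufficient to exfiltrate the weights.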

Speaker 2 Okay, so let's talk about what it would take at that point.

Speaker 2 You know, we're at anthropic offices, and it's like got good security. We had to get badges and everything to come in here.

Speaker 2 But the eventual version of this building or bunker or whatever where the AGI is built, I mean, what does that look like?

Speaker 2 Is it a building in the middle of San Francisco, or are you out in the middle of Nevada or Arizona? Like, what is the point at which you're, like, Los Alamos or something?

Speaker 1 At one point, there was a running joke somewhere that

Speaker 1 the way building AGI would look like is there would be a data center next to a nuclear power plant next to a bunker.

Speaker 1 And that we'd all kind of live in the bunker and everything would be local, so it wouldn't get on the internet.

Speaker 1 Again,

Speaker 1 if we take seriously the rate at which all this is going to happen, which I don't know, I can't be sure of it.
But if we take that seriously, then

Speaker 1 it does make me think that maybe not something quite as cartoonish as that, but that something like that might happen.

Speaker 2 What is the time scale on which you think alignment is solvable? If these models are getting to human level in some things in two to three years, what is the point at which they're aligned?

Speaker 1 I think this is a really difficult question because I actually think often people are thinking about kind of alignment in the wrong way.

Speaker 1 I think there's a general feeling that it's like models are misaligned or like there's like an alignment problem to solve, kind of like the Riemann hypothesis or something.

Speaker 1 Like someday we'll crack the Riemann hypothesis. I don't quite think it's like that.
Not in a way that's either worse or better. It might be just as bad or just as unpredictable.

Speaker 1 When I think of like, you know,

Speaker 1 why am I scared?

Speaker 1 A few things I think of. One is, look, like, I think the thing that's really hard to argue with is like, there will be powerful models.
They will be agentic.

Speaker 1 We're getting towards them. If such a model wanted to wreak havoc and destroy humanity or whatever, I think we have basically no ability to stop it. If that's not true at some point, it will reach the point where it's true as we scale the models. So that definitely seems the case. And I think a second thing that seems the case is that we seem to be bad at controlling the models. Not in any particular way, but just that they're statistical systems, and you can ask them a million things, and they can say a million things in reply.

Speaker 1 And, you know, you might not have thought of one of the million things, and it does something crazy.

Speaker 1 Or when you train them, you train them in this very abstract way, and you might not understand all the consequences of what they do in response to that.

Speaker 1 I mean, I think the best example we've seen of that is like

Speaker 1 Bing Sydney, right? Where it's like, I don't know how they trained that model.

Speaker 1 I don't know what they did to make it do all this weird stuff, like, you know, threaten people and, you know, have this kind of weird, obsessive personality.

Speaker 1 But, but what it shows is that we can get something very different from and maybe opposite to what we intended.

Speaker 1 And so, I actually think fact number one and fact number two are enough to be really worried.

Speaker 1 Um, like, you don't need all this detailed stuff about, you know, convergent instrumental goals or, you know, analogies to evolution. Like, actually, one and two for me are pretty motivating.

Speaker 1 I'm like, okay, this thing's going to be powerful. It could destroy us.

Speaker 1 And, like, all the ones we've built so far, like, you know, are at pretty decent risk of doing some random shit we don't understand.

Speaker 2 Yeah, if I agree with that, and I'm like, okay, I'm concerned about this. The research agenda you have of mechanistic interpretability plus constitutional AI and the other RLHF stuff.

Speaker 2 If you say that we're going to get something with like bioweapons or something that could be dangerous in two to three years,

Speaker 2 do these things culminate, within two to three years, in actually meaningfully contributing to preventing that?

Speaker 1 So I think where I was going to go with this is, like, you know, people talk about doom by default or alignment by default. Like, I think it might be kind of statistical.

Speaker 1 Like, you know, like you might get, you know, with the current models, you might get Bing or Sydney or you might get Claude. And it doesn't really matter because Bing or Sydney, like...

Speaker 1 If we take our current understanding and, you know, move that to very powerful models, you might just be in this world where it's like, okay, you make something and depending on the details, maybe it's totally fine.

Speaker 1 You know, not really alignment by default, but just kind of like, it depends on a lot of the details.

Speaker 1 And, like, if you're very careful about all those details and you know what you're doing, you get it right.

Speaker 1 But we have a high susceptibility to: you mess something up in a way that you didn't really understand was connected, and actually, instead of making all the humans happy, it wants to, you know, turn them into pumpkins.

Speaker 1 Yeah, you know... just some weird shit, right?

Speaker 1 Because the models are so powerful, you know, they're like these kinds of giants, you know, standing in a landscape.

Speaker 1 And if they start to move their arms around randomly, they could just break everything.

Speaker 1 I guess I'm starting with that kind of framing because I don't think we're aligned by default, and I don't think we're doomed by default, with some problem we need to solve.

Speaker 1 It has some kind of different character. Now, what I do think is that hopefully within a time scale of two to three years, we get better at diagnosing when the models are good and when they're bad.

Speaker 1 We get better at training, you know, increasing our repertoire of methods to train the model that they're less likely to do bad things and more likely to do good things in a way that isn't just relevant to the current models, but scales.

Speaker 1 And we can help develop that with interpretability as the test set. I don't think of it as, oh man, we tried RLHF, it didn't work.
We tried constitutional, it didn't work.

Speaker 1 Like, we tried this other thing, it didn't work. We tried mechanistic interpretability.
Now we're going to try mechanistic.

Speaker 1 I think this frame of like, man, we haven't cracked the problem yet. We haven't solved the Riemann hypothesis isn't quite right.

Speaker 1 I think of it more as

Speaker 1 already with today's systems, we are not very good at controlling them. And the consequences of that

Speaker 1 could be very bad. We just need to get more ways of like increasing the likelihood

Speaker 1 that we can control our models and understand what's going on in them. And we have some of them so far.

Speaker 1 They aren't that good yet.

Speaker 1 But

Speaker 1 I don't think of it as this binary of works and not works. We're going to develop more.

Speaker 1 And I do think that over the next two to three years, we're going to start eating that probability mass of ways things can go wrong.

Speaker 1 You know, it's kind of like in the Core Views on AI Safety post, right? There's this probability mass of how hard the problem is. I feel like that way of stating it isn't really even quite right, right?

Speaker 1 Because I don't feel like it's the Riemann hypothesis to solve.

Speaker 1 I just feel like

Speaker 1 it's almost like right now, if I try and juggle five balls or something, I can juggle three balls, right? I actually can, but I can't juggle five balls at all, right?

Speaker 1 You have to practice a lot to do that. If I were to try, I would almost certainly drop them.

Speaker 1 And then just over time, you just get better at the task of controlling the balls.

Speaker 2 On that post in particular, what is your personal probability distribution over these? So, for the audience, the three possibilities are: it is trivial to align these models with RLHF,

Speaker 2 to it is a difficult problem, but one that a big company could solve to something that is like basically impossible for human civilization currently to solve.

Speaker 2 If I'm capturing those three, what is your probability distribution over those three personally?

Speaker 1 Yeah, I mean, I'm not super into like, what's your probability distribution of X? I think all of those have enough likelihood that, you know, they should be considered seriously.

Speaker 1 I'm more interested, the question I'm much more interested in is, what could we learn that shifts probability mass between them? What is the answer to that?

Speaker 1 I think that one of the things mechanistic interpretability is going to do

Speaker 1 more than necessarily solve problems is it's going to tell us what's going on when we try to align models.

Speaker 1 I think it's basically going to teach us about this.

Speaker 1 Like one way I could imagine concluding that things are very difficult is if mechanistic interpretability sort of shows us that, I don't know, problems tend to get moved around instead of being stamped out, or that you get rid of one problem, you create another one, or it might inspire us or give us insight into why problems are kind of persistent or hard to eradicate or crop up.

Speaker 1 Like, for me to really believe some of these stories about, you know, oh, there's always this convergent goal in this particular direction.

Speaker 1 I think the abstract story is, it's not uncompelling, but I don't find it really compelling either. Nor do I find it necessary to motivate all the safety work.

Speaker 1 But like the kind of thing that would really be like, oh man, we can't solve this is like.

Speaker 1 we see it happening inside the X-ray. Because I think right now there are just way too many assumptions, way too much overconfidence about how all this is going to go.

Speaker 1 I have a substantial probability mass on this all goes wrong. It's a complete disaster, but in a completely different way than anyone had anticipated.

Speaker 2 It would be beside the point to ask: how could it go different than anyone anticipated? So,

Speaker 2 on this in particular, what information would be relevant? How much would the difficulty of aligning Claude 3

Speaker 2 and the next generation of models basically be? Like, is that a big piece of information? Is that not?

Speaker 1 So, I think the people who are most worried are predicting that all the subhuman AI models are going to be alignable, right? They're going to seem aligned.

Speaker 1 They're going to deceive us in some way. I think it certainly gives us some information, but

Speaker 1 I am more interested in what mechanistic interpretability can tell us

Speaker 1 because,

Speaker 1 again,

Speaker 1 you see this x-ray. It would be too strong to say it doesn't lie, but...
at least in the current systems, it doesn't feel like it's optimizing against us. There are exotic ways that it could.

Speaker 1 You know, I don't think anything is a safe bet here, but I think it's the closest we're going to get to something that isn't actively optimizing against us.

Speaker 2 Let's talk about the specific methods other than mechanistic interpretability that you guys are researching. When we talk about RLHF or constitutional AI or whatever, take RLHF.

Speaker 2 If you had to put it in terms of human psychology, what is the change that is happening?

Speaker 2 Are we creating new drives, new goals, new thoughts?

Speaker 2 How is the model changing in terms of psychology?

Speaker 1 I think all those terms are kind of inadequate for describing what's going on; it's not clear how useful they are as abstractions for humans either.

Speaker 1 I think we don't have the language to describe what's going on. And again, I'd love to have the x-ray.
I'd love to look inside and say, and kind of actually know what we're talking about instead of

Speaker 1 you know, basically making up words, which is what, which is what I do, what you're doing in asking this question.

Speaker 1 Where, you know,

Speaker 1 we should just be honest.

Speaker 1 We really have very little idea what we're talking about. So it would be great to say, well, what we actually mean by that is

Speaker 1 this circuit within here

Speaker 1 turns on and

Speaker 1 after we've trained the model, then this circuit is no longer operative or weaker. And that way, I would love to be able to say, again,

Speaker 1 it's going to take a lot of work to be able to do that.

Speaker 2 Model organisms, which you hinted at before when you said we're doing these evaluations to see if they're capable of doing dangerous things now, and currently they're not.

Speaker 2 How worried are you about a lab leak scenario where in fine-tuning it or in trying to get these models to elicit dangerous behaviors, you know, make bioweapons or something, it like leaks somehow and actually makes the bioweapon instead of telling you it can make the bioweapon?

Speaker 1 With today's passive models, you know, chatbots, I think it's not so much of a concern, right?

Speaker 1 Because if we were to fine-tune a model to do that, we'd do it privately and work with the experts.

Speaker 1 And so the leak would be like, suppose the model got open sourced or something, and then someone... So I think for now it's mostly a security issue.

Speaker 1 In terms of models truly being dangerous, I think we do have to worry that if we make a truly powerful model and we're trying to see what makes it dangerous or safe, then there could be more of a one-shot thing, where there's some risk that the model takes over.

Speaker 1 I think the main way to control that is to make sure that the capabilities of the model that we test are not such that they're capable of doing this.

Speaker 2 At what point would the capabilities be so high where you say, I don't even want to test this?

Speaker 1 Well, there's different things. I mean, there's capability testing, and you know,

Speaker 2 But that itself could be risky. If you're testing whether it can replicate, what if it actually does?

Speaker 1 Sure, but I think what you want to do is extrapolate. So, we've talked with ARC about this, right?

Speaker 1 You know, you have factors of two of compute or something, where you're like, okay, can the model do something like open up an account on AWS and make some money for itself?

Speaker 1 Like, some of the things that are obvious prerequisites to complete survival in the wild.

Speaker 1 And so you set those thresholds well below that. And then as you proceed upward from there, do kind of more and more rigorous tests and be more and more careful about

Speaker 1 what it is you're doing.

Speaker 2 On Constitution AI, and feel free to explain what this is for the audience, but who decides what the constitution for the next generation of models or a potentially superhuman model is?

Speaker 2 How is that actually written?

Speaker 1 I think initially, you know, to make the constitution, we just took some stuff that was broadly agreed on, like the UN Declaration of Human Rights, and

Speaker 1 some of the stuff from Apple's terms of service, right? Stuff where there's consensus on what's acceptable to say or

Speaker 1 what basic things are able to be included. So, one, I think for future constitutions, we're looking into more participatory processes for making these.

Speaker 1 But I think beyond that, I don't think there should be one constitution for a model that everyone uses. Probably

Speaker 1 a model's constitution should be very, very simple, right? It should only have very basic facts that everyone would agree on.

Speaker 1 Then there should be a lot of ways that you can customize, including appending, you know, constitutions. And, you know, I think beyond that, we're developing new methods, right?

Speaker 1 This is, you know, I'm not imagining that this or this alone is the method that we'll use to train superhuman AI, right? Many of the parts of capability training may be different.

Speaker 1 And so, you know, it could look very different. And again, there are levels above this.
Like, I'm pretty uncomfortable with, here's the AI's constitution.

Speaker 1 It's going to run the world. You know, just normal lessons from how societies work and how politics works:
that just strikes me as fanciful.

Speaker 1 Like, I think we should try to hook these things into,

Speaker 1 you know, even when they're very powerful, again, after we've mitigated the safety issues, any good future, even if it has all these security issues that we need to solve, somehow needs to end with something

Speaker 1 that's more decentralized and less like a godlike superintelligence.

Speaker 1 I just don't think that ends well.

Speaker 2 What scientists from the Manhattan Project do you respect most in terms of they acted most ethically under the constraints they were given?

Speaker 2 Is there one that comes to mind?

Speaker 1 I don't know. I mean,

Speaker 1 I think there's a lot of answers you could give. I mean, I'm definitely a fan of Zillard for having kind of figured it out.
He was then

Speaker 1 against the actual dropping of the bomb. I don't actually know the history well enough to have an opinion on whether demonstration of the bomb could have ended the war.

Speaker 1 I mean, that involves a bunch of facts about Imperial Japan that are complicated and that I'm not an expert on.

Speaker 1 But, you know, Szilard seemed to, you know, he discovered this stuff early. He kept it secret, patented some of it, and put it in the hands of the British Admiralty.

Speaker 1 So, you know, he seemed to display the right kind of awareness as well as

Speaker 1 as well as discovering stuff.

Speaker 1 I mean, it was around when I read that book that I wrote this big blob of compute doc, and, you know, I only showed it to a few people, and there were other docs that I showed to almost no one.

Speaker 1 So, you know, yeah, I was a bit, a bit inspired by this. Again, I mean, I, you know, we could all get self-aggrandizing here.

Speaker 1 Like, we don't know how it's going to turn out, or if it's actually going to be something on par with the Manhattan Project. I mean, you know,

Speaker 1 this could all be just Silicon Valley people building technology and, you know, just kind of like having delusions of grandeur. So I don't know how it's going to turn out.

Speaker 2 I mean, if the scaling stuff is true, then it's even bigger than the Manhattan Project.

Speaker 1 Yeah,

Speaker 1 it certainly could be bigger. I just, you know, we should always kind of...
I don't know, maintain this attitude that it's really easy to fool yourself.

Speaker 2 If you were asked by the government, if you're a physicist during World War II and you were asked by the government to contribute non-replaceable research to the Manhattan Project, well, what do you think you would have said?

Speaker 1 Yeah, I mean, I think given you're in a war with the Nazis,

Speaker 1 at least during the period when you thought that the Nazis were, I don't, yeah, I don't really see much choice

Speaker 1 but to do it, if it's possible, you know, you have to figure it's going to be done within 10 years or so by someone.

Speaker 2 Regarding cybersecurity,

Speaker 2 what should we make of the fact that there's a whole bunch of tech companies which have ordinary tech company security, and publicly facing, it's not obvious that they've been hacked? Like, Coinbase still has its Bitcoin. You know, Google, as far as I know, my Gmail hasn't been leaked. Should we take from that that current status quo tech company security practices are good enough for AGI, or just simply that nobody has tried hard enough?

Speaker 1 It would be hard for me to speak to current tech company practices, and of course there may be many attacks that we don't know about, where things are stolen and then silently used. You know, I mean, I think an indication is that when someone really cares about attacking someone, then often the attacks happen.

Speaker 1 So, you know, recently we saw that some fairly high officials of the US government had their email accounts hacked via Microsoft. Microsoft was providing the email accounts.

Speaker 1 So, you know, presumably that related to information that was of great interest to

Speaker 1 foreign adversaries.

Speaker 1 And so

Speaker 1 it seems to me, at least, you know, that the evidence is more consistent with, you know, when something is really high enough value, then, you know, then,

Speaker 1 you know, someone acts and it's stolen. And my worry is that, of course, with AGI, we'll get to a world where, you know, the value is seen as incredibly high, right?

Speaker 1 That, you know, it'll be like stealing nuclear missiles or something. You can't be too careful on this stuff.

Speaker 1 And, you know, at every place that I've worked, I've pushed for the cybersecurity to be better. One of my concerns about cybersecurity is, you know, it's not kind of something you can trumpet.

Speaker 1 I think a good dynamic with safety research is like,

Speaker 1 you know, you can get companies into a dynamic, and I think we have, where, you know, you can get them to compete to do the best safety research and, you know, kind of use it as a, I don't know, like a recruiting point of competition or something.

Speaker 1 We used to do this all the time with interpretability, you know, and then sooner or later, other, other orgs started recognizing the defect and started working on interpretability, whether or not, you know, that, you know, like whether or not that was a priority to them before but i think it's harder to do that with cyber security because a bunch of the stuff you have to do in quiet and so you know we did try to put out one post about it but i think you know mostly you just you just see the results um you know i think people should you know a good norm would be you know people see these cyber security leaks from companies or you know leaks of the model parameters or something and say you know that they they screwed up that's that's that that's bad if i'm a safety person i might not want to work there um of course as soon as i as soon as i say that we'll probably have a security breach tomorrow.

Speaker 1 But

Speaker 1 that's part of the game here, right? I think that's part of

Speaker 1 trying to make things safe.

Speaker 2 I want to go back to the thing we were talking about earlier: the ultimate level of cybersecurity required two to three years from now, and whether it requires a bunker.

Speaker 2 Like, are you actually expecting to be in a physical bunker in two to three years, or is that just a metaphor?

Speaker 1 Yeah, I mean, I think that's a metaphor.

Speaker 1 You know, we're still figuring it out.

Speaker 1 Like, something I would think about is like, I think security of the data center, which may not be in the same physical location as us, but you know, we've worked very hard to make sure it's in the United States.

Speaker 1 But securing the physical data centers and the GPUs: I think some of the really expensive attacks, if someone was really determined, involve going into the data center and trying to steal the data directly, or stealing it as it's flowing from the data center to us.

Speaker 1 I think these data centers are going to have to be built in a very special way.

Speaker 1 I mean, given the way things are scaling up, we're probably heading anyway to a world where networks of data centers cost as much as aircraft carriers or something.

Speaker 1 And so, you know, they're already going to be pretty unusual objects.

Speaker 1 But I think in addition to being unusual in terms of their ability, you know, to link together and train gigantic models, they're also going to have to be very secure.

Speaker 2 Speaking of which, you know, there have been all sorts of rumors about the difficulty of procuring the power and the GPUs for the next generation of models.

Speaker 2 What has the process been like to secure the necessary components to do the next generation?

Speaker 1 That's something I can't go into great detail about. You know, I will say, look,

Speaker 1 people think of even industrial scale data centers, right? People are not thinking at the scale that I think these models are going to go to very soon.

Speaker 1 And so whenever you do something at a scale where it's never been done before,

Speaker 1 every single component, every single thing has to be done in a new way than it was before. And so, you know,

Speaker 1 you may run into problems with surprisingly simple components. Power is one that you mentioned.

Speaker 2 And is this something that Anthropic has to handle, or can you just outsource it?

Speaker 1 For data centers, we work with cloud providers, for instance.

Speaker 2 What should we make of the fact that these models require so much training, and the entire corpus of internet data, in order to be subhuman? Whereas for GPT-4, there have been estimates that

Speaker 2 it was like 10 to the 25 flops or something. And, you know, you can take these numbers with a grain of salt, but there are reports that the human brain, from the time it is born to the time a human being is 20 years old, is like on the order of 10 to the 20 flops to simulate all those interactions.

Speaker 2 We don't have to go into particulars on those numbers, but should we be worried about how sample inefficient these models seem to be?

Speaker 1 Yeah, so I think that's one of the remaining mysteries. One way you could phrase it is that the models are

Speaker 1 maybe two to three orders of magnitude smaller than the human brain if you compare to the number of synapses, while at the same time being trained on, you know, three to four more orders of magnitude of data.

Speaker 1 If you compare to, you know, number of words

Speaker 1 a human sees as they're developing to age 18, it's, I don't remember exactly, but I think it's in the hundreds of millions.

Speaker 1 Whereas for the models, we're talking about the hundreds of billions to the trillions. So, what explains this?

Speaker 1 There are these offsetting things where the models are smaller, they need a lot more data, and they're still below human level. So,

Speaker 1 you know, there's some way in which,

Speaker 1 you know, the analogy to the brain is not quite right or is breaking down, or

Speaker 1 there's some missing factor.

Speaker 1 You know, this is just kind of like in physics where it's like, you know, we can't explain the Mickelson-Morley experiment or like, I'm forgetting one of the other 19th-century physics paradoxes, but like, I think it's one thing we don't quite understand, right?

Speaker 1 Humans see so little data and they still do fine.
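
The orders of magnitude in that comparison can be written out as a quick back-of-envelope check. To be clear, the figures below are just the rough, hedged estimates quoted in this conversation (synapse count, parameter count, word and token counts), not measured values:

```python
import math

# Back-of-envelope version of the comparison above. All figures are the
# rough order-of-magnitude estimates from the conversation, not data.
human_synapses = 1e14   # often-cited rough estimate for the human brain
model_params = 1e11     # rough order of magnitude for a large 2023 LLM

human_words = 3e8       # "hundreds of millions" of words heard by ~age 18
model_tokens = 1e12     # "hundreds of billions to the trillions" of tokens

size_gap = math.log10(human_synapses / model_params)
data_gap = math.log10(model_tokens / human_words)

print(f"model ~{size_gap:.0f} orders of magnitude smaller than the brain")
print(f"model trained on ~{data_gap:.0f} orders of magnitude more data")
```

With these inputs the gaps come out to roughly three orders of magnitude smaller and three to four orders of magnitude more data, matching the "two to three" and "three to four" ranges given above.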

Speaker 1 One theory on it: it could be that it's our other modalities.

Speaker 1 You know, how do we get 10 to the 14th bits into the human brain?

Speaker 1 Well, most of it is these images, and maybe a lot of what's going on inside the human brain is that our mental workspace involves all these simulated images or something like that.

Speaker 1 But honestly, I think intellectually we have to admit that that's a weird thing that doesn't match up. And, you know, it's one reason I'm a bit skeptical of biological analogies.

Speaker 1 I thought in terms of them like five or six years ago, but now that we actually have these models in front of us as artifacts, it feels like almost all the evidence from that has been screened off by what we've seen.

Speaker 1 And what we've seen are models that are much smaller than the human brain and yet can do a lot of the things that humans can do, and yet paradoxically require a lot more data.

Speaker 1 So maybe we'll discover something that makes it all efficient, or maybe we'll understand why the discrepancy is present. But at the end of the day, I don't think it matters, right?

Speaker 1 If we keep scaling the way we are, I think what's more relevant at this point is just measuring the abilities of the model and seeing how far they are from humans, and they don't seem terribly far to me.

Speaker 2 Does this scaling picture and the big blob of compute more generally, does that underemphasize the role that algorithmic progress has played when you compose the

Speaker 2 big blob of compute? So, you know, you were talking about LSTMs, presumably, at that point. Presumably, the scaling on that would not have you at Claude 2 at this point.

Speaker 2 So, are you underemphasizing the role that an improvement of the scale of Transformer could be having here when you put it behind the label of scaling?

Speaker 1 This big blob of compute document, which I still have not made public, I probably should for like historical reasons. I don't think it would tell anyone anything they don't know now.

Speaker 1 But when I wrote it, I actually said, look, there are seven factors. And, you know, I wasn't like, these are definitively the factors; I was just trying to give some sense of the kinds of things that matter and what don't.

Speaker 1 And so I wasn't thinking these are exactly it, you know, there could be nine, there could be five. But like

Speaker 1 the things I said were: I said, number of parameters, scale of the model, like, you know, the compute and compute matters, quantity of data matters, quality of data matters, loss function matters.

Speaker 1 So, like, you know, are you doing RL? Are you doing next word prediction? If your loss function isn't rich or doesn't incentivize the right thing, you won't, you won't get anything.

Speaker 1 Um, so those were the key four ones, uh, which I think are the core of the hypothesis. But then I said three more things.

Speaker 1 One was symmetries, which is basically like if your architecture doesn't take into account the right kinds of symmetries, it doesn't work or it's very inefficient.

Speaker 1 So for example, convolutional neural networks take into account translational symmetry. LSTMs take into account time symmetry.

Speaker 1 But a weakness of LSTMs is that they can't attend over the whole context. So there's kind of this structural weakness.

Speaker 1 Like if a model isn't structurally capable of like absorbing and managing things that happened in a far enough distant past, then it's just like, it's kind of like, you know, like the compute doesn't flow, like the spice doesn't flow.

Speaker 1 It's like, you can't like, like the, the blob has to be unencumbered, right? It kind of, it's not, it's not going to work if, if you artificially close things off.

Speaker 1 And I think RNNs and LSTMs artificially close things off, because they close you off to the distant past.
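
As a toy illustration of that structural point (my own sketch, not anything from the conversation): in a single self-attention layer, the last position can put weight directly on the very first position, while an RNN or LSTM has to carry that information step by step through a fixed-size hidden state.

```python
import numpy as np

# Toy scaled dot-product self-attention over a sequence of T tokens.
# The point: position T-1 can attend directly to position 0 in one step,
# whereas an RNN would have to carry that information through T-1 updates.

def self_attention(x):
    """x: (T, d) token vectors; identity projections for simplicity."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # (T, T) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights, weights @ x

rng = np.random.default_rng(0)
T, d = 8, 4
x = rng.normal(size=(T, d))
weights, out = self_attention(x)

# Every query position places nonzero weight on every key position,
# including the very first token: the "distant past" is one hop away.
print(weights[T - 1, 0] > 0)   # True
```

Since softmax weights are strictly positive, every past position is reachable in one hop; nothing in the architecture structurally closes off the distant past.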

Speaker 1 And so, again, things need to flow freely. If they don't, it doesn't work.
And then, you know, I added a couple of things. One of them was conditioning, which is like, you know,

Speaker 1 if the thing you're optimizing with is just really numerically bad, like you're going to have trouble. And so this is why like Atom works better than, you know, than normal SDD.

Speaker 1 And I think I'm forgetting what the seventh condition was, but it was similar to things like this, where it's like, if you set things up in a way that's set up to fail or that doesn't allow the compute to work in an uninhibited way, then it won't work.

Speaker 1 And so transformers were kind of within that, even though I can't remember if the transformer paper had been published. It was around the same time as I wrote that document.

Speaker 1 It might have been just before. It might have been just after.

Speaker 2 It sounds like, from that view, that

Speaker 2 the way to think about these algorithmic progresses is not as increasing the power of the blob of compute, but simply getting rid of the artificial hindrances that older architectures have.

Speaker 2 Is that a fair thing?

Speaker 1 Yeah, that's a little how I think about it. You know, again, if you go back to Ilya's "the models just want to learn": the compute wants to be free.
Yeah, yeah.

Speaker 1 And, you know, it's being blocked in various ways, where you don't understand that it's being blocked until you need to free it up. Right, right.

Speaker 2 I love the reference, incidentally, to the spice. Okay.

Speaker 2 On that point though, so do you think that another thing on the scale of a transformer

Speaker 2 is coming down the pike to enable

Speaker 2 the next great iterations?

Speaker 1 I think it's possible. I mean, people have worked on things like trying to model

Speaker 1 very long time dependencies or, you know, you know, there's various different ideas where I could see that we're kind of missing an efficient way of representing or dealing with something.

Speaker 1 So I think those inventions are possible. I guess my perspective would be, even if they don't happen,

Speaker 1 we're already on this very, very steep trajectory. And so unless, I mean, we're constantly trying to discover them as are, as are others, but things are already on such a fast trajectory.

Speaker 1 All that would do is speed up the trajectory even more.

Speaker 1 And probably not by that much, because it's already going so fast.

Speaker 2 Is something embodied or having an embodied version of a model, is that at all important in terms of getting either data or progress?

Speaker 1 I'd think of that less in terms of the, you know, like a new architecture and more in terms of like a loss function, like the data, the environments you're exposing yourself to end up being very different.

Speaker 1 And so I think that could be important for learning some skills, although data acquisition is hard. And so things have gone through the language route, and I would guess will

Speaker 1 continue to go through the language route, even as more becomes possible in terms of embodiment.

Speaker 2 And then the other possibilities you mentioned, RL, you can see it as...

Speaker 1 Yeah, I mean, we kind of already do RL with RLHF, right? People are like, is this alignment? Is it capabilities? I always think in terms of the two snakes, right?

Speaker 1 They're kind of often hard to distinguish. So we already kind of use RL in these language models, but I think we've used RL less in terms of getting them to take actions and do things in the world.

Speaker 1 But when you take actions over a long period of time and understand the consequences of those actions only later, then RL is a typical tool we have for that.

Speaker 1 So I would guess that, in terms of models taking action in the world, that RL will

Speaker 1 become a thing with all the power and all the safety issues that come with it.

Speaker 2 When you project out in the future, do you see the way in which these things will be integrated into productive supply chains?

Speaker 2 Do you see them talking with each other and criticizing each other and contributing to each other's output? Or is it just one model that one-shots the answer or the work?

Speaker 1 Models will undertake extended tasks. That will have to be the case.
I mean, we may want to limit that to some extent because it may make some of the safety problems easier.

Speaker 1 But, you know, some of that I think will be required. In terms of, are models talking to models, or are they talking to humans? Again, this goes kind of out of the technical realm and into the

Speaker 1 socio-cultural economic realm where my heuristic is always that it's very, very difficult to predict things.

Speaker 1 And so I feel like these scaling laws have been very predictable. But then when you say, like, well, you know, when is there going to be a commercial explosion in these models?

Speaker 1 Or what's the form it's going to be? Or are the models going to do things instead of humans or pairing with humans? I feel like certainly my track record on predicting these things is terrible.

Speaker 1 But also, looking around, I don't really see anyone whose track record is great.

Speaker 2 You mentioned how fast progress is happening, but also the difficulties of integrating within the existing economy into the way things work.

Speaker 2 Do you think there will be enough time to actually have large revenues from AI products before the next model is just so much better or we're in like a different landscape entirely?

Speaker 1 It depends what you mean by large, right? You know, I think multiple companies are already in the hundred million to billion per year range.

Speaker 1 Will it get to the hundred billion or trillion range, you know, before...

Speaker 1 that stuff is just so hard to predict, right? And it's not even super well defined.

Speaker 1 Like, you know, I think right now there are companies that are throwing a lot of money at generative AI as customers, and I think that's the right thing for them to do.

Speaker 1 And they'll find uses for it, but it doesn't mean they're finding the best uses from day one.

Speaker 1 So even money changing hands is not, is not quite the same thing as economic value being created.

Speaker 2 But surely you've thought about this from the perspective of Anthropic, where these things are happening so fast, then it should be an insane valuation, right?

Speaker 1 Even us, who have, you know, not been super focused on commercialization and more on safety, I mean, the graph goes up

Speaker 1 and it goes up relatively quickly. Yeah.
So, you know, I can only imagine what's happening at the orgs where this is their singular focus.

Speaker 1 So it's certainly happening fast, but, you know, again, it's, it's like it's the exponential from the small base while the technology itself is moving fast.

Speaker 1 So it's, it's kind of a race between how fast the technology is getting better and how fast it's integrated into the economy. And I think that's just a very unstable and turbulent process.

Speaker 1 Both things are going to happen fast. But if you ask me exactly how it's going to play out, exactly what order things are going to happen,

Speaker 1 I don't know, and I'm kind of skeptical of the ability to predict.

Speaker 2 But I'm kind of curious about Anthropic specifically. You're a public benefit corporation. Yes.
And rightfully so, since this is an important technology.

Speaker 2 Obviously, shareholder value is not the only thing you care about. But how do you talk to investors who are putting in hundreds of millions, even billions of dollars?

Speaker 2 How do you get them to put in that amount of money without shareholder value being the main concern?

Speaker 1 So I think the LTBT is

Speaker 1 the right thing on this, right? We're going to talk more about the LTBT, but some version of it has been in development since the beginning of

Speaker 1 Anthropic, even formally, right?

Speaker 1 And so from the beginning, even as the body has changed in some ways, it was always the case that this body is going to exist. And it's unusual.

Speaker 1 Every traditional investor who invests in Anthropic looks at this. Some of them are just like, whatever, you run your company how you want.

Speaker 1 Some of them are like, oh my God,

Speaker 1 this body of random people, or to them random people, could move Anthropic in a direction that's totally contrary to our interests.

Speaker 1 And there are legal limits on that, of course, but we have to have this conversation with every investor.

Speaker 1 And then it gets into a conversation of, well, what are the kinds of things that we might do that would be contrary to the interests of traditional investors?

Speaker 1 And just having those conversations has helped get everyone on the same page.

Speaker 2 I want to talk about physics and the fact that so many of the founders and the employees at Anthropic are physicists.

Speaker 2 I mean, we talked in the beginning about the scaling laws and how the power laws from physics are something you see here, but

Speaker 2 what are the actual approaches and ways of thinking from physics that seem to have carried over so well? Is that notion of effective theory super useful?

Speaker 2 What is going on here?

Speaker 1 I mean, I think part of it is just physicists learn things really fast. We have generally found that

Speaker 1 if we hire someone who is a physics PhD or something, that they can learn ML and contribute just very, very quickly in most cases. And because several of our founders, myself, Jared Kaplan,

Speaker 1 Sam McCandlish,

Speaker 1 were physicists, we knew a lot of other physicists, and so we were able to hire them. And now, I don't know how many exactly, there might be 30 or 40 of them here.

Speaker 1 ML is still not a field that has an enormous amount of depth. And so they've been able to get up to speed very quickly.

Speaker 2 Are you concerned that a lot of people who would have been doing physics or

Speaker 2 whatever, or who would have gone into finance, have now been recruited into AI because Anthropic exists?
And you obviously care about AI safety, but

Speaker 2 maybe in the future they leave and get funded to do their own thing. Is it a concern that you're bringing more people into the ecosystem here?

Speaker 1 Yeah, I mean, I think there's a broad set of actions like this, you know, like we're causing GPUs to exist.

Speaker 1 There are a lot of side effects that you can't currently control, that you just incur if you buy into the idea that you need to build frontier models.

Speaker 1 And that's one of them. A lot of them would have happened anyway.
I mean, finance was a hot thing 20 years ago. So physicists were doing it.
Now ML is a hot thing.

Speaker 1 And it's not like we caused them to do it when they had no interest previously. But again, at the margin, you're kind of bidding things up.

Speaker 1 A lot of that would have happened anyway. Some of it wouldn't, but it's all part of the calculus.

Speaker 2 Do you think that Claude has conscious experience? How likely do you think that is?

Speaker 1 This is another of these questions that just seems very unsettled and uncertain.

Speaker 1 One thing I'll tell you is I used to think that we didn't have to worry about this at all until models were operating in rich environments. Not necessarily embodied, but they would need to have a reward function and have kind of long-lived experience.

Speaker 1 So I still think that might be the case. But the more we've looked at these language models, and particularly looked inside them to see things like induction heads,

Speaker 1 the more it seems that a lot of the cognitive machinery you would need for active agents is already present in the base language models.

Speaker 1 So I'm not quite as sure as I was before that we're missing enough of the things you would need.

Speaker 1 I think today's models just probably aren't smart enough that we should worry about this too much, but I'm not 100% sure about this.

Speaker 1 And I do think that as the models get better in a year or two, this might be a very real concern.

Speaker 2 What would change if you found out that they are conscious? Are you worried that you're pushing a negative gradient toward suffering?

Speaker 1 "Conscious" is again one of these words that I suspect will not end up having a well-defined meaning.

Speaker 1 It's a spectrum, right?

Speaker 1 So I don't know. Let's say we discover that I should care about Claude's experience as much as I should care about a dog or a monkey or something.

Speaker 1 Yeah,

Speaker 1 I would be kind of worried. I don't know if their experience is positive or negative.

Speaker 1 Unsettlingly, I also wouldn't know if any intervention we made was more likely to make Claude have a positive versus negative experience, versus not having one at all.

Speaker 1 If there's an area that is helpful with this, it's maybe mechanistic interpretability, because I think of it as neuroscience for models. And so it's possible that we could

Speaker 1 shed some light on this. Although, you know, it's not a straightforward factual question, right? It kind of depends what we mean and what we value.

Speaker 2 We talked about this initially, but I want to get more specific.

Speaker 2 We talked initially about how, now that you're seeing these capabilities ramp up within the human spectrum, you think that the human spectrum is wider than we thought.

Speaker 2 But more specifically, how is the way you think about human intelligence different now that you're seeing

Speaker 2 these abilities emerge at the margin? How does that change your picture of what intelligence is?

Speaker 1 I think for me, the big realization on what intelligence is came with the blob of compute thing, right? It's not that there are all these separate modules,

Speaker 1 or all this complexity.

Speaker 1 You know, Rich Sutton called it the bitter lesson, right? It has many names. It's been called the scaling hypothesis.

Speaker 1 The first few people figured it out around 2017. I mean, you could go further back. I think Shane Legg was maybe the first person who really knew it.

Speaker 1 Maybe Ray Kurzweil, although in a very vague way.

Speaker 1 But, you know, I think the number of people who understood it went up a lot around 2014 to 2017. But I think that was the big realization.
It's like, you know, well, how did intelligence evolve?

Speaker 1 Well, if you don't need very specific conditions to create it, if you can create it just from the right kind of gradient and loss signal, then of course it's not so mysterious how it all happened. It had this click of scientific understanding.

Speaker 1 In terms of watching what the models can do, how has it changed my view of human intelligence? I wish I had something more intelligent to say on that.

Speaker 1 I feel like, I don't know, one thing that's been surprising is like, I thought things might click into place a little more than they do.

Speaker 1 Like, I thought different cognitive abilities might all be connected, that there was more of one secret behind them. But the model just learns various things at different times. It can be very good at coding, but it can't quite prove the prime number theorem yet.

Speaker 1 And I don't know, I mean, I guess it's a little bit the same for humans, although it's weird, the juxtaposition of things it can do and not.

Speaker 1 I guess the main lesson is about having theories of intelligence or how intelligence works. Like,

Speaker 1 again, a lot of these words just kind of like dissolve into a continuum, right? They just kind of like

Speaker 1 dematerialize. I think less in terms of intelligence and more in terms of what we see in front of us.

Speaker 2 Yeah, no, it's really surprising to me. Two things.
One is how discrete these different facets of intelligence

Speaker 2 that contribute to the loss are, rather than it just being one reasoning circuit or one general intelligence.

Speaker 2 And the other thing, talking with you, that is surprising or interesting: many years from now, looking back, it'll be one of those things where people ask, why wasn't this obvious to you?

Speaker 2 If you were seeing these smooth scaling curves, why was there a time when you weren't completely convinced?

Speaker 2 So you've been less public than the CEOs of other AI companies. You know, you're not posting on Twitter.
You're not doing a lot of podcasts except for this one.

Speaker 2 What gives? Why are you off the radar?

Speaker 1 Yeah, I aspire to this and I'm proud of this.

Speaker 1 If people think of me as kind of like boring and low profile, like this is actually kind of what I want.

Speaker 1 So I don't know.

Speaker 1 I've just seen a number of cases, a number of people I've worked with. You could say Twitter, although I mean a broader thing: just kind of attaching your incentives very strongly to the approval or cheering of a crowd.

Speaker 1 I think that can destroy your mind. And in some cases, it can destroy your soul.

Speaker 1 And so I've kind of deliberately tried to be a little bit low profile, because I want to defend my ability to think about things intellectually in a way that's different from other people and isn't tinged by the approval of other people.

Speaker 1 So, you know, I've seen cases of folks who are deep learning skeptics, and they become known as deep learning skeptics on Twitter.

Speaker 1 And then even as it starts to become clear to me that they've kind of changed their mind,

Speaker 1 it's like, this is their thing on Twitter, and they can't change their Twitter persona, and so forth and so on.

Speaker 1 I don't really like the trend of personalizing companies, like the whole cage match between CEOs approach.

Speaker 1 I think it distracts people from the actual merits and concerns of the company in question.

Speaker 1 I kind of want people to judge the nameless bureaucratic institution.

Speaker 1 I want people to think in terms of the nameless bureaucratic institution and its incentives more than they think in terms of me.

Speaker 1 Everyone wants a friendly face, but actually I think friendly faces can be misleading. Okay.

Speaker 2 Well, in this case, this will be a misleading interview, because this has been a lot of fun. You've been a blast to talk to.

Speaker 1 Indeed.

Speaker 2 Yeah, this was a blast. I'm super glad you came on the podcast, and I hope people enjoyed it.

Speaker 1 Thanks. Thanks for having me.

Speaker 2 Hey, everybody. I hope you enjoyed that episode.
As always, the most helpful thing you can do is to share the podcast. Send it to people you think might enjoy it.

Speaker 2 Put it on Twitter, in your group chats, et cetera. Just spread the word.
I appreciate you listening. I'll see you next time.
Cheers.