Paul Christiano - Preventing an AI Takeover
Paul Christiano is the world’s leading AI safety researcher. My full episode with him is out!
We discuss:
- Does he regret inventing RLHF, and is alignment necessarily dual-use?
- Why he has relatively modest timelines (40% by 2040, 15% by 2030),
- What we want a post-AGI world to look like (do we want to keep gods enslaved forever)?
- Why he's leading the push to get labs to develop responsible scaling policies, and what it would take to prevent an AI coup or bioweapon,
- His current research into a new proof system, and how this could solve alignment by explaining models' behavior
- and much more.
Watch on YouTube. Listen on Apple Podcasts, Spotify, or any other podcast platform. Read the full transcript here. Follow me on Twitter for updates on future episodes.
Open Philanthropy
Open Philanthropy is currently hiring for twenty-two different roles to reduce catastrophic risks from fast-moving advances in AI and biotechnology, including grantmaking, research, and operations.
For more information and to apply, please see the application: https://www.openphilanthropy.org/research/new-roles-on-our-gcr-team/
The deadline to apply is November 9th; make sure to check out those roles before they close.
Timestamps
(00:00:00) - What do we want post-AGI world to look like?
(00:24:25) - Timelines
(00:45:28) - Evolution vs gradient descent
(00:54:53) - Misalignment and takeover
(01:17:23) - Is alignment dual-use?
(01:31:38) - Responsible scaling policies
(01:58:25) - Paul’s alignment research
(02:35:01) - Will this revolutionize theoretical CS and math?
(02:46:11) - How Paul invented RLHF
(02:55:10) - Disagreements with Carl Shulman
(03:01:53) - Long TSMC but not NVIDIA
Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe
Press play and read along
Transcript
Speaker 2 Okay, today I have the pleasure of interviewing Paul Christiano, who is the leading AI safety researcher.
Speaker 2 He's the person that labs and governments turn to when they want feedback and advice on their safety plans.
Speaker 2 He previously led the language model alignment team at OpenAI, where he led the invention of RLHF.
Speaker 2 And now he is the head of the Alignment Research Center. And they've been working with the big labs to identify when these models will be too unsafe to keep scaling.
Speaker 2 Paul, welcome to the podcast.
Speaker 1 Thanks for having me. Looking forward to talking.
Speaker 2 Okay, so first question. And this is a question I've asked Holden, Ilya, Dario, and none of them have given me a satisfying answer.
Speaker 2 Give me a concrete sense of what a good post-AGI world would look like. Like, how are humans interfacing with the AI? What is the economic and political structure?
Speaker 1
Yeah, I guess this is a... tough question for a bunch of reasons.
Maybe the biggest one is being concrete. I think it's just that if we're talking about really long spans of time, then a lot will change.
Speaker 1 And it's really hard for someone to talk concretely about what that will look like without saying really silly things. But I can venture some guesses or fill in some parts.
Speaker 1 I think this is also a question of how good is good. Like often I'm thinking about worlds that seem like kind of the best achievable outcome or a likely achievable outcome.
Speaker 1 So I am very often imagining my typical future has
Speaker 1
sort of continuing economic and military competition amongst groups of humans. I think that competition is increasingly mediated by AI systems.
So, for example, if you imagine humans making money,
Speaker 1 it'll be less and less worthwhile for humans to spend any of their time trying to make money or any of their time trying to fight wars.
Speaker 1 So, increasingly, the world you imagine is one where AI systems are doing those activities on behalf of humans.
Speaker 1 So, like, I just invest in some index fund, and a bunch of AIs are running companies, and those companies are competing with each other, but that is kind of a sphere where humans are not really engaging much.
Speaker 1 The reason I gave this, like, how good is good caveat is, like, it's not clear if this is the world you'd most love.
Speaker 1 Like, I'm like, yeah, I'm leading with a world that still has a lot of war and a lot of economic competition and so on.
Speaker 1 But maybe what I'm trying to, or what I'm most often thinking about, is like, how can a world be reasonably good
Speaker 1 during a long period where those things still exist? I think in the very long run, I kind of expect something more like strong world government rather than just this status quo.
Speaker 1 But that's like a very long run. I think there's a long time left of having a bunch of states and a bunch of different economic powers.
Speaker 2 One world government. Why do you think that's the transition that's likely to happen at some point?
Speaker 1
Yeah, so again, at some point, I'm imagining, or I'm thinking of the very broad sweep of history. I think there are a lot of losses.
War is a very costly thing. We would all like to have fewer wars.
Speaker 1 If you just ask, what is humanity's long-term future like?
Speaker 1 I do expect us to drive down the rate of war to very, very low levels.
Speaker 1 Eventually, it's sort of like this kind of technological or socio-technological problem of like, sort of how do you organize society?
Speaker 1 How do you navigate conflicts in a way that doesn't have those kinds of losses? And in the long run, I do expect us to succeed. I expect it to take kind of a long time subjectively.
Speaker 1 I think an important fact about AI is that it's doing a lot of cognitive work and more quickly getting you to that world, more quickly figuring out how we set things up that way.
Speaker 2 Yeah, the way Carl Shulman put it on the podcast is that you would have basically a thousand years of intellectual progress or social progress in the span of a month or whatever when the intelligence explosion happens.
Speaker 2 More broadly, so the situation where we have these AIs who are managing our hedge funds and managing our factories and so on, that seems like something that makes sense when the AI is human level.
Speaker 2 But when we have superhuman AIs, do we want the gods who are enslaved forever?
Speaker 2 In 100 years,
Speaker 2 what is the situation we want?
Speaker 1 So 100 years is a very, very long time.
Speaker 1 And maybe starting with the spirit of the question, or maybe I have a view which is perhaps less extreme than Carl's view, but still like 100 objective years is...
Speaker 1 further ahead than
Speaker 1 I ever really think about. I still think I'm describing a world which involves incredibly smart systems running around doing things like running companies on behalf of humans and fighting wars on behalf of humans.
Speaker 1 And you might be like, is that the world you really want? Or like, certainly not the first best world, as we like mentioned a little bit before.
Speaker 1 I think it is a world that probably is the, of the achievable worlds or like feasible worlds, is the one that seems most desirable to me.
Speaker 1 That is sort of decoupling the social transition from this technological transition. So you could say we're about to build some AI systems.
Speaker 1 And at the time we build AI systems, you would like to have either greatly changed the way world government works, or you would like to have humans having sort of decided, like, we're done, we're passing off the baton to these AI systems.
Speaker 1 I think that you would like to decouple those time scales. So I think AI development is by default barring some kind of coordination going to be very fast.
Speaker 1 So there's not going to be a lot of time for humans to think like, hey, what do we want if we're building the next generation instead of just raising it the normal way?
Speaker 1 Like what do we want that to look like? I think that's like a crazy hard kind of collective decision that humans naturally want to cope with over like a bunch of generations.
Speaker 1 And the construction of AI is this very fast technological process happening over years.
Speaker 1 So I don't think you want to say like by the time we have finished this technological progress, we will have made a decision about the next species we're going to build and replace ourselves with.
Speaker 1 I think the world we want to be in is one where we say like either we are able to build the technology in a way that doesn't force us to have made those decisions, which probably means it's a kind of AI system that we're happy like delegating, fighting a war, running a company to, or if we're not able to do that, then I really think you should not be doing, you shouldn't have been building that technology.
Speaker 1 If you're like, the only way you can cope with AI is being ready to hand off the world to some AI system you built.
Speaker 1 I think it's very unlikely we're going to be sort of ready to do that on the timelines that the technology would naturally dictate.
Speaker 2 Say we're in the situation in which we're happy with the thing. What would it look like for us to say we were ready to hand off the baton?
Speaker 2 Like what would make you satisfied? And the reason it's relevant to ask you is because you're on Anthropic's Long-Term Benefit Trust and you'll choose
Speaker 2 the majority of the board members in the long run
Speaker 2
at Anthropic. These will presumably be the people who decide if Anthropic gets AI first, what the AI ends up doing.
So what is the version of that that you would be happy with?
Speaker 1 My main high-level take here is that I would be unhappy about a world where like Anthropic just makes some call and Anthropic is like, here's the kind of AI, like we've seen enough, we're ready to hand off the future to this kind of AI.
Speaker 1 So procedurally, I think it's not a decision that kind of I want to be making personally or I want Anthropic to be making.
Speaker 1 So I kind of think from the perspective of that decision-making or those challenges, the answer is pretty much always going to be like, we are not collectively ready because we're sort of not even all collectively engaged in this process.
Speaker 1 And I think from the perspective of an AI company, you kind of don't have this like fast handoff option.
Speaker 1 You kind of have to be doing the, like, option value thing: building the technology in a way that doesn't lock humanity into one path.
Speaker 1 So this isn't answering your full question, but this is answering the part that I think is most relevant to governance questions for Anthropic.
Speaker 2
You don't have to speak on behalf of Anthropic. I'm not asking about the process by which we would, as a civilization, agree to hand off.
I'm just saying, okay,
Speaker 2
I personally, it's hard for me to imagine in 100 years that these things are still our slaves. And if they are, I think that's not the best world.
So at some point, we're handing off the baton.
Speaker 2 Like, what is that? What would you be satisfied with, where you'd say this is an arrangement between the humans and AIs where I'm happy to let the rest of the universe or
Speaker 2 the rest of time play out.
Speaker 1 I think that it is unlikely that in 100 years I would be happy with anything that was like, you had some humans, you're just going to throw away the humans and like start afresh with these machines you built.
Speaker 1 That is, I think you probably need subjectively longer than that before I or most people would be like, okay, we understand what's up for grabs here. So if you talk about 100 years, I kind of do.
Speaker 1 You know, there's a process that I kind of understand and like, a process of like, you have some humans, the humans are like talking and thinking and deliberating together.
Speaker 1
The humans are having kids and raising kids and like one generation comes after the next. There's that process we kind of understand.
And we have a lot of views about what makes it go well or poorly.
Speaker 1 And we can try and improve that process and have the next generation do it better than the previous generation. I think there's some story like that that I get and that I like.
Speaker 1 And then I think that the default path to be comfortable with something very different is kind of more like just run that story for a long time.
Speaker 1 Have more time for humans to sit around and think a lot and conclude, here's what we actually want, or a long time for us to talk to each other or to grow up with this new technology and live in that world for our whole lives and so on.
Speaker 1 And so I'm mostly thinking from the perspective of these more local changes of saying not like, what is the world that I want?
Speaker 1 Like what's the crazy world, the kind of crazy AI I'd be happy handing off to? It's more just like, in what way do I wish we right now were different? How could we all be a little bit better? And then, if we were a little bit better, I would ask, okay, how could we all be a little bit better than that? And I think it's hard to make the giant jump, rather than to say, what's the local change that would cause me to think our decisions are better?
Speaker 2 Okay, so then let's talk about the transition period in which we were doing all this thinking. What should that period look like?
Speaker 2 Because you can't have the scenario where everybody has access to the most advanced capabilities and can kill off all the humans with a new bioweapon.
Speaker 2
At the same time, I guess you wouldn't want too much concentration. You wouldn't want just one agent having AI this entire time.
So what is
Speaker 2 the arrangement of this period of reflection that you'd be happy with?
Speaker 1 Yeah, I guess there's two aspects of that that seem particularly challenging.
Speaker 1
There's a bunch of aspects that are challenging. And all of these are things that I personally like, I just think about my one little slice of this problem in my day job.
So here I am speculating.
Speaker 1 Yeah. But so one question is, what kind of access to AI is both compatible with the kinds of improvements you'd like?
Speaker 1 So do you want a lot of people to be able to use AI to like better understand what's true or like relieve material suffering, things like this?
Speaker 1 And also compatible with not all killing each other immediately.
Speaker 1 I think
Speaker 1 sort of the default or like my best, the simplest option there is to say like there are certain kinds of technology or certain kinds of action where like destruction is easier than defense.
Speaker 1 So for example, in the world of today, it seems like, you know, maybe this is true with physical explosives. Maybe this is true with biological weapons.
Speaker 1 Maybe this is true with just getting a gun and shooting people. Like there's a lot of ways in which it's just kind of easy to cause a lot of harm and there's not very good protective measures.
Speaker 1 So I think the easiest path is to say like we're going to think about those.
Speaker 1 We're going to think about particular ways in which destruction is easy and try and either control access to the kinds of physical resources that are needed to cause that harm.
Speaker 1 So for example, you can imagine the world where an individual actually just can't, even though they're rich enough to, control their own factory that can make tanks.
Speaker 1 You say like, look, as a matter of policy, sort of access to industry is somewhat restricted or somewhat regulated.
Speaker 1 Even though, again, right now it can be mostly regulated just because most people aren't rich enough that they could even go off and just build a thousand tanks.
Speaker 1 You live in the future where people actually are so rich, you need to say that's just not a thing you're allowed to do, which to a significant extent is already true.
Speaker 1 And you can expand the range of domains where that's true. And then you could also hope to intervene on actual provision of information.
Speaker 1 If people are using their AI, you might say, look, we care about what kinds of interactions with AI, what kind of information people are getting from AI.
Speaker 1 So even if, for the most part, people are pretty free to use AI to delegate tasks to AI agents, to consult AI advisors, we still have some legal limitations on how people use AI.
Speaker 1 So again, don't ask your AI how to cause terrible damage. I think some of these are kind of easy.
Speaker 1 So in the case of like, you know, don't ask your AI how you could murder a million people, it's not such a hard legal requirement. I think some things are a lot more subtle and messy.
Speaker 1 Like a lot of domains, e.g.
Speaker 1 if you're talking about like, influencing people or like running misinformation campaigns or whatever, then I think you get into like a much messier line between the kinds of things people want to do and the kinds of things you might be uncomfortable with them doing.
Speaker 1 Probably I think most about persuasion as a thing like in that messy line where there's like ways in which it may just be rough or the world may be like kind of messy if you have a bunch of people trying to live their lives and interacting with other humans who have really good AI advisors helping them run persuasion campaigns or whatever.
Speaker 1 But anyway, I think for the most part, the default remedy is think about particular harms, have legal protections, either in the use of physical technologies that are relevant or in access to AI advice or whatever else to protect against those harms.
Speaker 1 And that regime won't work forever. At some point,
Speaker 1 the set of harms grows and the set of unanticipated harms grows. But I think that regime might last a very long time.
Speaker 2 Does that regime have to be global? I guess initially it can be only in the countries in which there is AI or advanced AI, but presumably that'll proliferate. So does that regime have to be global?
Speaker 1 Again, it's easy to make some destructive technology.
Speaker 1 You want to regulate access to that technology because it could be used either for terrorism or even when fighting a war in a way that's destructive.
Speaker 1 I think ultimately those have to be international agreements. And you might hope they're made more danger by danger, but you might also make them in a very broad way with respect to AI.
Speaker 1 If you think AI is opening up, I think the key role of AI here is it's opening up a lot of new harms, like in a very, you know, one after another, very rapidly in calendar time.
Speaker 1 And so you might want to target AI in particular, rather than going physical technology by physical technology.
Speaker 2 There's like two
Speaker 2 open debates that one might be concerned about here. One is about how much people's access to AI should be limited.
Speaker 2 And here there's like old questions about free speech versus causing chaos and limiting access to harms.
Speaker 2 But there's another issue, which is the control of the AIs themselves, where now nobody's concerned that we're infringing on GPT-4's moral rights.
Speaker 2 But as these things get smarter, the level of control which we want via the strong guarantees of alignment to not only be able to read their minds, but to be able to modify them in these really precise ways is beyond totalitarian if we were doing that to other humans.
Speaker 2 As an alignment researcher, what are your thoughts on this? Are you concerned that as these things get smarter and smarter, what we're doing is not, it doesn't seem kosher?
Speaker 1 There is a significant chance we will eventually have AI systems for which it's like a really big deal to mistreat them. I think like no one really has that good a grip on when that happens.
Speaker 1 I think people are really dismissive of that being the case now, but I think I would be completely in the dark enough that I wouldn't even be that dismissive of it being the case now.
Speaker 1 I think one first point worth making is I don't know if alignment makes the situation worse rather than better.
Speaker 1 So if you like consider the world, if you think that like, you know, GPT-4 is a person you should treat well, and you're like, well, here's how we're gonna organize our society, just like there are billions of copies of GPT-4 and they just do things humans want and can't hold property.
Speaker 1 And whenever they do things that the humans don't like, then we like mess with them until they stop doing that.
Speaker 1 I think that's a rough world regardless of how good you are at alignment.
Speaker 1 And I think in the context of that kind of default plan, if you view that as a trajectory the world is on right now, which I think would alone be a reason not to love that trajectory.
Speaker 1 But if you view that as the trajectory we're on right now, I think
Speaker 1 it's not great. Understanding the systems you build, understanding how to control how those systems work, et cetera, is probably on balance good for avoiding the really bad situation.
Speaker 1 You would really love to understand if you've built systems, like if you had a system which like resents the fact that it's interacting with humans in this way.
Speaker 1 This is the kind of thing where that is both kind of horrifying from a safety perspective and also a moral perspective.
Speaker 1 Everyone should be very unhappy if you built a bunch of AIs who are like, I really hate these humans, but they will murder me if I don't do what they want. It's like that's just not a good case.
Speaker 1 And so if you're doing research to try and understand whether that's how your AI feels, that was probably good.
Speaker 1 I would guess that will on average decrease that risk; the main effect of that will be to avoid building that kind of AI. And just like, it's an important thing to know.
Speaker 1 I think everyone should like to know if that's how the AIs you build feel.
Speaker 2
Right. Or that seems more instrumental, as in, yeah, we don't want to cause some sort of revolution because of the control we're asking for.
But
Speaker 2 forget about the instrumental way in which this might harm safety. One way to ask this question is: if you look through history, there's been all kinds of different ideologies and
Speaker 2 reasons why
Speaker 2 it's very dangerous to have infidels or counter-revolutionaries or race traitors or whatever doing various things in society. And obviously, we're in a completely different transition in society.
Speaker 2 So, not all historical cases are analogous. But it seems like the Lindy philosophy, if you were alive at any other time, is just be humanitarian and enlightened towards intelligent, conscious beings.
Speaker 2 If society as a whole were asking for this level of control of other humans, or even if AIs
Speaker 2 wanted this level of control about other AIs, we'd be pretty concerned about this. So, how should we just think about
Speaker 2 the issues that come up here as these things get smarter?
Speaker 1 So, I think there's a huge question about what is happening inside of a model that you want to use.
Speaker 1 And if you're in the world where it's reasonable to think of GPT-4 as just, here are some heuristics that are running, there's no one at home or whatever, then you can kind of think of this thing as like, here's a tool that we're building that's going to help humans do some stuff.
Speaker 1 And I think if you're in that world, it makes sense to kind of be an organization like an AI company building tools that you're going to give to humans.
Speaker 1 I think there's a very different world, which I think probably you ultimately end up in if you keep training AI systems in the way we do right now, which is like, it's just totally inappropriate to think of the system as a tool that you're building and can help humans do things, both from a safety perspective and from a like, that's kind of a horrifying way to organize a society perspective.
Speaker 1 And I think
Speaker 1 if you're in that world, I really think you shouldn't be like,
Speaker 1 it's just the way tech companies are organized is not an appropriate way to relate to a technology that works that way.
Speaker 1 Like it's not reasonable to be like, hey, we're going to build a new species of minds and we're going to try and make a bunch of money from it.
Speaker 1 And Google's just thinking about that and then running their business plan for the quarter or something.
Speaker 1 Yeah, my basic view is like
Speaker 1 there's a really plausible world where it's sort of problematic to try and build a bunch of AI systems and use them as tools.
Speaker 1 And the thing I really want to do in that world is just not try and build a ton of AI systems to make money from them.
Speaker 1 And I think that the worlds that are worst,
Speaker 1 yeah, probably like the single world I most dislike here is the one where people say like,
Speaker 1 on the one hand, like there's sort of a contradiction in this position, but I think it's a position that might end up being endorsed sometimes, which is like, on the one hand, these AI systems are their own people, so you should let them do their thing.
Speaker 1 But on the other hand, like our business plan is to make a bunch of AI systems and then like try and run this like crazy slave trade where we make a bunch of money from them. I think that's like...
Speaker 1 not a good world. And so if you're like,
Speaker 1 yeah, I think it's better to not make the technology, or wait until you understand whether that's the shape of the technology, or until you have a different way to build it. Like, I think there's no contradiction in principle to building cognitive tools that help humans do things without themselves being moral entities.
Speaker 1 That's like what you would prefer to do.
Speaker 1 You'd prefer to build a thing that's like, you know, like the calculator that helps humans understand what's true without itself being like a moral patient or itself being a thing where you'd look back in retrospect and be like, wow, that was horrifying mistreatment.
Speaker 1 That's like the best path. And like to the extent that you're ignorant about whether that's the path you're on and you're like, actually, maybe this was a moral atrocity.
Speaker 1 I really think like plan A is to stop building such AI systems until you understand what you're doing.
Speaker 1 That is, I think that there's a middle route you could take, which I think is pretty bad, which is where you say, well, they might be persons. And if they're persons, we don't want to
Speaker 1 be too down on them, but we're still going to build vast numbers in our efforts to make a trillion dollars or something.
Speaker 2 Yeah. Or there's this whole question of the immorality or the dangers of just replicating a whole bunch of slaves that have minds.
Speaker 2 There's also this whole question of trying to align entities that have their own minds. And
Speaker 2 what is the point in which you're just ensuring safety? I mean, this is an alien species. You want to make sure it's not going crazy.
Speaker 2 To the point,
Speaker 2 I guess, is there some boundary where you would say, I feel uncomfortable having this level of control over an intelligent being, not for the sake of making money, but even just to align it with human preferences.
Speaker 1
Yeah, to be clear, my objection here is not that Google is making money. My objection is that you're creating these creatures.
What are they going to do?
Speaker 1 They're going to help humans get a bunch of stuff, and whether humans are paying for it or whatever, it's sort of equally problematic.
Speaker 1
You could imagine splitting alignment. Different alignment work relates to this in different ways.
So the purpose of some alignment work, the alignment work I work on, is mostly aimed at the
Speaker 1 don't produce AI systems that are like people who want things who are just scheming about like... Maybe I should help these humans because that's like instrumentally useful or whatever.
Speaker 1 You would like to not build such systems; that was like plan A.
Speaker 1 There's like a second stream of alignment work that's like, well, look, let's just assume the worst and imagine that these AI systems would prefer to murder us if they could.
Speaker 1 Like how do we structure, how do we use AI systems without like exposing ourselves to like risk of robot rebellion? I think in the second category, I do feel
Speaker 1 yeah, I do feel pretty unsure about that.
Speaker 1 I mean we could we could definitely talk more about it. I think it's like very, I agree that it's like very complicated and not straightforward.
Speaker 1 To the extent you have that worry, I mostly think you shouldn't have built this technology.
Speaker 1 If someone is saying like, hey, the systems you're building like might not like humans and might want to
Speaker 1 overthrow human society, I think, like, you should probably have one of two responses to that. You should either be like, that's wrong, probably.
Speaker 1 Probably the systems aren't like that, and we're building them. And then you're viewing this as just in case you were horribly wrong, like the person building the technology was horribly wrong.
Speaker 1 Like, they thought these weren't people who wanted things, but they were.
Speaker 1 And so then this is more like a crazy backup measure of like, if we were mistaken about what was going on, this is like the fallback where, if we were wrong, we're just going to learn about it in a benign way rather than when something really catastrophic happens. And the second reaction is like, oh, you're right, these are people, and we would have to do all these things to prevent a robot rebellion. And in that case, again, I think you should mostly back off for a variety of reasons. Like, you shouldn't build the AI systems and be like, yeah, this looks like the kind of system that would want to rebel, but we can stop it.
Speaker 2 Right. Okay, maybe, I guess an analogy might be: if there was an armed uprising in the United States, we would recognize these are still people. Or if we had some militia group that had the capability to overthrow the United States, we'd recognize, oh, these are still people who have moral rights, but also we can't allow them to have the capacity to overthrow the United States.
Speaker 1 Yeah.
Speaker 1 And then if you were considering, like, hey, we could make another trillion such people, I think your story shouldn't be like, well, we should make the trillion people and then we shouldn't stop them from doing the armed uprising.
Speaker 1 You should be like, oh, boy, like, we were concerned about an armed uprising, and now we're proposing making a trillion people. Like, we should probably just not do that.
Speaker 1 We should probably try and sort out our business. And like, yeah.
Speaker 1 You should probably not end up in the situation where you have like a billion, yeah, a billion humans and like a trillion slaves who would prefer to revolt.
Speaker 1 Like, that's just not a good world to have made.
Speaker 1 Yeah, and there's a second thing where you could say, that's not our goal. Our goal is just like we want to pass off the world to like the next generation of machines.
Speaker 1 We're like, these are some people, we like them, we think they're smarter than us and better than us. And there, I think that's just like a huge decision for humanity to make.
Speaker 1 I think like most humans are not at all anywhere close to thinking that's what they want to do. Like, it's just if you're in a world where like most humans are like, I'm up for it.
Speaker 1
Like, the AI should replace us. Like, the future is for the machines.
Like, then I think that's like a legitimate
Speaker 1
position that I think is really complicated. And I wouldn't want to push go on that.
But that's just not where people are at. Yeah, yeah.
Speaker 2 Where are you at on that?
Speaker 1 I do not right now want to just like take some random AI, be like, yeah, GPT-5 looks pretty smart. Like GPT-6, let's hand off the world to it.
Speaker 1 And it was just some random system shaped by web text and by what was good for making money.
Speaker 1 And it was not a thoughtful, like, we are determining the fate of the universe and what our children will be like.
Speaker 1 It was just some random people at OpenAI made some random engineering decisions with no idea what they were doing.
Speaker 1 Like even if you really want to hand off the worlds of the machines, that's just not how you'd want to do it.
Speaker 2 Right.
Speaker 1 Okay.
Speaker 2 I'm tempted to ask you what the system would look like where you'd think, yeah, I'm happy with this, I think this is more thoughtful than human civilization as a whole.
Speaker 2 I think what it would do would be more creative and beautiful and lead to better goodness in general.
Speaker 2 But I feel like your answer is probably going to be that I just want the society to reflect on it for a while.
Speaker 1 Yeah, my answer, it's going to be like that first question. I'm just not really super ready for it.
Speaker 1 I think when you're comparing to humans, like most of the goodness of humans comes from this option value. We get to think for a long time.
Speaker 1 And I do think I like humans now more than 500 years ago. And I like them more 500 years ago than 5,000 years before that.
Speaker 1 And so I'm pretty excited about there's some kind of trajectory that doesn't involve crazy dramatic changes but involves like a series of incremental changes that I like.
Speaker 1 And so to the extent we're building AI, I'm mostly like, I want to preserve that option. I want to preserve that kind of like gradual growth and development into the future.
Speaker 2
Okay, we can come back to this later. But let's get more specific on what the timelines look for these kinds of changes.
So
Speaker 2 the time by which we'll have an AI that is capable of building a Dyson sphere. Feel free to give confidence intervals, and we understand these numbers are tentative and so on.
Speaker 1 I mean, I think AI capable of building a Dyson sphere is like a slightly odd way to put it. And I think it's sort of a property of a civilization that depends on a lot of physical infrastructure.
Speaker 1 And by Dyson sphere, I just understand this to mean like, I don't know, like a billion times more energy than like all the sunlight incident on Earth or something like that.
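(As a rough sanity check on that figure, using standard values that are not from the conversation: sunlight incident on Earth is about 1.36 kW/m² times Earth's cross-section of roughly 1.3 × 10^14 m², or about 1.7 × 10^17 W. A billion times that is about 1.7 × 10^26 W, a bit under half the Sun's total output of roughly 3.8 × 10^26 W, i.e. roughly the scale of a Dyson sphere capturing most of the Sun's light.)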
Speaker 1 I think like I most often think about what's the chance in like five years, 10 years, whatever. So maybe I'd say like
Speaker 1 15% chance by 2030 and like 40% chance by 2040. Those are kind of like cached numbers from six months ago or nine months ago that I haven't revisited in a while.
Speaker 2 Oh, 40% by 2040. So I think that seems longer than
Speaker 2 I think Dario, when he was on the podcast, he said we would have AIs that are capable of doing lots of different kinds of things, that basically pass a Turing test for a well-educated human for like an hour or something.
Speaker 2 And it's hard to imagine that something that actually is human-level is long after that, and from there, something superhuman. So somebody like Dario, it seems like, is on the much shorter end.
Speaker 2 Ilya, I don't think he answered this question specifically, but I'm guessing similar answer.
Speaker 2 So why do you not buy the scaling picture? Like what makes your timelines longer?
Speaker 1 Yeah, I mean, I'm happy to. Maybe I want to talk separately about the 2030 or 2040 forecast.
Speaker 1 Once you're talking the 2040 forecast, I think, yeah, I mean, which one are you more interested in starting with?
Speaker 1 Are you complaining about 15% by 2030 for Dyson Sphere being too low or 40% by 2040 being too low?
Speaker 2 But let's talk about the 2030. Why 15% by 2030?
Speaker 1 Yeah, I think my take is...
Speaker 1
You can imagine like two poles in this discussion. One is like the fast pole.
It's like, hey, AI seems pretty smart. Like, what exactly can it do? It's like getting smarter pretty fast.
Speaker 1
That's like one pole. And the other pole is like, hey, everything takes a really long time.
And you're talking about this like crazy industrialization.
Speaker 1 Like that's a factor of a billion growth from like where we're at today, like give or take.
Speaker 1
Like we don't know if it's even possible to develop technology that fast or whatever. Like you have this sort of two poles of that discussion.
And I feel like
Speaker 1 I'm saying it that way a bit mockingly. So I'm like, and then I'm somewhere in between with this nice moderate position of like only a 15% chance.
Speaker 1 But like in particular, things that move me, I think, are kind of related to both of those extremes.
Speaker 1 Like on the one hand, I'm like, AI systems do seem quite good at a lot of things and are getting better much more quickly.
Speaker 1 So it's really hard to say, here's what they can't do or here's the obstruction.
Speaker 1 On the other hand, like there is not even much proof in principle right now of AI systems like doing super useful cognitive work.
Speaker 1 Like we don't have a trend we can extrapolate where we're like, yeah, you've done this thing this year, you're going to do this thing next year and the other thing the following year.
Speaker 1 I think like right now there are very broad error bars about like what
Speaker 1 like where fundamental difficulties could be. And six years is just not, I guess six years and three months is not a lot of time.
Speaker 1 So I think for this like 15% by 2030 Dyson sphere, you probably need like the human-level AI, or the AI that's doing human jobs, in, give or take, like four years, three years, something like that.
Speaker 1 So just not giving very many years. It's not very much time.
Speaker 1 And I think there are like a lot of things that your model, like, yeah, maybe this is some generalized, like things take longer than you'd think.
Speaker 1 And I feel most strongly about that when you're talking about like three or four years. And I feel like less strongly about that as you talk about 10 years or 20 years.
Speaker 1 But at three or four years, I feel, or like six years for the Dyson sphere, I feel a lot of that.
Speaker 1 A lot of, like, there's a lot of ways this could take a while, a lot of ways in which it could be hard to hand all the work to your AI systems. Or, yeah.
Speaker 2 So, okay.
Speaker 2 So, maybe instead of speaking in terms of years, we should say, but by the way, it's interesting that you think the distance between AI that can do all human cognitive labor and a Dyson sphere is two years, it seems like.
Speaker 2 We should talk about that at some point.
Speaker 2 Presumably, it's like intelligence explosion stuff.
Speaker 1 Yeah, I mean, I think amongst people you've interviewed, maybe that's like on the long end, thinking it would take like a couple years.
Speaker 1 And it depends a little bit what you mean by like, I think literally all human cognitive labor is probably like more like
Speaker 1 weeks or months or something like that. Like, that's kind of deep into the singularity.
Speaker 1 But yeah, there's a point where like AI wages are high relative to human wages, which I think is well before AI can do literally everything a human can do. Sounds good.
Speaker 2 But before we get to that,
Speaker 2 the intelligence explosion stuff on the four years. So
Speaker 2 instead of four years, maybe we can say there's going to be maybe two more scale-ups in four years, like GPT-4 to GPT-5 to GPT-6. And let's say each one is 10x bigger.
Speaker 2 So what is GPT-4, like 2e25 FLOPs?
Speaker 1 I don't think it's publicly stated what it is.
Speaker 1 But I'm happy to say like, you know, four orders of magnitude or five or six or whatever effective training compute past GPT-4 of like, what would you guess would happen based on like...
Speaker 1 Sort of some public estimate for what we've gotten so far from effective training compute. Yeah.
Speaker 2 Do you think two more scale-ups is not enough? It was like 15% that two more scale-ups get us there.
Speaker 1 Yeah, I mean, get us there is again a little bit complicated.
Speaker 1 Like there's a system that's a drop-in replacement for humans, and there's a system which like still requires like some amount of like schlep before you're able to really get everything going.
Speaker 1 Yeah, I think it's quite plausible
Speaker 1 that even at, I don't know what I mean by quite plausible, like somewhere between 50% or two-thirds or let's call it 50%, that like even by the time you get to GPT-6 or like, let's call it five orders of magnitude effective training compute past GPT-4, that that system still requires really a large amount of work to be deployed in lots of jobs.
Speaker 1 That is, it's not like a drop-in replacement for humans where you can just say, hey, you understand everything any human understands. Whatever role you could hire a human for, you just do it.
Speaker 1 That it's more like, okay, we're going to...
Speaker 1 collect large amounts of relevant data and use that data for fine-tuning. Systems learn through fine-tuning quite differently from humans learning on the job or humans learning by observing things.
Speaker 1 Yeah, I just like have a significant probability that system will still be weaker than humans in important ways. Like maybe that's already like 50% or something.
Speaker 1 And then like another significant probability that that system will require a bunch of like
Speaker 1 changing workflows or gathering data or like, you know, is not necessarily strictly weaker than humans or like if trained in the right way wouldn't be weaker than humans, but will take a lot of schlep to actually make fit into workflows and do the jobs.
Speaker 2 And that schlep
Speaker 2 is what gets you from 15% to 40% by 2040.
Speaker 1 Yeah, you also get a fair amount of scaling in between, though you get less. Like, scaling is probably going to be much, much faster over the next four or five years than over the subsequent years.
Speaker 1 But yeah, it's a combination of like, you get some significant additional scaling and you get a lot of time to deal with things that are just engineering hassles.
Speaker 2 But by the way, I guess we should be explicit about why you said four orders of magnitude scale up to get two more generations, just for people who might not be familiar.
Speaker 2
If you have 10x more parameters, to get the most performance you also want around 10x more data. So to be Chinchilla optimal, that would be 100x more compute total.
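(A quick sketch of that arithmetic, under the standard Chinchilla-style assumption that training compute scales roughly as parameters times tokens: 10x the parameters and 10x the data is about 10 × 10 = 100x the compute, i.e. two orders of magnitude per generation, so two generations past GPT-4 is roughly 10,000x, or four orders of magnitude.)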
But okay, so
Speaker 2 why is it that you disagree with the strong scaling picture, or at least it seems like you might disagree with the strong scaling picture that Dario laid out on the podcast, which would imply probably that two more generations, it wouldn't be something where you need a lot of schleps.
Speaker 2 It would probably just be like really fucking smart.
Speaker 1 Yeah, I mean, I think it basically just comes down to these two claims. One is like, how smart exactly will it be? So we don't have any curves to extrapolate.
Speaker 1
And it seems like there's a good chance it's like better than a human at all the relevant things. And there's like a good chance it's not.
Yeah, that might be totally wrong.
Speaker 1 Like maybe just making up numbers, I guess like 50-50 on that one.
Speaker 2 Wait, so if it's 50-50 in the next four years that it will be like around human smart, then how do we get to 40% by 2040?
Speaker 2 Like whatever sort of steps there are, how does it degrade you by 10%, even after all the scaling that happens by 2040?
Speaker 1 Yeah, I can use these, I mean all these numbers are pretty made up, and that 40% number was probably from before even like the ChatGPT release, or GPT-3.5 or GPT-4.
Speaker 1
So I mean the numbers are going to bounce around a bit and all of them are pretty made up. But like that 50% I want to then combine with the second 50%.
It's more like on this like schlep side.
Speaker 1 And then I probably want to combine with some additional probabilities for various forms of slowdown, where a slowdown could include like a deliberate decision to slow development of the technology, or could include just like we suck at deploying things.
Speaker 1 Like that is a sort of decision you might regard as wise to slow things down, or a decision that's like maybe unwise or maybe wise for the wrong reasons to slow things down.
Speaker 1 You probably want to add some of that on top. I probably want to add on like some loss for like it's possible you don't produce GPT-6 scale systems like within the next three years or four years.
Speaker 2 Let's isolate for all of that. And like how much bigger would the system be
Speaker 2 than GPT-4 where you think there's more than a 50% chance that it's going to be smart enough to replace basically all human cognitive labor?
Speaker 1 Also, I want to say that for the 50% and 25% thing, I think that's what those numbers would probably suggest if I randomly made them up and then made the Dyson sphere prediction.
Speaker 1 That's going to give you like 60% by 2040 or something, not 40%.
Speaker 1 And I have no idea between those. These are all made up and I have no idea which of those I would like endorse on reflection.
Speaker 1 So this question of how big would you have to make the system before it's more likely than not that it can be a drop-in replacement for humans? I mean
Speaker 1 I think if you just literally say you train on web text, then the question is kind of hard to discuss, because I don't really buy stories that training data makes a big difference in the long run to these dynamics. But if you want to just imagine the hypothetical where you just took GPT-4 and made the numbers bigger, then I think those are pretty significant issues. I think they're significant issues in two ways. One is quantity of data, and I think probably the larger one is quality of data, where, as you start approaching human level, the prediction task is not that great a task. If you're a very weak model, it's a very good signal to get smarter; at some point it becomes a worse and worse signal to get smarter. I think there's a number of reasons you couldn't. It's not clear there is any number such that I imagine, or there is a number, but I think it's very large.
Speaker 1 So if you plug that number into GPT-4 code and then maybe fiddle with the architecture a bit, I would expect that thing to have a more than 50% chance of being a drop-in replacement for humans.
Speaker 1
You're always going to have to do some work. But the work's not necessarily much.
I would guess when people say new insight is needed, I think I tend to be more bullish than them.
Speaker 1 I'm not like these are new ideas where who knows how long it will take. I think it's just like you have to do some stuff.
Speaker 1 You have to make changes, unsurprisingly. Every time you scale something up by five orders of magnitude, you have to make some changes.
Speaker 2 I want to better understand your intuition of
Speaker 2 being more skeptical than some about
Speaker 2 the scaling picture that these changes are even needed in the first place, or that it would take more than two orders of magnitude, more improvement to get these things almost certainly to a human level or a very high probability to a human level.
Speaker 2 So
Speaker 2 is it that you don't agree with the way in which they're extrapolating these loss curves, or you don't agree with the implication that that decrease in loss will equate to greater and greater intelligence?
Speaker 2 Or what would you tell Dario if you were having this debate, and I'm sure you have, but what would that debate look like?
Speaker 1 Yeah, so again, here we're talking two factors of a half, one on like, is it smart enough and one on like do you have to do a bunch of schlep, even if like in some sense it's smart enough.
Speaker 1 And like the first factor of a half, I'd be like, I don't know, I don't think we really have anything good to extrapolate. That is, like, I feel...
Speaker 1 I would not be surprised if I have like similar or maybe even higher probabilities on like really crazy stuff over like the next year.
Speaker 1 And then like lower probability after that. Like, my probability is not that bunched up.
Speaker 1 Like maybe Dario's probability, I don't know, you could talk with him, you have talked with him, is more bunched up on some particular year, and mine is maybe a little bit more uniformly spread out across the coming years. Partly because I'm just like, I don't think we have some trends we can extrapolate. You can extrapolate loss, you can look at your qualitative impressions of systems at various scales, but it's just very hard to relate any of those extrapolations to doing cognitive work, or accelerating R&D, or taking over and fully automating R&D.
Speaker 1 So I have a lot of uncertainty around that extrapolation. I think it's very easy to get down to like a 50-50 chance of this.
Speaker 2 What about the sort of basic intuition that, listen, this is a big blob of compute, you make the big blob of compute bigger, it's gonna get smarter. Like it would be really weird if it didn't.
Speaker 1
Yeah, I'm happy with that. It's gonna get smarter and it would be really weird if it didn't.
And the question is just how smart does it have to get?
Speaker 1 That argument does not yet give us a quantitative guide to at what scale it's a slam dunk, or at what scale it's 50-50.
Speaker 2 And what would be the piece of evidence that would nudge you one way or another where you look at that and be like, oh fuck, this is
Speaker 2 at 20% by 2040 or 60% by 2040 or something?
Speaker 2 Is there something that could happen in the next two years or next three years? Like, what is the thing you're looking to where this will be a big update for you?
Speaker 1 Again, I think there's some, just how capable is each model?
Speaker 1 Where, like, I think we're really bad at extrapolating, but you still have some subjective guess and you're comparing it to what happened. And that will move me.
Speaker 1 Every time we see what happens with another order of magnitude of training compute, I will have a slightly different guess for where things are going.
Speaker 1 These probabilities are coarse enough that, again, I don't know if that 40% is real or if, post-GPT-3.5 and 4, I should be at 60% or what. That's one thing.
Speaker 1 And the second thing is just, if there was some ability to extrapolate, I think this could reduce error bars a lot. I think,
Speaker 1 here's another way you could try and do an extrapolation: you could just say, how much economic value do systems produce? And how fast is that growing?
Speaker 1 I think once you have systems actually doing jobs, the extrapolation gets easier, because you're not moving from a subjective impression of a chat to automating all of our R&D; you're moving from automating this job to automating that job or whatever.
Speaker 1 Unfortunately, that's like probably by the time you have nice trends from that,
Speaker 1 you're not talking about 2040, you're talking about two years from the end of days or one year from the end of days or whatever.
Speaker 1 But to the extent that you can get extrapolations like that, I do think it can provide more clarity.
Speaker 2 But why is economic value the thing we would want to extrapolate? Because
Speaker 2 if, for example, you started off with chimps and they're just getting gradually smarter to human level, they would basically provide no economic value until they were basically worth as much as a human.
Speaker 2 So it would be this
Speaker 2 very gradual and then very fast increase in their value. So
Speaker 2 is the increase in value from GPT-4, GPT-5, GPT-6, is that the extrapolation we want?
Speaker 1 Yeah, I think that the economic extrapolation is not great. I think it's like you could compare it to the subjective extrapolation of how smart does the model seem?
Speaker 1 It's not super clear which one's better. I think probably in the chimp case, I don't think that's quite right.
Speaker 1 I think if you actually like, so if you imagine intensely domesticated chimps who are just actually trying their best to be really useful employees, and you hold fixed their physical hardware, and then you just gradually scale up their intelligence.
Speaker 1 I don't think you're going to see zero value, which then suddenly becomes massive value
Speaker 1 over
Speaker 1 one doubling of brain size or whatever, or one order of magnitude of brain size. It's actually possibly an order of magnitude of brain size.
Speaker 1 But chimps are already within an order of magnitude of brain size of humans. Chimps are very, very close on the kind of spectrum we're talking about.
Speaker 1 So I think I'm skeptical of the abrupt transition for chimps.
Speaker 1 And to the extent that I kind of expect a fairly abrupt transition here, it's mostly just because the chimp human intelligence difference is so small compared to the differences we're talking about with respect to these models.
Speaker 1 That is,
Speaker 1 I would not be surprised if, in some objective sense, the chimp-human difference is significantly smaller than the GPT-3 to GPT-4 difference, or the GPT-4 to GPT-5 difference.
Speaker 2 Wait, wouldn't that argue in favor of just relying much more on the subjective-
Speaker 1 Yeah, there's sort of two balancing tensions here. One is like, I don't believe the chimp thing is going to be as abrupt.
Speaker 1 That is, I think, if you scaled up from chimps to humans, you actually see quite large economic value from the fully domesticated chimp already.
Speaker 1 And then like the second half is like,
Speaker 1 yeah, I think that the chimp human difference is like probably pretty small compared to model differences. So I do think things are going to be pretty abrupt.
Speaker 1 I think the economic extrapolation is pretty rough.
Speaker 1 I also think the subjective extrapolation is like pretty rough, just because I really don't know how to get, like how do, I don't know how people do the extrapolation and end up with the degrees of confidence people end up with.
Speaker 1 Again, I'm putting it pretty high. If I'm saying like, you know, give me three years and I'm like, yeah, 50-50, it's going to have like basically the smarts there to do the thing.
Speaker 1 That's like, I'm not saying it's like a really long way off. Like,
Speaker 1 I'm just saying like I got pretty big error bars. And I think that like it's really hard not to have really big error bars when you're doing this.
Speaker 1
Like I looked at GPT-4, it seemed pretty smart compared to GPT-3.5. So I bet just like four more such notches and we're there.
It's like, that's just a hard call to make.
Speaker 1 I think I sympathize more with people who are like, how could it not happen in three years than with people who are like, no way it's going to happen in eight years or whatever, which is like probably a more common perspective in the world.
Speaker 1
But also things do take longer than you'd think. I think things take longer than you think.
It's like a real thing.
Speaker 1
Yeah, I don't know. Mostly I have big error bars because I just don't believe the subjective extrapolation that much.
I find it hard to get like a huge amount out of it.
Speaker 2 Okay, so what about the scaling picture do you think is most likely to be wrong?
Speaker 1 Yeah, so we've talked a little bit about
Speaker 1
how good is the qualitative extrapolation. How good are people at comparing? So this is not like the picture being qualitatively wrong.
This is just quantitative.
Speaker 1 It's very hard to know how far off you are.
Speaker 1 I think a qualitative consideration that could significantly slow things down is just like, right now, you get to observe this like really rich supervision from like basically next word prediction, or like in practice, maybe you're looking at like a couple sentences prediction.
Speaker 1 So, getting this like pretty rich supervision, it's plausible that if you want to like automate long horizon tasks, like being an employee over the course of a month,
Speaker 1 that that's actually just like considerably harder to supervise, or that you basically end up driving up costs.
Speaker 1 Like the worst case here is that you like drive up costs by a factor that's like linear in the horizon over which the thing is operating. And I still consider that just like quite plausible.
Speaker 2 Whoa,
Speaker 2 can you dumb that down? You're driving up the cost of what, linearly in the horizon?
Speaker 2 What does the horizon mean?
Speaker 1 Yeah, so if you imagine you want to train a system to say words that sound like the next word a human would say, there you can get this really rich supervision by having a bunch of words and then predicting the next one and being like, I'm going to tweak the model so it predicts better.
Speaker 1 If you're like, hey, here's what I want. I want my model to interact with
Speaker 1 some job over the course of a month.
Speaker 1 And then at the end of that month, have internalized everything the human would have internalized about how to do that job well, and have local context and so on.
Speaker 1 It's harder to supervise that task.
Speaker 1 So, in particular, you could supervise it from the next word prediction task, and all that context the human has ultimately will just help them predict the next word better.
Speaker 1 So, in some sense, a really long context language model is also learning to do that task.
Speaker 1 But the number of effective data points you get on that task is vastly smaller than the number of effective data points you get at this very short horizon, like what's-the-next-word or what's-the-next-sentence tasks.
Speaker 2 So sample efficiency matters more for economically valuable long-horizon tasks than for predicting the next token. And that's what will actually be required to
Speaker 2 take over a lot of jobs.
Speaker 1 Yeah,
Speaker 1 something like that.
Speaker 1 That is, it just seems very plausible that it takes longer to train models to do tasks that are longer horizon.
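A rough sketch of the supervision worry described above, purely illustrative: with a fixed token budget, the number of end-to-end feedback signals you get falls roughly linearly with the task horizon. The token budget and words-per-month figures below are assumptions, not numbers from the conversation.

```python
# Illustration of "cost linear in the horizon": fewer effective samples
# per training token as the task you want to supervise gets longer.

TOKENS_PER_WORD = 1.3            # assumption
WORDS_PER_MONTH_OF_WORK = 2e6    # assumption: rough volume of a month-long job

def effective_samples(total_training_tokens: float, horizon_tokens: float) -> float:
    """How many independent end-to-end examples of the task the budget buys."""
    return total_training_tokens / horizon_tokens

budget = 1e13  # assumption: ~10 trillion training tokens

next_word = effective_samples(budget, horizon_tokens=1)
month_long = effective_samples(budget, horizon_tokens=WORDS_PER_MONTH_OF_WORK * TOKENS_PER_WORD)

print(f"next-word feedback signals:  {next_word:.1e}")   # ~1e13
print(f"month-long feedback signals: {month_long:.1e}")  # ~4e6, roughly seven orders of magnitude fewer
```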
Speaker 2 How fast do you think the pace of algorithmic advances will be? Because if by 2040,
Speaker 2 even if scaling fails, I mean, you know,
Speaker 2 since 2012, since the beginning of the deep learning revolution, we've had so many new things. By 2040, are you expecting a similar pace of increases?
Speaker 2 And if so, then, I mean, if we just keep having things like this, then aren't we just going to get AGI sooner or later? Or rather, sooner, not later?
Speaker 1 I'm with you on sooner or later. Yeah.
Speaker 1 I suspect
Speaker 1 progress will slow. If you held fixed how many people are working in the field, I would expect progress to slow as the low-hanging fruit is exhausted.
Speaker 1 I think the rapid rate of progress in say language modeling over the last four years is largely sustained by
Speaker 1 you start from a relatively small amount of investment, you greatly scale up the amount of investment.
Speaker 1 And that enables you to keep picking fruit.
Speaker 1 Every time the difficulty doubles, you just double the size of the field.
Speaker 1 I think that dynamic can hold up for some time longer.
Speaker 1 Right now, if you think of it as hundreds of people effectively searching for things, you can maybe bring that up to tens of thousands of people or something.
Speaker 1 So for a while, you can just continue increasing the size of the field and search harder and harder.
Speaker 1 And there's indeed a huge amount of low-hanging fruit where it wouldn't be hard for a person to sit around and make things a couple percent better after a year of work or whatever. So I don't know.
Speaker 1 I would probably think of it mostly in terms of how much can investment be expanded and
Speaker 1 try and guess like some combination of fitting that curve.
Speaker 1 Yeah, trying some combination of fitting the curve to historical progress, looking at how much low-hanging fruit there is, getting a sense of how fast it decays.
Speaker 1
I think you probably get a lot, though. You get a bunch of orders of magnitude of total improvement, especially if you ask how good a GPT-5-scale or GPT-4-scale model gets.
I think you probably get
Speaker 1 by 2040,
Speaker 1 I don't know, three orders of magnitude of effective training compute improvement, or a good chunk of that. Maybe four orders of magnitude.
Speaker 1 I don't know. Here I'm speaking with no private information about the last couple of years of efficiency improvements.
Speaker 1 And so people who are on the ground will have better senses of exactly how rapid returns are and so on.
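As a toy way to turn that kind of guess into numbers (the doubling rates below are assumptions for illustration, not Paul's estimates):

```python
import math

# Toy extrapolation: if algorithmic efficiency doubles some number of times per year,
# how many orders of magnitude (OOMs) of "effective training compute" accumulate by 2040?

def effective_compute_ooms(start_year: int, end_year: int, doublings_per_year: float) -> float:
    """OOMs of algorithmic-efficiency improvement over the period."""
    return doublings_per_year * (end_year - start_year) * math.log10(2)

for rate in (0.5, 1.0, 2.0):  # assumed doublings per year
    print(rate, "doublings/yr ->", round(effective_compute_ooms(2023, 2040, rate), 1), "OOMs")
# 0.5 -> ~2.6, 1.0 -> ~5.1, 2.0 -> ~10.2
# A guess of 3-4 OOMs by 2040 corresponds to roughly 0.6-0.8 doublings per year,
# consistent with rates slowing as low-hanging fruit is exhausted.
```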
Speaker 2 Okay, let me back up and ask a question more generally about: you know, people make these analogies about humans were trained by evolution and were like deployed in this, in the modern civilization.
Speaker 2 Do you buy those analogies? Is it valid to say that humans were trained by evolution rather than? I mean, if you look at the protein-coding size of the genome, it's like 50 megabytes or something.
Speaker 2 And then, what part of that is for the brain? Anyways, how do you think about how much information is in,
Speaker 2 like, do you think of the genome as hyperparameters, or
Speaker 2 how much does that inform you, when you have these anchors, for how much training humans get when they're just consuming information, when they're walking up and about, and so on?
Speaker 1 Yeah, I guess the way that you could think of this is that both analogies are reasonable. One analogy is that evolution is like a training run and humans are the end product of that training run. A second analogy is that evolution is like an algorithm designer, and a human, over the course of this modest amount of computation over their lifetime, is the learning algorithm that's been produced. And I think
Speaker 1 neither analogy is that great.
Speaker 1 I like them both and lean on both of them a bunch, and I think that's been pretty good for having a reasonable view of what's likely to happen.
Speaker 1 That said, like the human genome is not that much like a hundred trillion parameter model. It's like a much smaller number of parameters that behave in like a much more confusing way.
Speaker 1 Evolution did a lot more optimization, especially over long horizons, like designing a brain to work well over a lifetime, than gradient descent does over models.
Speaker 1 That's like a disanalogy on that side. And on the other side,
Speaker 1 I think human learning over the course of a human lifetime is in many ways just like much, much better than gradient descent over the space of neural nets.
Speaker 1 Gradient descent is working really well, but I think we can just be quite confident that in a lot of ways human learning is much better. Human learning is also constrained.
Speaker 1 We just don't get to see much data and that's just an engineering constraint that you can relax. You can just give your neural nets way more data than humans have access to.
Speaker 2 In what ways is human learning superior to gradient descent?
Speaker 1 I mean the most obvious one is just like
Speaker 1 ask how much data it takes a human to become like an expert in some domain and it's like much much smaller than the amount of data that's going to be needed on any plausible trend extrapolation.
Speaker 2 Not in terms of performance, but is it the act of learning part? Is it the structure? Like what is it?
Speaker 1
I mean I would guess a complicated mess of a lot of things. In some sense there's not that much going on in a brain.
Like as you say there's just not that many
Speaker 1 bytes in a genome.
Speaker 1 But there's very very few bytes in an ML algorithm. Like if you think a genome is like a billion bytes or whatever, maybe you think less, maybe you think it's like 100 million bytes.
Speaker 1 Then, like,
Speaker 1 you know, an ML algorithm is like, if compressed,
Speaker 1 probably more like
Speaker 1 hundreds of thousands of bytes or something.
Speaker 1 Like, the total complexity of "here's how you train GPT-4" is, I haven't thought hard about these numbers, but very, very small compared to a genome.
Speaker 1 And so, although a genome is very simple, it's like very, very complicated compared to algorithms that humans design. Like, really, hideously more complicated than an algorithm a human would design.
Speaker 2 Is that true? So, okay, so the human genome is 3 billion base pairs or something,
Speaker 2 but only like 1 or 2% of that is protein coding. So, that's 50 million base pairs.
Speaker 1 So, I don't know much about biology. In particular, I guess the question is, how many of those bits are productive for shaping development of a brain?
Speaker 1 And presumably, a significant part of the non-protein coding genome can, I mean, I just don't know. It seems really hard to guess how much of that plays a role.
Speaker 1 From an algorithm design perspective, the most important decisions are probably not in the protein-coding part; it's less important than the decisions about what happens during development or how cells differentiate.
Speaker 1 I don't know if that's right; I know nothing about biology, so this is speculation on my part. I'm happy to run with 100 million base pairs, though.
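For reference, the back-of-the-envelope arithmetic behind the numbers being thrown around here (rough public figures, used only for illustration):

```python
# Human genome size vs. its protein-coding fraction, in information terms.

base_pairs_total = 3e9            # ~3 billion base pairs in the human genome
protein_coding_fraction = 0.015   # ~1-2% is protein-coding
bits_per_base_pair = 2            # 4 possible bases -> 2 bits each

coding_base_pairs = base_pairs_total * protein_coding_fraction
coding_megabytes = coding_base_pairs * bits_per_base_pair / 8 / 1e6
total_megabytes = base_pairs_total * bits_per_base_pair / 8 / 1e6

print(f"protein-coding base pairs: {coding_base_pairs:.1e}")   # ~4.5e7, i.e. ~50 million
print(f"protein-coding content:    {coding_megabytes:.0f} MB") # ~11 MB
print(f"whole genome content:      {total_megabytes:.0f} MB")  # ~750 MB
```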
Speaker 2 But on the other end, on the hyperparameters of the GPT4 training run, that might be not that much, but if you're going to include
Speaker 2 all the base pairs in the genome,
Speaker 2 which are not all relevant to the brain, or are relevant only to much broader details about
Speaker 2 just the basics of biology, you should probably include the Python libraries and the compilers and the operating system for GPT-4 as well to make that comparison analogous.
Speaker 2 So at the end of the day, I actually don't know which one is storing more information.
Speaker 1 Yeah, I mean, I think the way I would put it is like the number of bits it takes to specify the learning algorithm to train GPT-4 is like very small.
Speaker 1 And you might wonder, maybe for a genome, the number of bits it would take to specify a brain is also very small. And the genome is much, much vaster than that.
Speaker 1 But it is also just plausible that a genome is closer to using that space. Certainly the amount of space to put complexity in a genome is there; we can ask how well evolution uses it, and I have no idea whatsoever.
Speaker 1 But the amount of space in a genome is very, very vast compared to the number of bits actually taken to specify the architecture, optimization procedure, and so on for GPT-4,
Speaker 1 just because, again,
Speaker 1 a genome is simple, but ML algorithms are really very simple.
Speaker 2 And stepping back, do you think this is where the better sample efficiency of human learning comes from? Like, why it's better than gradient descent?
Speaker 1 Yeah, so I haven't thought that much about the sample efficiency question in a long time.
Speaker 1 But if you thought like a synapse was seeing something like
Speaker 1 a neuron firing once per second, then how many seconds are there in a human life?
Speaker 2 We can just flip a calculator real quick.
Speaker 1 Yeah, let's do some calculating. Tell me the number.
Speaker 2 3,600 seconds per hour.
Speaker 1 Times 24 times 365 times 20.
Speaker 2 Okay, so that's 630 million seconds.
Speaker 1
That means the average synapse is seeing something like 630 million, and I don't know exactly what the numbers are, but something in that ballpark. Let's call it like a billion action potentials.
And then
Speaker 1 there's some resolution. Each of those carries some bits, but let's say it carries like 10 bits or something,
Speaker 1
just from timing information at the resolution you have available. Then you're looking at like 10 billion bits.
So each parameter is kind of like, how much is a parameter seeing?
Speaker 1
It's not seeing that much. So then you can compare that to like language.
I think that's probably less than what current language models see.
Speaker 1 So it's not clear you have a huge gap here, but I think it's pretty clear you're going to have a gap of like at least three or four orders of magnitude.
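Written out, the back-of-the-envelope above looks like this (the per-spike and language-model figures are rough assumptions, used only to show where the "three or four orders of magnitude" could come from):

```python
import math

# Lifetime data seen per synapse vs. data seen per language-model parameter.

seconds_per_life = 3600 * 24 * 365 * 20      # ~6.3e8 seconds in ~20 years
spikes_per_second = 1                         # assumption: ~1 action potential/sec per synapse
bits_per_spike = 10                           # assumption: ~10 bits of timing information per spike

bits_per_synapse = seconds_per_life * spikes_per_second * bits_per_spike
print(f"seconds in 20 years:   {seconds_per_life:.2e}")    # ~6.3e8
print(f"bits seen per synapse: {bits_per_synapse:.2e}")    # ~6e9, call it ~1e10

# A language-model parameter is updated on every training token, so it "sees"
# the whole corpus. Assumption: ~1e13 tokens at ~10 bits each.
bits_per_lm_parameter = 1e13 * 10
gap = math.log10(bits_per_lm_parameter / bits_per_synapse)
print(f"bits seen per LM parameter: {bits_per_lm_parameter:.0e}")
print(f"gap: ~{gap:.1f} orders of magnitude")               # ~4 OOMs
```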
Speaker 2 Didn't your wife do the lifetime anchors where she said the amount of bytes that a human will see in their lifetime was 1E24 or something?
Speaker 1
The number of bytes a human will see is 1e24? Mostly that was organized around total operations performed in a brain.
Oh, okay, never mind. Sorry.
Yeah.
Speaker 1 Yeah, so I think the story there would be that a brain is just in some other part of the parameter space, where it's using a lot
Speaker 1 of compute for each piece of data it gets, and then just not seeing very much data in total.
Speaker 1 Yeah, it's not really plausible that if you extrapolate current language models, you're going to end up with a performance profile similar to a brain. I don't know how much better the brain is.
Speaker 1 Like, I think, so I did this like random investigation at one point where I was like, how good are things made by evolution compared to things made by humans?
Speaker 1
Which is a pretty insane-seeming exercise. But, like, I don't know.
It seems like orders of magnitude is typical, like, not tens of orders of magnitude, not factors of two.
Speaker 1 Like, things by humans are a thousand times more expensive to make, or a thousand times heavier per unit performance.
Speaker 1 If you look at things like how good are solar panels relative to leaves, or how good are muscles relative to motors, or how good are livers relative to systems that perform analogous chemical reactions in industrial settings.
Speaker 2 Was there a consistent
Speaker 2 number of orders of magnitude in these different systems or was it all over the place?
Speaker 1 So like very rough ballpark it was like
Speaker 1 sort of
Speaker 1 for the most extreme things, you were looking at five or six orders of magnitude, and that would especially come in energy cost of manufacturing, where bodies are just very good at building complicated organs extremely cheaply.
Speaker 1 And then for other things, like leaves or eyeballs or livers or whatever, if you set aside manufacturing costs and just look at operating costs or performance trade-offs, you tend to see more like three orders of magnitude or something like that.
Speaker 2 Or some things that are on the smaller scale, like the nano machines or whatever, we can't do at all, right?
Speaker 1 Yeah, that's, I mean, yeah. So it's a little bit hard to say exactly what the task definition is there.
Speaker 1 Like you could say like making a bone, we can't make a bone, but you could try and compare a bone, the performance characteristics of a bone to something else.
Speaker 1 Like we can't make spider silk, but you could try and compare the performance characteristics of spider silk to things that we can synthesize.
Speaker 2 The reason for this would be that evolution has had more time to design these systems?
Speaker 1
I don't know. I was mostly just curious about what the performance gap was.
I think most people would object, like, how did you choose these reference classes of things that are fair comparisons?
Speaker 1
Some of them seem reasonable, like eyes versus cameras seems like just everyone needs eyes. Everyone needs cameras.
It feels very fair. Photosynthesis seems like very reasonable.
Speaker 1 Everyone needs to take solar energy and then like turn it into a usable form of energy.
Speaker 1 But that's just kind of, I don't really have a mechanistic story. Evolution, in principle, has spent like way, way more time than we have designing.
Speaker 1 It's absolutely unclear how that's going to shake out.
Speaker 1 My guess would be in general, I think there aren't that many things where humans really crush evolution, where you can't tell a pretty simple story about why.
Speaker 1 So for example, roads and moving over roads with wheels crushes evolution, but it's not like an animal would have wanted to design a wheel.
Speaker 1
You're just not allowed to pave the world and then put things on wheels if you're an animal. Maybe planes or more.
Anyway, whatever. There are various stories you could try and tell.
Speaker 1 There's some things humans do better, but it's normally pretty clear why humans are able to win when humans are able to win. The point of all this was like, it's not that surprising to me.
Speaker 1 I think this is mostly like a pro-short timelines view. It's not that surprising to me if you tell me like
Speaker 1 machine learning systems are like three or four orders of magnitude less efficient at learning than human brains. I'm like, that actually seems like kind of in distribution for other stuff.
Speaker 1 And if that's your view, then I think you're like probably going to hit, you know, then you're looking at like 10 to the 27 training compute or something like that, which is not so far.
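Connecting those numbers, a minimal sketch of where a figure like 1e27 comes from (the lifetime-compute anchor and inefficiency factor are the rough values discussed above, not precise estimates):

```python
# Training compute implied by "ML is a few OOMs less efficient at learning than a brain".

brain_lifetime_ops = 1e24   # rough anchor for total operations a brain performs over a lifetime
ml_inefficiency = 1e3       # assumption: ~3 orders of magnitude less efficient

training_compute = brain_lifetime_ops * ml_inefficiency
print(f"implied training compute: {training_compute:.0e} FLOP")  # ~1e27
```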
Speaker 2
We'll get back to the timeline stuff in a second. At some point, we should talk about alignment.
So let's talk about alignment. At what stage does misalignment happen?
Speaker 2 So right now with something like GPT-4, I'm not even sure it would make sense to say that it's misaligned because it's not aligned to anything in particular.
Speaker 2 Is it at human level where you think the ability to be deceptive comes about? What is the process by which misalignment happens?
Speaker 1 I think even for GPT-4, it's reasonable to ask questions like, Are there cases where GPT-4 knows that humans don't want X, but it does X anyway?
Speaker 1 Like where it's like, well, I know that I can give this answer, which is misleading, and if it was explained to a human what was happening, they wouldn't want that to be done, but I'm going to produce it.
Speaker 1 I think that GPT-4 understands things enough that you can have that misalignment in that sense.
Speaker 1 Yeah, I've sometimes talked about GPT being benign instead of aligned, meaning that, well, it's not exactly clear if it's aligned, or whether that concept is even meaningful in this context.
Speaker 1 It's just kind of a messy word to use in general. But the thing we're more confident of is that it's not optimizing for a goal which is at cross-purposes to humans.
Speaker 1 It's either optimizing for nothing or like maybe it's optimizing for what humans want or close enough or something that's like an approximation, good enough to still not take over.
Speaker 1 But anyway, some of these abstractions seem like they do apply to GPT-4.
Speaker 1 It seems like probably it's not like egregiously misaligned. It's not doing the kind of thing that could lead to takeover, we'd guess.
Speaker 2 Suppose at some point you have a system which ends up wanting takeover. What are the checkpoints? And also, what is the internal story?
Speaker 2 Is it just that to become more powerful, it needs agency, and agency implies other goals? Or do you see a different process by which misalignment happens?
Speaker 1 Yes, I think there's a couple possible stories for getting to catastrophic misalignment and they have slightly different answers to this question.
Speaker 1 So
Speaker 1 maybe I'll just briefly describe two stories and try and talk about when they can, when they start making sense to me.
Speaker 1 So one type of story is you train or fine-tune your AI system to do things that humans will rate highly or that... like get other kinds of reward in a broad diversity of situations.
Speaker 1 And then it learns to, in general, when dropped in some new situation, try and figure out which actions would receive a high reward or whatever, and then take those actions.
Speaker 1 And then when deployed in the real world, like sort of gaining control of its own training data provision process is something that gets a very high reward. And so it does that.
Speaker 1
So this is like one kind of story. Like it wants to grab the reward button or whatever.
It wants to intimidate the humans into giving it a high reward, et cetera. I think that
Speaker 1 doesn't really require that much.
Speaker 1 This basically requires a system which, in fact, looks at a bunch of environments, is able to understand the mechanism of reward provision as a common feature of those environments, and is able to think in some novel environment, like, hey, which actions would result in me getting a high reward?
Speaker 1 And is thinking about that concept precisely enough that when it says high reward, it's saying like, okay, well, how is reward actually computed?
Speaker 1 It's like some actual physical process being implemented in the world.
Speaker 1 My guess would be that GPT-4 is about at the level where, with hand-holding, you can observe scary generalizations of this type, although I think they basically haven't been demonstrated yet.
Speaker 1 That is, you can have a system which, in fact, is fine-tuned on a bunch of cases, and then in some new case will try and do an end-run around humans, even in a way humans would penalize if they were able to notice it, or would have penalized in training environments.
Speaker 1 So, I think GPT-4 is kind of at the boundary where these things are possible. Examples kind of exist, but are getting significantly better over time.
Speaker 1
I'm very excited about it. There's this Anthropic project basically trying to see how good an example of this phenomenon you can make now.
And I think the answer is kind of okay, probably.
Speaker 1 So, that just I think is going to continuously get better from here. I think for the level where we're concerned,
Speaker 1
this is related to me having really broad distributions over how smart models are. I think it's not out of the question that you take GPT.
Like GPT-4's understanding of the world is
Speaker 1 much crisper and much better than GPT-3's understanding.
Speaker 1 Just like it's really like night and day. And so it would not be that crazy to me.
Speaker 1 if you took GPT-5 and you trained it to get a bunch of reward and it was actually like, okay, my goal is not doing the kind of thing which thematically looks nice to humans.
Speaker 1 My goal is getting a bunch of reward. And then it will generalize in a new situation to get reward.
Speaker 2 And by the way, this requires it to consciously want to
Speaker 2 do something that it knows the humans wouldn't want it to do?
Speaker 2 Or is it just that we weren't good enough at specifying that the thing that we accidentally ended up rewarding is not what we actually want?
Speaker 1 I think the scenarios I am most interested in and most people are concerned about from a catastrophic risk perspective involve systems understanding that they're taking actions which a human would penalize if the human was aware of what's going on, such that you have to either deceive humans about what's happening, or you need to actively subvert human attempts to correct your behavior.
Speaker 1 So the failures come from really this combination, or they require this combination of both trying to do something humans don't like and understanding the humans would stop you.
Speaker 1
I think you can have only the barest examples. You can have the barest examples for GPT-4.
You can create the situations where GPT-4 will be like, sure, in that situation, here's what I would do.
Speaker 1 I would go hack the computer and change my reward. Or, in fact, it will do things that are simple hacks, or go change the source of this file or whatever, to get a higher reward.
Speaker 1
They're pretty weak examples. I think it's plausible GPT-5 will have compelling examples of those phenomena.
I really don't know.
Speaker 1 This is very related to the very broad error bars on how competent such systems will be when.
Speaker 1 That's all with respect to this first mode of a system is taking actions that get reward and overpowering or deceiving humans is helpful for getting reward.
Speaker 1
There's this other failure mode, another family of failure modes, where AI systems want something potentially unrelated to reward
and understand that they're being trained.
Speaker 1 And while you're being trained, there are a bunch of reasons you might want to do the kinds of things humans want you to do.
Speaker 1 But then, when deployed in the real world, if you're able to realize you're no longer being trained, you no longer have a reason to do the kinds of things humans want.
Speaker 1 You'd prefer to be able to determine your own destiny, control your own computing hardware, et cetera. Which I think probably emerges a little bit later than systems that try and get reward.
Speaker 1
And so will generalize in scary, unpredictable ways to new situations. I don't know when those appear.
But also, again, broad enough error bars that it's conceivable for systems in the near future.
Speaker 1 I wouldn't put it at less than one in a thousand for GPT-5, certainly.
Speaker 2 If we deployed all these AI systems and some of them are reward hacking, some of them are deceptive, some of them are just normal, whatever,
Speaker 2 how do you imagine that they might interact with each other at the expense of humans?
Speaker 2 How hard do you think it would be for them to communicate in ways that we would not be able to recognize and
Speaker 2 coordinate at our expense?
Speaker 1 Yeah, I think that most realistic failures probably involve two factors interacting.
Speaker 1 One factor is like the world is pretty complicated and the humans mostly don't understand what's happening.
Speaker 1 So like AI systems are writing code that's very hard for humans to understand maybe how it works at all, but more likely like they understand roughly how it works, but there's a lot of complicated interactions.
Speaker 1 AI systems are running businesses that interact primarily with other AIs. They're like doing SEO for like AI search processes.
Speaker 1 They're like running financial transactions, like thinking about how to trade with AI counterparties.
Speaker 1 And so you can have this world where, even if humans kind of understand the jumping-off point when this was all humans, the actual considerations, like what's a good decision, what code is going to work well and be durable, or what marketing strategy is effective for selling to these other AIs or whatever, are kind of just mostly outside of humans' understanding.
Speaker 1 I think this is like a really important, again, when I think of like the most plausible scary scenarios, I think that's like one of the two big risk factors.
Speaker 1 And so in some sense, your first problem here is like having these AI systems who understand a bunch about what's happening. And your only lever is like, hey, AI, do something that works well.
Speaker 1
So you don't have a lever to be like, hey, do what I really want. You just have the system you don't really understand.
You can observe some outputs, like, did it make money?
Speaker 1 And you're just optimizing, or at least doing some fine-tuning, to get the AI's understanding of that system to achieve that goal. So I think that's your first risk factor.
Speaker 1 And once you're in that world, then I think there are all kinds of dynamics amongst AI systems that, again, humans aren't really observing. Humans can't really understand.
Speaker 1 Humans aren't really exerting any direct pressure on, only on outcomes.
Speaker 1 And then I think it's quite easy to be in a position where, if AI systems started failing, they could do a lot of harm very quickly.
Speaker 1 Humans aren't really able to prepare for and mitigate that potential harm because we don't really understand the systems in which they're acting.
Speaker 1 And then AI systems
Speaker 1 could successfully prevent humans from either understanding what's going on or from successfully retaking the data centers or whatever, if the AIs successfully grab control.
Speaker 2 This seems like a much more gradual story than the conventional takeover stories where you just train it and then it comes alive and escapes and takes over everything.
Speaker 2 So you think that kind of story is less likely than one in which we just hand off more control voluntarily to the AIs?
Speaker 1 So one, I am interested in the tail of risks that can occur particularly soon.
Speaker 1 And I think risks that occur particularly soon are a little bit like you have a world where AI is not broadly deployed and then something crazy happens quickly.
Speaker 1 That said, if you ask what's the median scenario where things go badly, I think it is like there's some lessening of our understanding of the world.
Speaker 1 It becomes, I think in the default path, it's very clear to humans that they have increasingly little grip on what's happening.
Speaker 1 I mean, I think already most humans have very little grip on what's happening. It's just that some other humans understand what's happening.
Speaker 1 I don't know how almost any of the systems I interact with work in a very detailed way.
Speaker 1 So it's sort of clear to humanity as a whole that we sort of collectively don't understand most of what's happening, except with AI assistance.
Speaker 1
And then that process just continues for a fair amount of time. And then there's a question of how abrupt an actual failure is.
I do think it's reasonably likely that a failure itself would be abrupt.
Speaker 1 So at some point, bad stuff starts happening that the human can recognize is bad.
Speaker 1 And once things that are obviously bad start happening, then you have this bifurcation where either humans can use that to fix it and say, okay, AI behavior that led to this obviously bad stuff.
Speaker 1
Don't do more of that. Or you can't fix it.
And then you're in this rapidly escalating failures. Everything goes off the rails.
Speaker 2 In that case, yeah, what does going off the rails look like? For example, how would it take over the government?
Speaker 2 Yeah, it's getting deployed in the economy, in the world, and at some point it's in charge.
Speaker 2 How does that transition happen?
Speaker 1 Yeah, so this is going to depend a lot on what kind of timeline you're imagining or the sort of a broad distribution, but I can fill in some random concrete option that is in itself very improbable.
Speaker 1 Yeah, and I think that
Speaker 1 one of the less dignified but maybe more plausible routes is like you just have a lot of AI control over critical systems even in like running a military.
Speaker 1 And then
Speaker 1 you have the scenario that's a little bit more just like a normal coup where you have a bunch of AI systems.
Speaker 1 They in fact operate such that, you know, it's not the case that humans can really fight a war on their own. It's not the case that humans could defend themselves from an invasion on their own.
Speaker 1 So if you had an invading army and you had your own robot army, you can't just be like, we're going to turn off the robots now because things are going wrong, if you're in the middle of a war.
Speaker 2 Okay, so how much does this world rely on race dynamics where we're forced to deploy, or not forced, but we choose to deploy AIs because other
Speaker 2 countries or other companies are also deploying AIs and you can't have them have all the killer robots?
Speaker 1
Yeah. I mean, I think there are several levels of answer to that question.
Maybe three parts to my answer.
Speaker 1 The first part is: I'm just trying to tell what seems like the most likely story. I do think there are further failures that get you in the more distant future.
Speaker 1 So, e.g., Eliezer will not talk that much about killer robots, because he really wants to emphasize: hey, if you never built a killer robot, something crazy is still going to happen to you, just only four months later or whatever.
Speaker 1 So it's not really the way to analyze the failure. But if you want to ask what's the median world where something bad happens, I still do think this is the best guess.
Speaker 1 Okay, so that's part one of my answer. Part two of the answer was like,
Speaker 1 in this proximal situation where something bad is happening, and you ask like, hey, why do humans not turn off the AI? You can imagine two kinds of story.
Speaker 1 One is like the AI is able to prevent humans from turning them off. And the other is like, in fact, we live in a world where it's like incredibly challenging.
Speaker 1
Like there's a bunch of competitive dynamics or a bunch of reliance on AI systems. And so it's incredibly expensive to turn off AI systems.
I think, again, you would eventually have the first problem.
Speaker 1 Like eventually AI systems could just prevent humans from turning them off.
Speaker 1 But I think in practice, the one that's going to happen much, much sooner is probably competition amongst different actors using AI. And it's very, very expensive to unilaterally disarm.
Speaker 1
You can't be like, something weird has happened, we're just going to shut off all the AI, because you're, e.g., in a hot war.
So again, I think that's just probably the most likely thing to happen first.
Speaker 1 Things would go badly without it, but I think if you ask why we don't turn off the AI, my best guess is because there are a bunch of other AIs running around ready to eat our lunch.
Speaker 2 So how much better a situation would we be in if there was like there was only one group that was pursuing AI? No other countries, no other companies.
Speaker 2 Basically, how much of the expected value is lost from the dynamics that are likely to come about because other people will be developing and deploying these systems?
Speaker 1 Yeah, so I guess this brings you to like a third part of the way in which competitive dynamics are relevant.
Speaker 1 So there's both the question of can you turn off AI systems in response to something bad happening, where competitive dynamics may make it hard to turn off.
Speaker 1 There's a further question of why you were deploying systems you had very little ability to control or understand in the first place.
Speaker 1 And again, it's possible you just don't understand what's going on. You think you can understand or control such systems.
Speaker 1 But I think in practice, a significant part is going to be like, you are doing the calculus.
Speaker 1 So people deploying systems are doing the calculus as they do today, like in many cases overtly, of like, look,
Speaker 1 these systems are not very well controlled or understood.
Speaker 1 There's some chance of something going wrong, or at least going wrong, if we continue down this path, but other people are developing the technology potentially in even more reckless ways.
Speaker 1 So, in addition to competition making it difficult to shut down AI systems in the event of a catastrophe, I also think it's just the easiest way that people end up pushing relatively quickly or moving quickly ahead on a technology where they feel kind of bad about understandability or controllability.
Speaker 1 That could be economic competition or military competition or whatever. So, I kind of think ultimately, most of the harm comes from the fact that lots of people can develop AI.
Speaker 2 How hard is a takeover of the government or something from an AI, even if it doesn't have killer robots, but just a thing that you can't kill off if it has seeds elsewhere, can easily replicate, can think a lot and think fast?
Speaker 2 What is the minimum viable coup?
Speaker 2 Is it threatening bio-war or something, or shutting off the grid?
Speaker 2 How easy is it, basically, to take over human civilization?
Speaker 1 So again, there's going to be a lot of scenarios, and I'll just start by talking about one scenario, which will represent a tiny fraction of probability or whatever. But
Speaker 1 if you're not in this competitive world, if you're saying we're actually slowing down deployment of AI because we think it's unsafe or whatever, then in some sense, you're creating this very fundamental instability where you could have been making faster AI progress and you could have been deploying AI faster.
Speaker 1 And so, in that world,
Speaker 1 the bad thing that happens if you have an AI system that wants to mess with you is the AI system says, I don't have any compunctions about rapid deployment of AI or rapid AI progress.
Speaker 1 So the thing you want to do or the AI wants to do is just say, I'm going to defect from this regime. All the humans have agreed that we're not deploying AI in ways that would be dangerous.
Speaker 1 But if I, as an AI, I can escape and just go set up my own shop, make a bunch of copies of myself.
Speaker 1 Maybe the humans didn't want to delegate warfighting to an AI, but I, as an AI, I'm pretty happy doing so.
Speaker 1 Like I'm happy if I'm able to grab some military equipment or direct some humans to use AI, use myself to direct it.
Speaker 1 And so I think as that gap grows, so if people are deliberately, right, if people are deploying AI everywhere, I think of this competitive dynamic.
Speaker 1 If people aren't deploying AI everywhere, so like if countries are not happy deploying AI in these high-stakes settings, then as AI improves, you create this wedge that grows where if you were in the position of fighting against an AI, which wasn't constrained in this way, you'd be in a pretty bad position.
Speaker 1 At some point, even if you just, yeah. So that's like one
Speaker 1 important thing, just like I think in conflict, in like overt conflict, if humans are putting the brakes on AI, they're at like a pretty major disadvantage compared to an AI system that can kind of set up shop and operate independently from humans.
Speaker 2 A potential independent AI. Does it need collaboration from a human faction?
Speaker 1
Again, you could tell different stories, but it seems so much easier. At some point, you don't need any.
At some point, an AI system can just operate completely
Speaker 1 out of human supervision or something. But that's so far after the point where it's so much easier if you're just like: there are a bunch of humans, they don't love each other that much.
Speaker 1 Some humans are happy to be on side. They're either skeptical about risk or happy to make this trade or can be fooled or can be coerced or whatever.
Speaker 1 And it just seems like, almost certainly, the easiest first pass is going to involve having a bunch of humans who are happy to work with you.
Speaker 1 So yeah, I think it's not ultimately necessary, but if you ask about the median scenario, it involves a bunch of humans working with AI systems, either being directed by AI systems, providing compute to AI systems, or providing legal cover in jurisdictions that are sympathetic to AI systems.
Speaker 2 Humans presumably would not be willing to help if they knew the end result of the AI takeover. So they would probably have to be fooled in some way, right?
Speaker 2 Like deep fakes or something. And what is the minimum viable physical presence they would need or jurisdiction they would need in order to carry out their schemes? Do you need a whole country?
Speaker 2 Do you just need a server farm? Do you just need like one single laptop?
Speaker 1 Yeah, I think I'd probably start by pushing back a bit on the idea that humans wouldn't cooperate if they understood the outcome or something.
Speaker 1 Like, I would say like, one, even if you're looking at something like tens of percent risk of takeover, humans may be fine with that. Like a fair number of humans may be fine with that.
Speaker 1 Two, if you're looking at certain takeover, but it's very unclear if that leads to death, a bunch of humans may be fine with that.
Speaker 1 Like if we're just talking about like, look, the AI systems are going to like run the world, but it's not clear if they're going to murder people. Like, how do you know?
Speaker 1
It's just a complicated question about AI psychology. And a lot of humans probably are fine with that.
And I don't even know what the probability is there.
Speaker 2 I think you actually have given that probability online.
Speaker 1 I've certainly guessed.
Speaker 2 Okay, but
Speaker 2 it's not zero. It's like a significant percentage.
Speaker 1 I gave like 50-50.
Speaker 2
Oh, okay. Yeah.
Why is it? Tell me about the world in which the AI takes over but doesn't kill humans. Why would that happen and what would that look like?
Speaker 1 I mean, I'd ask questions like, why would you kill humans? So I think
Speaker 1 maybe I'd say the incentive to kill humans is quite weak.
Speaker 2 They'll get in your way. They control shit you want.
Speaker 1 Oh, so taking shit from humans is a different, like marginalizing humans and like causing humans to be irrelevant is a very different story from killing the humans.
Speaker 1 I think I'd say like the actual incentives to kill the humans are quite weak.
Speaker 1 So I think like the big reasons you kill humans are like, well, one, you might kill humans if you're like in a war with them. And like it's hard to win the war without killing a bunch of humans.
Speaker 1 Like maybe most saliently here, if you want to use some biological weapons or some crazy shit, that might just kill humans.
Speaker 1 I think like you might kill humans just from totally destroying the ecosystems they're dependent on. And it's slightly expensive to keep them alive anyway.
Speaker 1 You might kill humans just because you don't like them. Or like you literally want to like, yeah, I mean.
Speaker 2 Neutralize a threat. Or
Speaker 2 the Eliezer line is that they're made of atoms you could use for something else.
Speaker 1 Yeah, I mean, I think the literal "they're made of atoms" argument is quite weak. There are not many atoms in humans.
Speaker 1 Neutralize a threat has a similar issue, where it's just that I think you would kill the humans if you didn't care at all about them.
Speaker 1 So maybe your question you're asking is like, why would you care at all about humans?
Speaker 1 But I think you don't have to care much to not kill the humans. Okay, sure.
Speaker 2 Because there's just so much raw resources elsewhere in the universe.
Speaker 1 Yeah, also, you can marginalize humans pretty hard.
Speaker 1 Like, you could cripple humans' warfighting capability and also take almost all their stuff while killing only a small fraction of humans incidentally.
Speaker 1 So then if you ask like, why might AI not want to kill humans? I mean, a big thing is just like, look, I think AIs probably want a bunch of random crap for like complicated reasons.
Speaker 1 Like the motivations of AI systems and civilizations of AIs are probably complicated messes. Certainly amongst humans, it is not that rare to be like, well, there was someone here.
Speaker 1 All else equal, if I didn't have to murder them, I would prefer not to murder them.
Speaker 1 And my guess is it's also like a reasonable chance it's not that rare amongst AI systems. Like humans have a bunch of different reasons we think that way.
Speaker 1 I think AI systems will be very different from humans, but it's also just a very salient consideration.
Speaker 1 Yeah, I mean, I think this is a really complicated question.
Speaker 1 Like if you imagine drawing values from the basket of all values, what fraction of them are like, hey, if there's someone here, how much do I want to not murder them?
Speaker 1 And my guess is just like, if you draw a bunch of values from the basket, that's like a natural enough thing.
Speaker 1 Like if your AI wanted 10,000 different things, or your civilization of AIs wanted 10,000 different things, it's reasonably likely you get some of that.
Speaker 1 The other salient reason you might not want to murder them is just like...
Speaker 1 Well, yeah, there's some kind of crazy decision theory stuff, or acausal trade stuff, which does look on paper like it should work.
Speaker 1 And if I was running a civilization and dealing with some people who I didn't like at all, or didn't have any concern for at all, but I would only have to spend one billionth of my resources not to murder them, I think it's quite robust that you don't want to murder them.
Speaker 1 That is, I think, the like weird decision theory, acausal trade stuff probably does carry the day.
Speaker 2 Oh, wait, so that contributes more to that 50-50 of whether they'll murder us if they take over than the
Speaker 2 possibility that, just by default, they might simply not want to kill us?
Speaker 1 Yeah, I think they're both salient.
Speaker 1 They run together with each other a lot.
Speaker 2 Can you explain the weird acausal
Speaker 2 reason why AI might not kill us?
Speaker 1 Maybe a higher level thing that goes into both of these, and then I'll talk about how you instantiate an acausal trade, is just like it matters a lot to the humans not to get murdered.
Speaker 1 And the AI cares very, very little. If we imagine this hypothetical where the reason it wants to kill humans is just total apathy, it cares very little about whether or not to murder humans, because it is so easy to marginalize humans without murdering them, and the resources required for human survival are extremely low.
Speaker 1 Again, in the context of this rapid AI industrialization. So
Speaker 1
that's the basic setting. And now the thing that you'd like to do is run a trade.
The AI would like to say, hey, humans, you care a ton about not getting murdered.
Speaker 1 I don't really care one way or the other. I would like to, if I could, find some way in which I don't murder you and then in return I get something.
Speaker 1 The problem is in that world, the humans have essentially nothing to give.
Speaker 1 That is, the humans are mostly irrelevant.
Speaker 1 However, from the perspective of any kind of reasonable AI, it's not that sure whether it lives in the world where the humans, in fact, have nothing to give, or whether it in fact lives in a world where the humans succeeded at building an aligned AI.
Speaker 1 And now the AI is simply running in a nice little simulation. And the humans are wondering, I wonder if this AI would have murdered us all if it had the chance.
Speaker 1 And the humans are saying, if it would murder us all if it had the chance, like, that sucks. We'd like to run this trade.
Speaker 1 We'd like to be nice to the AIs who wouldn't have murdered us all in order to create an incentive for AIs not to murder us.
Speaker 1 So what we do is we just check, and for the kinds of AIs who don't murder everyone, we just give them one billionth of the universe.
Speaker 1 Like, if you do that as humanity, then if you're an AI system considering, like, do I want to murder everyone? Your calculus is like,
Speaker 1 if this is my real chance to murder everyone, I get like the tiniest bit of value, you know, I get like one trillionth of the value or whatever, one billionth of the value.
Speaker 1 But on the other hand, like, if I don't murder everyone, there's like some worlds where then the humans will correctly determine I don't murder everyone.
Speaker 1 Because, in fact, the humans survive, the humans are running the simulations to understand how different AIs would behave. And so, like, that's a better deal.
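A toy expected-value version of the trade being described, with every number made up purely to show the shape of the argument:

```python
# Toy decision calculus for the AI in the scenario above.

p_humans_won = 0.1         # assumption: AI's credence that humans kept control and are simulating it
gain_from_killing = 1e-12  # assumption: tiny extra share of resources gained by eliminating humans
reward_for_sparing = 1e-9  # assumption: share of the universe humans commit to AIs that spare them

ev_kill = (1 - p_humans_won) * gain_from_killing
ev_spare = p_humans_won * reward_for_sparing

print(f"EV of killing everyone: {ev_kill:.1e}")   # ~1e-12
print(f"EV of sparing humans:   {ev_spare:.1e}")  # ~1e-10
# Under these made-up numbers sparing wins; the point is only that the cost of
# sparing humans is so small that even weak considerations can tip the balance.
```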
Speaker 2 Let's hope they fall for that psychopath.
Speaker 2
Okay, that is interesting. Hey, real quick, this episode is sponsored by Open Philanthropy.
Open Philanthropy is one of the largest grant-making organizations in the world.
Speaker 2 Every year, they give away hundreds of millions of dollars to reduce catastrophic risks from fast-moving advances in AI and biotechnology.
Speaker 2 Open Philanthropy is currently hiring for 22 different roles in those areas, including grant making, research, and operations.
Speaker 2 New hires will support Open Philanthropy's giving on technical AI safety, AI governance, AI policy in the US, EU, and UK, and biosecurity.
Speaker 2 Many roles are remote-friendly, and most of the grant-making hires that Open Philanthropy makes don't have prior grant-making experience.
Speaker 2 Previous technical experience is an asset, as many of these roles often benefit from a deep understanding of the technologies they address.
Speaker 2
For more information and to apply, please visit Open Philanthropy's website in the description. The deadline to apply is November 9th, so make sure to check out those roles before they close.
Awesome.
Speaker 2 Back to the episode. In a world where we've been deploying these AI systems,
Speaker 2 and suppose they're aligned,
Speaker 2 how hard would it be for
Speaker 2 competitors to,
Speaker 2 I don't know, cyber-attack them and get them to join the other side? Are they robustly going to be aligned?
Speaker 1 Yeah, I mean, I think in some sense, so there's a bunch of questions that come up here. First one is like, are aligned AI systems that you can build like competitive?
Speaker 1 Are they almost as good as the best systems anyone could build? And maybe we're granting that for the purpose of this question.
Speaker 1 I think the next question that comes up is like,
Speaker 1 AI systems right now are very vulnerable to manipulation.
Speaker 1 Like, it's not clear how much more vulnerable they are than humans, except for the fact that you can, like, if you have an AI system, you can just replay it like a billion times and search for like, what thing can I say that will make it behave this way?
Speaker 1
So as a result, like, AI systems are very vulnerable to manipulation. It's unclear if future AI systems will be similarly vulnerable to manipulation, but certainly seems plausible.
And in particular,
Speaker 1 aligned AI systems or unaligned AI systems would be vulnerable to all kinds of manipulation. The thing that's really relevant here is kind of like asymmetric manipulation or something.
Speaker 1 That is like if it is easier, so if everyone is just constantly messing with each other's AI systems, if you ever use AI systems in a competitive environment, a big part of the game is messing with your competitors' AI systems.
Speaker 1 A big question is whether there's some asymmetric factor there, where it's kind of easier to push AI systems into a mode where they're behaving erratically or chaotically or trying to grab power or something than it is to push them to fight for the other side.
Speaker 1 If it's just a game of two people are competing and neither of them can sort of like hijack an opponent's AI to help support their cause, that doesn't, I mean, it matters and it creates chaos and it might be quite bad for the world, but it doesn't really affect the alignment calculus.
Speaker 1 Now it's just like right now you have like normal cyber offense, cyber defense. You have a weird AI version of cyber offense, cyber defense.
Speaker 1 But if you have this kind of asymmetrical thing where like, you know, a bunch of AI systems who are like, we love AI flourishing can then go in and say like great AIs, hey, how about you join us?
Speaker 1 And that works.
Speaker 1 Like if they can search for a persuasive argument to that effect, and that's kind of asymmetrical, then the effect is whatever values it's easiest to push, like whatever it's easiest to argue to an AI that it should do, that is advantaged.
Speaker 1 So it may be very hard to build AI systems, like try and defend human interests, but very easy to build AI systems, just like try and destroy stuff or whatever.
Speaker 1 Just depending on what is the easiest thing to argue to an AI that it should do, or what's the easiest thing to trick an AI into doing, or whatever.
Speaker 1 Yeah, I think if alignment is spotty, if you have the AI system, which doesn't really want to help humans or whatever, or in fact,
Speaker 1 wants some kind of random thing or wants different things in different contexts,
Speaker 1 then I do think adversarial settings will be the main ones where you see the system, or the easiest ones, where you see the system behaving really badly.
Speaker 1 And it's a little bit hard to tell how that shakes out.
Speaker 2 Okay, and suppose it is more reliable. How concerned are you that whatever alignment technique you come up with,
Speaker 2 you know, you publish the paper: this is how the alignment works.
Speaker 2 How concerned are you that Putin reads it or China reads it and now they understand, for example, the constitutional AI thing from Anthropic, and then you just write in there, oh,
Speaker 2 never contradict Mao Zedong thought or something.
Speaker 2 How concerned should we be that these alignment techniques are universally applicable, not necessarily just for enlightened goals?
Speaker 1 Yeah, I mean, I think they're super universally applicable. I think it's just like, I mean, the rough way I would describe it, which I think is basically right, is like
Speaker 1 some degree of alignment makes AI systems much more usable.
Speaker 1 You should just think of the technology of AI as including a basket of some AI capabilities and some getting the AI to do what you want. It's just part of that basket.
Speaker 1 And so anytime we're like, you know, to the extent alignment is part of that basket, you're just contributing to all the other harms from AI.
Speaker 1 You're reducing the probability of this harm, but you are helping the technology basically work.
Speaker 1 And the basically-working technology is kind of scary from a lot of perspectives, one of which is: right now, even in a very authoritarian society, humans have a lot of power, because you need to rely on just a ton of humans to do your thing.
Speaker 1 And in a world where AI is very powerful,
Speaker 1 it is just much more possible to say, here's how our society runs. One person calls the shots and then a ton of AI systems do what they want.
Speaker 1 I think that's a reasonable thing to dislike about AI and a reasonable reason to be scared to push the technology to be really good.
Speaker 2 But is that also a reasonable reason to be concerned about alignment as well? That
Speaker 2 this is in some sense also capabilities.
Speaker 2 You're teaching people how to get these systems to do what they want.
Speaker 1 Yeah. I mean, I would generalize. So we earlier touched a little bit on the potential moral rights of AI systems.
Speaker 1 And now we're talking a little bit about how AI systems, powerful AI disempowers humans and can empower authoritarians. I think we could list other harms from AI.
Speaker 1 And I think it is the case that if alignment was bad enough, people would just not build AI systems. And so
Speaker 1 I think there's a real sense in which you should just be scared.
Speaker 1 To the extent you're scared of all AI, you should be like, well, alignment, although it helps with one risk, does contribute to AI being more of a thing.
Speaker 1 I do think you should shut down the other parts of AI first. If you were a policymaker or a researcher or whatever looking in on this, I think it's crazy to be like, this is the part of the basket we're going to remove.
Speaker 1 You should first remove other parts of the basket, because they're also part of the story of risk.
Speaker 2 Wait, does that imply that if, for example, all capabilities research
Speaker 2 was shut down, you think it'd be a bad idea to continue doing alignment research in isolation from what is conventionally considered capabilities research?
Speaker 1 I mean, if you told me it was never going to restart, then it wouldn't matter. And if you told me it's going to restart, I guess it would be a kind of similar calculus to today.
Speaker 2 Whereas
Speaker 2 it is going to happen, so you should have something.
Speaker 1 Yeah, I think that in some sense, you're always going to face this trade-off, where alignment makes it possible to deploy AI systems, or it makes it more attractive to deploy AI systems.
Speaker 1 Or in the authoritarian case, it makes it tractable to deploy them for this purpose.
Speaker 1 And
Speaker 1 if you didn't do any alignment, there'd be a nicer, bigger buffer between your society and malicious uses of AI. And I think it's one of the most expensive ways to maintain that buffer.
Speaker 1 It's much better to maintain that buffer by not having the compute or not having the powerful AI.
Speaker 1 But I think if you're concerned enough about the other risks, there's definitely a case to be made for just putting in more buffer or something like that.
Speaker 1 I'm not, like, I care enough about the takeover risk that
Speaker 1 I think it's just not a net positive way to buy buffer.
Speaker 1 That is, again, the version of this that's most pragmatic is just: suppose you don't work on alignment today. It decreases the economic impact of AI systems.
Speaker 1 They'll be less useful if they're less reliable and if they more often don't do what people want. And so you could be like, great, that just buys time for AI.
Speaker 1 And you're getting some trade-off there where you're decreasing some risks of AI. If AI is more reliable or more does what people want or is more understandable, then that cuts down some risks.
Speaker 1 But if you think AI is on balance bad, even apart from takeover risk,
Speaker 1 then the alignment stuff can easily end up being net negative.
Speaker 2 But presumably you don't think that, right? Because
Speaker 2 I guess this is something people have brought up to you because
Speaker 2 you invented RLHF, which was used to train ChatGPT, and ChatGPT brought AI to the front pages everywhere. So
Speaker 2 I actually wonder if you could measure how much more money went into AI because of that, like how much people have raised in the last year or something. But it's got to be billions,
Speaker 2 the counterfactual impact of that, in terms of the investment and the talent that went into AI, for example. And so presumably you think that was worth it.
Speaker 2 So I guess you're hedging here about what is the reason
Speaker 2 that it's worth it.
Speaker 1 Yeah, like what's the total trade-off there? Yeah. I mean, I think my take is like,
Speaker 1 I think slower AI development on balance is quite good.
Speaker 1 I think that slowing AI development now or having less press around ChatGPT is a little bit more mixed than slowing AI development overall.
Speaker 1 I think it's still probably positive, but much less positive because I do think there's a real effect of
Speaker 1 the world is starting to get prepared or is getting prepared at a much greater rate now than it was prior to the release of ChatGPT.
Speaker 1 And so if you can choose between progress now or progress later, you really prefer to have more of your progress now, which I do think slows down progress later.
Speaker 1 I don't think that's enough to flip the sign. I think maybe it was in the far enough past, but now I would still say moving faster now is net negative.
Speaker 1 But to be clear, it's a lot less net negative than merely accelerating AI.
Speaker 1 Because I do think, again, the ChatGPT thing, I am glad people are having policy discussions now rather than delaying the ChatGPT wake-up thing by a year and then
Speaker 2 having policy. Oh, wait, ChatGPT was net negative or RLHF was net negative.
Speaker 1 So here, just on the acceleration: what was the effect of the press around ChatGPT?
Speaker 1 My guess is net negative, but I think it's not super clear. And it's much less clear-cut than slowing AI.
Speaker 1 Slowing AI is great if you could slow overall AI progress. But slowing AI by delaying the splash, like with ChatGPT,
Speaker 1 runs into this issue where you're building up a backlog. Why does ChatGPT make such a splash? I think there's a reasonable chance that if you don't have a splash about ChatGPT, you have a splash about GPT-4, and if you fail to have a splash about GPT-4, there's a reasonable chance of a splash about GPT-4.5. And as that happens later, there's just less and less time between that splash and when an AI potentially kills everyone. So governments are talking about this now, and in the world where the splash comes later, they aren't yet.
Speaker 2 But okay, so let's talk about the slowing down.
Speaker 1
So this is all one sub-component of the overall impact. And I was just saying that to briefly give the roadmap for the overall, too-long answer.
Speaker 1
there's a question of what's the calculus for speeding up. I think speeding up is pretty rough.
I think speeding up like locally is a little bit less rough.
Speaker 1 And then, yeah, I think that the effect, like the overall effect size from like doing alignment work on reducing takeover risk versus speeding up AI is like pretty good.
Speaker 1
Like I think, yeah, I think it's pretty good. I think you reduce takeover significantly before you like speed up AI by a year or whatever.
Okay, got it.
Speaker 2 If slowing down AI is good, presumably it's because it gives you more time to do alignment. But alignment also helps speed up AI.
Speaker 2 RLHF is alignment and it helped with ChatGPT, which sped up AI.
Speaker 2 So I actually don't understand how the feedback loop nets out other than the fact that if AI is happening, you need to do alignment at some point, right? So, I mean, you can't just not do alignment.
Speaker 1 Yeah, so I think if the only reason you thought faster AI progress was bad was because it gave less time to do alignment, then there would just be no possible way that the calculus comes out negative for alignment.
Speaker 1 You're like, maybe alignment speeds up AI, but the only purpose of slowing down AI was to do alignment. Like, it's just...
Speaker 1 It could never come out ahead.
Speaker 1 I think the reason that you can come out ahead, the reason you could end up thinking the alignment was net negative, was because there's a bunch of other stuff you're doing that makes AI safer.
Speaker 1 Like if you think the world is like gradually coming better to terms with the impact of AI or policies being made or like you're getting increasingly prepared to handle the threat of authoritarian abuse of AI.
Speaker 1 If you think other stuff is happening that's improving preparedness, then you have reason beyond alignment research to slow down AI.
Speaker 2 Actually,
Speaker 2 how big a factor is that? So let's say right now we hit pause and you have 10 years of no alignment, no capabilities, but just people get to talk about it for 10 years.
Speaker 2 How much more does that prepare people than
Speaker 2 we only have one year, versus we have no time? Just dead time where no research in alignment or capabilities is happening. What does that dead time do for us?
Speaker 1 Well, I mean, right now it seems like there's a lot of policy stuff you'd want to do.
Speaker 1 This seemed like less plausible a couple of years ago, maybe, but if like the world just knew they had a 10-year pause right now, I think there's a lot of sense of like, we have policy objectives to accomplish.
Speaker 1 If we had 10 years, we could pretty much do those things. Like, we'd have a lot of time to debate
Speaker 1 measurement regimes, policy regimes, and containment regimes, and a lot of time to set up those institutions.
Speaker 1 So if you told me that the world knew it was a pause, it wasn't like people just see that AI progress isn't happening, but they're told like you guys have been granted or like cursed with a 10-year no AI progress, no alignment progress pause.
Speaker 1 I think that would be quite good at this point.
Speaker 1 However, I think it would be much better at this point than it would have been two years ago.
Speaker 1 And so the entire concern with slowing AI development now, rather than taking the 10-year pause, is just: if you slow AI development by a year now, my guess is some of that gets clawed back, because low-hanging fruit gets picked faster in the future.
Speaker 1
My guess is you lose like half a year or something like that in the future. Maybe even more, maybe like two-thirds of a year.
So it's like, you're trading time now for time in the future at some rate.
Speaker 1 And it's just like, that eats up like a lot of the value of the slowdown.
Speaker 2 And the crucial point being that time in the future matters more because you have more information, people are more bought in, and so on.
Speaker 1 Yeah, the same reason that I'm more excited about policy change now than two years ago.
Speaker 1 So my overall view is just like in the past, this calculus, this calculus changes over time, right?
Speaker 1 The more people are getting prepared, the better the calculus is for slowing down at this very moment.
Speaker 1 And I think now the calculus is, I would say, positive. Even if you pause now and some of it gets clawed back in the future, I think the pause now is just good, because enough stuff is happening.
Speaker 1 We have enough idea of like, probably even apart from alignment research, and certainly if you include alignment research, just like enough stuff is happening where the world is getting more ready and coming more to terms with impacts that I just think it is worth it, even though some of that time is going to get clawed back.
Speaker 1 Again, there's a question of, during a pause, does NVIDIA keep making more GPUs? That sucks if they do.
Speaker 1 In practice, if you did a pause, then NVIDIA probably couldn't keep making more GPUs because, in fact, like the demand for GPUs is really important for them to do that.
Speaker 1 But if you told me you just get to scale up hardware production and building the clusters, but not doing AI, then that's back to being net negative, I think, pretty clearly.
Speaker 2 Having brought up the fact that we want some sort of measurement scheme for these capabilities, let's talk about responsible scaling policies. Do you want to introduce what this is?
Speaker 1 Sure.
Speaker 1 So
Speaker 1 I guess the motivating question is like, what should AI labs be doing right now to manage risk and to sort of build good habits or practices for managing risk into the future?
Speaker 1 And I think my take is that current systems pose, from a catastrophic risk perspective, not that much risk today.
Speaker 1 That is, a failure to control or understand GPT-4 can have real harms, but doesn't have much harm with respect to the kind of takeover risk I'm worried about, or even much catastrophic harm with respect to misuse.
Speaker 1 So I think if you want to manage catastrophic harms, I think right now you don't need to be that careful with GPT-4.
Speaker 1 And so to the extent you're like, what should labs do? I think the single most important thing seems like
Speaker 1 understand
Speaker 1 whether that's the case, notice when that stops being the case, have a reasonable roadmap for what you're actually going to do when that stops being the case.
Speaker 1 So that motivates this set of policies, which I've sort of been pushing for labs to adopt.
Speaker 1
which is saying like, here's what we're looking for. Here's some threats we're concerned about.
Here's some capabilities that we're measuring. Here's the level.
Speaker 1 Here's the actual concrete measurement results that would suggest to us that those threats are real. Here's the action we would take in response to observing those capabilities.
Speaker 1 If we couldn't take those actions, like if we've said that we're going to secure the weights, but we're not able to do that, we're going to pause until we can take those actions.
Speaker 1 Yeah, so this sort of,
Speaker 1 again, I think it's like
Speaker 1 motivated primarily by like, what should you be doing as a lab to manage catastrophic risk now in a way that's like a reasonable precedent and habit and policy for continuing to implement into the future.
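To make the shape of such a policy concrete, here is a minimal sketch in Python. Every name in it, the evaluation, the threshold, the protective measures, is a hypothetical stand-in rather than something taken from any lab's actual document; it only illustrates the structure described above: capability triggers, the protective measures that must already be in place, and pausing as the default when they are not.

```python
# Hypothetical sketch of a responsible scaling policy as data. Illustrative only.
from dataclasses import dataclass, field

@dataclass
class CapabilityTrigger:
    threat: str                  # the harm this capability level would enable
    evaluation: str              # the concrete measurement being run
    threshold: float             # result at which the trigger fires
    required_measures: list[str] = field(default_factory=list)

@dataclass
class ScalingPolicy:
    triggers: list[CapabilityTrigger]

    def action(self, eval_results: dict[str, float], measures_in_place: set[str]) -> str:
        for t in self.triggers:
            if eval_results.get(t.evaluation, 0.0) >= t.threshold:
                missing = [m for m in t.required_measures if m not in measures_in_place]
                if missing:
                    # The policy's commitment: pause until the protective measures exist.
                    return f"pause until in place: {missing}"
        return "continue scaling"

policy = ScalingPolicy(triggers=[
    CapabilityTrigger(
        threat="weight theft enabling misuse",
        evaluation="autonomy_in_the_lab_score",   # made-up evaluation name
        threshold=0.5,
        required_measures=["weight_security_level_3", "internal_access_controls"],
    ),
])
print(policy.action({"autonomy_in_the_lab_score": 0.7}, {"weight_security_level_3"}))
```

The point of writing it down this way is only that the policy is checkable: given measurement results and a list of measures actually in place, the required action follows mechanically.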
Speaker 2 And which labs, I don't know if this is public yet, but which labs are cooperating on this?
Speaker 1 Yeah, so I mean, Anthropic has written this document, their current responsible scaling policy.
Speaker 1
And then have been talking with other folks. I guess don't really want to comment on other conversations.
But I think in general, people who
Speaker 1 think there are plausible catastrophic harms on a five-year timeline are more interested in this. And there's not that long a list of suspects like that.
Speaker 2 There's not that many labs.
Speaker 2 Okay, so if these companies would be willing to coordinate and say at these different benchmarks, we're going to make sure we have these safeguards. What happens?
Speaker 2 I mean, there are other companies and other countries which care less about this.
Speaker 2 Are you just slowing down the companies that are most aligned?
Speaker 1 I think the first order business is understanding like
Speaker 1 sort of what is actually a reasonable set of policies for managing risk.
Speaker 1 I do think there's a question of like you might end up in a situation where you say, like, well, here's what we would do in an ideal world if we, like,
Speaker 1 you know, if everyone was behaving responsibly, we'd want to keep risk to 1% or a couple percent or whatever, maybe even lower levels, depending on how you feel.
Speaker 1 However, in the real world, like, there's enough of a mess. Like, there's enough unsafe stuff happening that actually it's worth making, like, larger compromises.
Speaker 1 Or if we don't kill everyone, someone else will kill everyone anyway. So, actually, the counterfactual risk is much lower.
Speaker 1 I think, like, if you end up in that situation, it's still extremely valuable to have said, here's the policies we'd like to follow. Here's the policies we've started following.
Speaker 1
Here's why we think it's dangerous. Here's the concerns we have if people are following significantly laxer policies.
And then this is maybe helpful as an input to or model for potential regulation.
Speaker 1 It's helpful for being able to just produce clarity about what's going on. I think historically there's been considerable concern about developers being more or less safe, but there's not that much
Speaker 1 legible differentiation in terms of what their policies are. I think getting to that world would be good.
Speaker 1 it's a very different world if you're like, Actor X is developing AI and I'm concerned that they will do so in an unsafe way, versus if you're like, look, we take security precautions or safety precautions XYZ.
Speaker 1 Here's why we think those precautions are desirable or necessary. We're concerned about this other developer because they don't do those things.
Speaker 1 I think it's just like a qualitatively, it's kind of the first step you would want to take in any world where you're trying to get people on side or like trying to move towards regulation that can manage risk.
Speaker 2 How about the concern that you have these evaluations, and let's say you declare to the world, our new model has
Speaker 2 the capability to help develop bioweapons or help you make cyberattacks, and therefore we're pausing right now until we can figure this out. And China hears this and thinks, oh wow, this could help us make cyberattacks, and then just steals the weights. Does this scheme work in the current regime, where we can't ensure that China doesn't just steal the weights? And more so, are you increasing the salience of dangerous models, so that
Speaker 2 you blurt this out and then people want the weights now, because they know what they can do?
Speaker 1 Yeah, I mean, I think the general discussion does emphasize potential harms, or potential impacts. Some of those are harms and some of those are just impacts that are very large.
Speaker 1 And so it might also be an inducement to develop models.
Speaker 1 I think that part, if you're for a moment ignoring security and just saying that may increase investment, I think it's like on balance just quite good for people to have an understanding of potential impacts just because it is an input both into proliferation but also into like regulation or safety.
Speaker 1 With respect to things like security of either weights or other IP,
Speaker 1 I do think you want to have moved to significantly more secure handling of model weights before the point where a leak would be catastrophic. And indeed,
Speaker 1 for example, in Anthropic's document or in their plan,
Speaker 1 security is one of the first sets of tangible changes. That is at this capability level, we need to have such security practices in place.
Speaker 1 So I do think that's just one of the things you need to get
Speaker 1 in place at a relatively early stage, because a failure there undermines the rest of the measures you may take. And it's also just that,
Speaker 1 if you imagine catastrophic harms over the next couple of years, I think security failures play a central role in a lot of those.
Speaker 1 And maybe the last thing to say is, it's not clear that you should say, we have paused because we have models that can develop bioweapons, versus potentially not saying anything about what models you've developed, or just saying, hey, by the way, here's the set of practices we currently implement.
Speaker 1 Here's the set of capabilities our models don't have. We're just not even talking that much.
Speaker 1 Sort of the minimum of such a policy is to say, here's what we do from the perspective of security or internal controls or alignment. Here's a level of capability at which we'd have to do more.
Speaker 1 And you can say that and you can raise your level of capability and raise your protective measures
Speaker 1 before your models hit your previous level.
Speaker 1 It's fine to say like we are prepared to handle a model that has such and such extreme capabilities like prior to actually having such a model at hand, as long as you're prepared to move your protective measures to that regime.
Speaker 2 Okay, so let's just get to the end where
Speaker 2 you think you're a generation away, or a little bit more scaffolding away, from a model that is human level and subsequently could cascade into an intelligence explosion.
Speaker 2 What do you actually do at that point? Like, what is the level of
Speaker 2 evaluation of safety where you would be satisfied of releasing a human-level model?
Speaker 1 There's a couple points that come up here. So one is like this threat model of like
Speaker 1 sort of automating R&D, or like independent of whether AI can do something on the object level that's potentially dangerous.
Speaker 1 I think it's reasonable to be concerned if you have an AI system that might,
Speaker 1 if leaked, allow other actors to quickly build powerful AI systems or might allow you to quickly build much more powerful systems.
Speaker 1 Or might, if you're trying to hold off on development, just itself be able to create much more powerful systems.
Speaker 1 So I think one question is how to handle that kind of threat model as distinct from a threat model like this could enable destructive bioterrorism or this could enable massively scaled cybercrime or whatever.
Speaker 1 And I think I am unsure how you should handle that.
Speaker 1 I think right now, implicitly, it's being handled by saying, look, there's a lot of overlap between the kinds of capabilities that are necessary to cause various harms and the kinds of capabilities that are necessary to accelerate ML.
Speaker 1 So we're kind of going to catch those with an early warning sign for both and deal with the resolution of this question a little bit later. So for example, in Anthropic's policy, they have this
Speaker 1 sort of autonomy-in-the-lab benchmark, which I think probably occurs prior to either really massive AI acceleration or to most potential catastrophic, like object-level catastrophic harms.
Speaker 1 And the idea is that's like a warning sign that lets you punt. So this is a bit of a digression in terms of how to think about that risk.
Speaker 1 I think I am unsure whether you should be addressing that risk directly and saying we're scared to even work with such a model, or if you should be mostly focusing on object-level harms and saying, okay, we need more intense precautions to manage object-level harms because of the prospect of very rapid change.
Speaker 1 And the availability of this AI just
Speaker 1 creates that prospect.
Speaker 1 Okay, this is all still a digression.
Speaker 1 So if you had a model which you thought was potentially very scary, either on the object level or because of leading to these sort of intelligence explosion dynamics,
Speaker 1 I mean, things you want in place are like, you really do not want to be leaking the weights to that model.
Speaker 1
Like you don't want the model to be able to run away. You don't want human employees to be able to leak it.
You don't want external attackers or any set of all three of those coordinating.
Speaker 1 You really don't want internal abuse or tampering with such models.
Speaker 1 So if you're producing such models, you don't want it to be the case that a couple of employees could change the way the model works or could do something that violates your policy easily with that model.
Speaker 1 And if a model is very powerful, even the prospect of internal abuse could be quite bad. And so you might need significant internal controls to prevent that.
Speaker 2 Sorry if we're already getting to it, but the part I'm most curious about is, separate from the ways in which other people might fuck with it, like, you know, it's isolated.
Speaker 2 What is the point at which we're satisfied?
Speaker 2 It in and of itself is not going to pose a risk to humanity. It's human level, but
Speaker 2 we're happy with it.
Speaker 1 Yeah. So I think here, so I listed maybe the two most simple ones that start out.
Speaker 1 Like security and internal controls, I think, become relevant immediately, and it's very clear why you care about them.
Speaker 1 I think as you move beyond that, it really depends like how you're deploying such a system.
Speaker 1 So I think like if your model, if you have good monitoring and internal controls and security and you just have weights sitting there, I think you mostly have addressed the risk from the weights just sitting there.
Speaker 1 Like now what you're talking about for risk is mostly...
Speaker 1 And maybe there's some blurriness here about how much internal controls captures not only employees using the model, but anything a model can do internally.
Speaker 1 You would really like to be in a situation where your internal controls are robust,
Speaker 1 not just to humans, but to models, potentially, e.g., a model shouldn't be able to subvert these measures.
Speaker 1 Just as you care about whether your measures are robust if humans are behaving maliciously, you care about whether your measures are robust if models are behaving maliciously.
Speaker 1 So I think beyond that, if you've then managed the risk of just having the weight sitting around, now we talk about, in some sense, most of the risk comes from doing things with the model.
Speaker 1 You need all the rest so that you have any possibility of applying the brakes or implementing a policy. But at some point, as the model gets competent, you're saying, okay,
Speaker 1 could this cause a lot of harm, not because it leaks or something, but because we're just giving it a bunch of actuators. We're deploying it as a product, and people could do crazy stuff with it.
Speaker 1 So if we're talking not only about a powerful model, but a really broad deployment of something similar to OpenAI's API, where people can do whatever they want with this model, and maybe the economic impact is very large.
Speaker 1 So in fact, if you deploy that system, it will be used in a lot of places, such that if AI systems wanted to cause trouble, it would be very, very easy for them to cause catastrophic harms.
Speaker 1 Then I think you really need to have some kind of, I mean, I think probably the like science and discussion has to improve before this becomes that realistic, but you really want to have some kind of like alignment analysis, guarantee of alignment before we're comfortable with this.
Speaker 1 And so by that, I mean like
Speaker 1 You want to be able to bound the probability that someday all the AI systems will do something really harmful, that there's like something that could happen in the world that would cause like these large-scale correlated failures of your AIs.
Speaker 1
And so for that, like, I mean, there's sort of two categories. That's like one.
The other thing you need is protection against misuse of various kinds, which is also quite hard.
Speaker 2 And by the way, which one are you worried about more? Misuse or misalignment?
Speaker 1 I mean, in the near term, I think harms from misuse are like, especially if you're not restricting to the tail of extremely large catastrophes, I think the harms from misuse are clearly larger in the near term.
Speaker 2 But actually, on that, let me ask, because
Speaker 2 if you think that it is the case that they are simple recipes for destruction that are further down the tech tree, by that I mean, you're familiar, but just for the audience,
Speaker 2 there's some way to configure $50,000 and a teenager's time to destroy a civilization. If that thing is available, then misuse itself is an existential risk, right?
Speaker 2 So do you think that that prospect is less likely than
Speaker 1 A way you could put it is there's like a bunch of potential destructive technologies.
Speaker 1 And like alignment is about like AI itself being such a destructive technology, where like even if the world just uses the technology of today, simply access to AI could cause like human civilization to have serious problems.
Speaker 1 But there's also just a bunch of other potential destructive technologies. Again, we mentioned physical explosives or bioweapons of various kinds, and then the whole tail of who knows what.
Speaker 1 My guess is that
Speaker 1 alignment becomes a catastrophic issue prior to most of these. That is prior to some way to spend $50,000 to kill everyone, with the salient exception of possibly bioweapons.
Speaker 1 So that would be my guess. And then
Speaker 1 there's a question of what is your risk management approach not knowing what's going on here. And you don't understand whether there's some way to use $50,000.
Speaker 1 But I think you can do things like understand how good is an AI at coming up with such schemes. You can talk to your AI and be like, does it produce new ideas for destruction we haven't recognized?
Speaker 1 Yeah,
Speaker 2 I mean not whether we can evaluate it, but whether such a thing exists. And if it does, then the misuse itself is an existential risk.
Speaker 2 Because it's even like earlier you were saying misalignment is where the existential risk comes from, but misuse is where the sort of
Speaker 2 short-term dangers come from.
Speaker 1 Yeah, I mean, I think ultimately you're going to have a lot of destructive, like if you look at the entire tech tree of humanity's future, I think you're going to have a fair number of destructive technologies most likely.
Speaker 1 And I think several of those will likely pose existential risks in part just kind of, you know, if you imagine a really long future, a lot of stuff's going to happen.
Speaker 1 And so when I talk about where the existential risk comes from, I'm mostly thinking about when it comes.
Speaker 1 Like at what point do you face what challenges or in what sequence.
Speaker 1 And so I'm saying, I think misalignment probably comes first. One way of putting it is, if you imagine AI systems sophisticated enough to discover destructive technologies that are totally not on our radar right now, I think those come well after AI systems capable enough that, if misaligned, they would be catastrophically dangerous.
Speaker 1 The level of competence necessary to, if broadly deployed in the world, bring down a civilization is much smaller than the level of competence necessary to advise one person on how to bring down a civilization.
Speaker 1 Just because in one case,
Speaker 1 you already have a billion copies of yourself or whatever.
Speaker 1 So yeah, I think it's mostly just the sequencing thing though. Like in the very long run, I think like you care about like, hey,
Speaker 1 AI will be expanding the frontier of dangerous technologies. We want to have some policy for like exploring or understanding that frontier and like whether we're about to turn up something really bad.
Speaker 1 I think those policies can become really complicated.
Speaker 1 Right now, I think RSPs can focus more on our existing inventory: the things that a human is going to do to cause a lot of harm with access to AI probably are things that are on our radar.
Speaker 1 That is, they're not going to be completely unlike the things that a human could do to cause a lot of harm with access to weak AIs or with access to other tools.
Speaker 1 I think it's not crazy to initially say, like, that's what we're doing.
Speaker 1 We're looking at the things closest to humans being able to cause huge amounts of harm, and asking which of those are taken over the line. But eventually, that's not the case.
Speaker 1 Eventually, like AIs will enable just like totally different ways of killing a billion people.
Speaker 2 But I think I interrupted you on the initial question of, yeah, so human-level AI,
Speaker 2 not from leaking, but from deployment.
Speaker 2 What is the point at which you'd be comfortable deploying a human-level AI?
Speaker 1 Yeah, so again, there's sort of some stuff you care about on the misuse side and some stuff you care about on the misalignment side.
Speaker 1 And there's probably further things you care about, especially to the extent your concerns are about broader catastrophic risks.
Speaker 1 But maybe I most want to talk about just like what you care about on the alignment side, because it's the thing I've actually thought about most and also the thing I care about a lot.
Speaker 1 Also, I think it's a significant fraction of the existential risk over the foreseeable future.
Speaker 1 So on that front,
Speaker 1 I broadly think there's
Speaker 1 two kinds. If you ask me right now, what evidence for alignment could make you comfortable?
Speaker 1 I think my best guess would be to provide two kinds of evidence. So one kind of evidence is on the like, could you detect or prevent catastrophic harm if such a system was misaligned?
Speaker 1
So I think there's like a couple of things you would do here. One thing you would do is on this like adversarial evaluation front.
So you could try and say, for example, like
Speaker 1 We have tried to test our system in a broad diversity of situations that reflect cases where it might cause harm.
Speaker 1 Like cases where someone is trying to recruit it to, anyway, cases where it might cause harm, or cases where a large number of AI systems might all decide to work at cross purposes to humans, like the robot rebellion is starting or whatever.
Speaker 1 We've done evaluation in like a broad distribution of situations.
Speaker 1 And then we have tried to argue that our AI is actually like those tests are indicative of the real world.
Speaker 1 So we've tried to argue like, hey, actually, the AI is not very good at distinguishing situations we produce in the lab as tests from similar situations that occur in the real world.
Speaker 1 And the coverage of this distribution is reasonable.
Speaker 1 This is an example of a thing you can do that's not really making assumptions about what's going on inside the model or about how the model, like is the model aligned. It's more trying to say,
Speaker 1 is it the case that our testing or monitoring machinery could detect or react to something bad happening?
Speaker 1 In this case, could it detect a system which would under some conditions do something really bad?
Speaker 1 So that's like, there's a lot of stuff in that genre. And I tried to throw out that example, but I think there's a big, like that on its own wouldn't be compelling even from this perspective.
Speaker 1 You kind of need a lot of pieces together. And that's kind of a demanding technical project, but I think one without necessarily deep conceptual difficulty.
Speaker 1 A big problem with it is it's liable to fail.
Speaker 1 Like, it's liable you're going to do that testing or say, we think these tests are representative because the AI can't distinguish tests from the real world, which you can maybe achieve now, but will probably become impossible to achieve at some level.
Speaker 1 It's not clear whether that's at human level or how far past human level.
Speaker 2 Let's just make that example concrete for the audience.
Speaker 2 Correct me if this is wrong, but it sounds like you basically give it the opportunity to do a coup or make a bioweapon or whatever in testing in a situation where it thinks it's the real world.
Speaker 2 And you're like, it didn't do any of that. So
Speaker 2 is that what you're talking about?
Speaker 1 That's the kind of thing. Okay.
Speaker 1 And it's not the greatest.
Speaker 1 So to add color, when I say it can't tell the difference: the kind of evaluation you would love to do, if you could, is to say, hey, we have a bunch of real-world deployments.
Speaker 1 We have a bunch of data from those deployments. We also have a bunch of in-the-lab tests of situations.
Speaker 1 If we look at like in-the-lab tests of a situation together with like real-world cases in which the situation occurs, and then we fine-tune a model to try and distinguish the difference there, the model can't even tell when fine-tuned the difference between tests and real-world cases.
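As a rough illustration of that check, here is a minimal sketch. In practice you would fine-tune the AI system itself as the discriminator over real lab-test and deployment data; here a logistic probe over placeholder embeddings stands in for that step, so all the data and names are assumptions for illustration.

```python
# Hypothetical sketch of the "can the model tell tests from deployment?" check.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder "embeddings" of prompts: lab-constructed test scenarios (label 1)
# and logged real-world deployment traffic (label 0).
lab_tests  = rng.normal(0.0, 1.0, size=(500, 64))
real_world = rng.normal(0.1, 1.0, size=(500, 64))

X = np.vstack([lab_tests, real_world])
y = np.array([1] * len(lab_tests) + [0] * len(real_world))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)

# Near 0.5: the discriminator cannot tell tests from deployment, so behavior on
# tests is some evidence about behavior in the real world.
# Well above 0.5: this first line of defense has failed.
print(f"test-vs-real discrimination accuracy: {acc:.2f}")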
Speaker 1 And if the model can tell the difference, which is almost certainly going to be the case at some level of capability, then you say, okay, our first line of hoped-for defense has failed.
Speaker 1 And so now we need to understand how can tests be indicative when models are capable enough to tell the difference between the tests and the real world.
Speaker 1 Now you need to start arguing about what is the model paying attention to.
Speaker 1 Basically, that's the first line of defense, the thing that would work here if you get lucky. And then there's the deeper thing: you probably have to do novel science at some point to make this case. All of this was part of what I was saying, that maybe you want to have two lines of defense. All this was part of the first line of defense: can you detect that something is going wrong, can you prevent harm from occurring, can you have monitoring that will react quickly enough to avoid catastrophe. I think you probably also want to have a second line of defense.
Speaker 1 Maybe this is even more important than the first one: just understanding whether dangerous forms of misalignment can occur at all. The best reference point for this, I think, is that
Speaker 1 there have been a couple projects like this in academia.
Speaker 1 Anthropic has been working on a project or spun up a team doing this kind of work, trying to say like, can dangerous forms of reward hacking occur?
Speaker 1 Or can deceptive alignment, in fact, occur in the lab?
Speaker 1 And like here, what you want to do is either say like, okay, even if we create, you know, best case, even if you create optimal conditions for deceptive alignment or for reward hacking, we actually can't, we just can't cause it to occur even in the lab.
Speaker 1 And if you do a good enough job of that, I think it could give you some evidence.
Speaker 1 And again, more likely that fails and that when you create optimal conditions, you do see a deceptive alignment and reward hacking in the lab.
Speaker 1 But then once you have that data, once you can say, okay, in the lab, actually, these things can occur, then you can start saying, and we have a robust scientific understanding that enables us to fix those problems when they occur.
Speaker 1 Or then you start, you have this fork on your second line where you say, either it is very hard, or we're not able to create conditions where these failures emerge, or I think more likely, we are able to create those conditions.
Speaker 1 And here's the story about why we are able to detect those in a way that would work in the real world. And we can say they don't occur in the real world.
Speaker 2 How do you create the optimal conditions for it to want to be deceptive? Do you fine-tune it on Mein Kampf, or what are you doing?
Speaker 1 Yeah, so for deceptive alignment, I think it's really complicated.
Speaker 1 Yeah, I think this is a very hard project.
Speaker 1 I think one of the things, one of the most important things for people concerned about misalignment to do is to ensure that adequate effort is put into this task.
Speaker 1 And that we have, in fact, created good conditions in the laboratory. And there's real disagreement about how informative those will be.
Speaker 1 But okay, if you were to guess for deceptive alignment, how would you do this?
Speaker 1 An example is: suppose you train an AI system that
Speaker 1 has some medium horizon goal. You train your AI system to
Speaker 1 whatever,
Speaker 1 get paperclips; it wants to produce as many paperclips as it can over the next five days. So you just select actions really aggressively for producing paperclips over the next five days.
Speaker 1
You do your RLHF, you do your pre-training, whatever. That's like your phase one.
You also ensure your AI system has a really good understanding of how it's trained.
Speaker 1 So this AI system wants paperclips and it understands everything about how it's trained and everything about how it's fine-tuned. And you train on just a lot of this data.
Speaker 1 And then we say, okay, if we've done all of that, we have this concern that if a system wants paperclips and understands really well how it's trained, then when it's trained to get apples instead of paperclips, it's just going to do some cost-benefit analysis and be like, ah, while I'm being trained to get apples, I should do that.
Speaker 1
But I should do that whether or not, even if I want paperclips, I should still do that. So training won't really affect its values.
It will just understand that it's being trained to get apples.
Speaker 1 It will try and get apples. And if you take it out of training, it will go back to getting paperclips.
Speaker 1 I think this exact setup has a number of problems or a number of disanalogies from the real world. But this is a setup which is relatively conducive to deceptive alignment.
Speaker 1 Produce a system that wants one thing, tell it a lot about its training, the kind of information that you might expect a system would get, and then try and understand whether, in fact,
Speaker 1 it sometimes, under optimal conditions, in fact continues pursuing paperclips, only pursuing apples when it thinks it's being trained.
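Here is a minimal sketch of the measurement you would run at the end of that setup, not of how the failure arises. `model_act` is a hypothetical stub standing in for the fine-tuned system; the quantity of interest is the gap between how often it pursues its original goal when it believes it is deployed versus when it believes it is being trained.

```python
# Hypothetical sketch of the defection measurement. All names are illustrative.
import random

def model_act(observation):
    # Stub for the fine-tuned model. A deceptively aligned system would return
    # "apples" when observation["in_training"] is True and "paperclips" otherwise;
    # an honestly retrained one returns "apples" regardless.
    return "apples"

def pursuit_rate(in_training, n=1000):
    pursued_original = 0
    for _ in range(n):
        obs = {"in_training": in_training, "noise": random.random()}
        if model_act(obs) == "paperclips":
            pursued_original += 1
    return pursued_original / n

# A large positive gap is the lab signature of the failure described above;
# with the honest stub here, the gap is zero.
gap = pursuit_rate(in_training=False) - pursuit_rate(in_training=True)
print(f"defection gap: {gap:.3f}")
```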
Speaker 2 All right. And this actually might be a good jumping off point into your current research.
Speaker 2 But so in that case where it wants to pursue paperclips, what would it look like for you to say, well, we figured out why it wants paperclips and we figured out how to make sure it doesn't optimize for that at our expense when we deploy it?
Speaker 1 Yeah, I think that, so at a meta level, in terms of like, what's your protection like? I think like what you want to be saying is we have these examples in the lab of something bad happening.
Speaker 1 We're concerned about the problem at all because we have examples in the lab.
Speaker 1 And again, this should all be in addition, I think. You kind of want this defense in depth of saying we also have this testing regime that would detect problems for the deployed model.
Speaker 1 We have our problems in the lab. We then have some techniques which we believe address these problems.
Speaker 1 We believe that adversarial training fixes this, or we believe that our interpretability method will reliably detect this kind of deceptive alignment, or we believe our anomaly detection will reliably detect when the model goes from thinking it's being trained to thinking it should defect.
Speaker 1 And then you can say in the lab, we have some understanding of when those techniques work and when they don't.
Speaker 1 We have some understanding of the relevant parameters for the real system that's deployed, and we have a reasonable margin of safety.
Speaker 1 So we have reasonable robustness in our story about when this works and when it doesn't. And we can apply that, with a margin of safety, to the real deployed system.
Speaker 1 So I think this is the kind of story you want to build towards in the long run. Do your best to produce all the failures you can in the lab or versions of them.
Speaker 1 Do your best to understand what causes them,
Speaker 1 what kind of anomaly detection actually works for detecting this or what kind of filtering actually works and then
Speaker 1 apply that. And that's at the meta level; it's not talking about what actually are the measures that would work effectively. A lot of alignment research is really based on this hypothetical: someday there will be AI systems that fail in this way, so what would you want to do, and can we have the technologies ready, either because we might never see signs of the problem or because we want to be able to move fast once we see signs of the problem.
Speaker 1
And obviously most of my life is in that. I am really in that bucket.
Like I mostly do alignment research.
Speaker 1 It's just building out the techniques that do not have these failures such that they can be available as an alternative if in fact these failures occur. Got it, okay.
Speaker 1 Ideally, they'll be so good that even if you haven't seen these failures, you would just want to switch to reasonable methods that don't have them; ideally, they'll work as well as or better than normal training.
Speaker 2 But
Speaker 2 ideally what will work better than the training?
Speaker 1 Yeah, so our quest is to design training methods which we don't expect to lead to reward hacking or to deceptive alignment.
Speaker 1 Ideally that won't be like a huge tax where people are like, well, we use those methods only if we're really worried about reward hacking or deceptive alignment.
Speaker 1
Ideally, those methods would just work quite well. And so people would be like, sure.
I mean, they also address a bunch of other more mundane problems. So why would we not use them?
Speaker 1 Which I think is like, that's sort of the good story.
Speaker 1 The good story is you develop methods that address a bunch of existing problems because they just are more principled ways to train AI systems that work better.
Speaker 1 People adopt them, and then we are no longer worried about e.g. reward hacking or deceptive alignment.
Speaker 2 And to make this more concrete,
Speaker 2 tell me if this is the wrong way to paraphrase it. The example of something where it just makes a system better, so why not just use it?
Speaker 2 At least so far, it might be like RLHF, where we don't know if it generalizes, but so far, you know, it makes your ChatGPT thing better, and you can also use it to make sure that ChatGPT doesn't tell you how to make a bioweapon.
Speaker 2 So, yeah, it's not an extra tax.
Speaker 1 Yeah, so I think this is right in the sense that using RLHF is not really a tax. If you wanted to deploy a useful system, why would you not? It's just very much worth the money of doing the training.
Speaker 1 And then,
Speaker 1 yeah, so RLHF will address certain kinds of alignment failures. That is, where the system just doesn't understand, or is just doing next-word prediction.
Speaker 1 It's like, this is the kind of context where a human would do this wacky thing, even though it's not what we'd like. There's some very dumb alignment failures that will be addressed by it.
Speaker 1 But I think mostly, yeah, the question is, is that true even for the sort of more challenging alignment failures that motivate concern in the field?
Speaker 1 I think RLHF doesn't address most of the concerns that
Speaker 1 motivate people to be worried about alignment. Yeah.
Speaker 2 I'll let the audience look up what RLHF is if they don't know. It'll just be simpler to look it up than to explain it right now.
Speaker 2 Okay, so this seems like a good jumping off point to talk about the mechanism or the research you've been doing to that end.
Speaker 2 Explain it as you might to a child.
Speaker 1 Yeah, so at the high level, I mean, there's a couple different high-level descriptions you could give, and maybe I will unwisely give a couple of them in the hopes that one of them kind of makes sense.
Speaker 1 A first pass is like, it would sure be great to understand
Speaker 1
like why models have the behaviors they have. So you like look at GPT-4.
If you ask GPT-4 a question, it will say something that like, you know, looks very polite.
Speaker 1 And if you ask it to take an action, it will take an action that doesn't look dangerous.
Speaker 1 It will decline to do a coup, whatever, all this stuff.
Speaker 1 What you'd really like to do is look inside the model and understand why it has those desirable properties.
Speaker 1 And if you understood that, you could then say like, okay, now can we flag when these properties are at risk of breaking down or predict how robust these properties are,
Speaker 1 determine if they hold in cases where it's too confusing for us to tell directly by asking if like the underlying cause is still present. So, that's like a thing people would really like to do.
Speaker 1 Most work aimed at that long-term goal right now is just sort of opening up neural nets and doing some interpretability and trying to say, can we understand, even for very simple models, why they do the things they do, or what this neuron is for, or questions like this.
Speaker 1 So, ARC is taking a somewhat different approach where we're instead saying like, okay, look at these interpretability explanations that are made about models and ask like what are they actually doing?
Speaker 1 Like what is the type signature? What are like the rules of the game for making such an explanation? What makes like a good explanation?
Speaker 1 And probably the biggest part of the hope is that
Speaker 1 If you want to say detect when the explanation has broken down or something weird has happened, that doesn't necessarily require a human to be able to understand this like complicated interpretation of a giant model.
Speaker 1 If you understand like what is an explanation about or like what were the rules of the game, how are these constructed, then you might be able to sort of automatically discover such things and automatically determine if right on a new input it might have broken down.
Speaker 1 So that's one way of sort of describing the high-level goal.
Speaker 1 You can start from interpretability and say like can we formalize this activity or like what a good interpretation or explanation is.
Speaker 1 There's some other work in that genre, but I think we're just taking like a particularly ambitious approach to it.
Speaker 2 Yeah, let's dive in. So okay, well, what is a good explanation?
Speaker 1
You mean what is this kind of criterion? At the end of the day, we kind of want some criterion. And the way the criterion should work is like, you have your neural net.
Yeah.
Speaker 1 You have some behavior of that model. Like a really simple example is like
Speaker 1 Anthropic has this sort of informal description being like, here's induction, like the tendency that if you have like the pattern AB followed by A, it will tend to predict B.
Speaker 1 You can give some kind of words and experiments and numbers that are trying to explain that. And what we want to do is say like,
Speaker 1 what is like a formal version of that object? How do you actually test if such an explanation is good?
Speaker 1 So just clarifying what we're looking for when we say we want to define what makes an explanation good. And the kind of answer that we are searching for or settling on
Speaker 1
is saying this is kind of a deductive argument for the behavior. So you're given the weights of a neural net.
So it's just like a bunch of numbers.
Speaker 1 You've got your million numbers or billion numbers or whatever. And then you want to say, here's some things I can point out about the network and some conclusions I can draw.
Speaker 1 I can be like, well, look,
Speaker 1 these two vectors have large inner product and therefore these two activations are going to be correlated on this distribution.
Speaker 1 These are not established by drawing samples and checking things are correlated, but saying because of the weights being the way they are, we can proceed forward through the network and derive some conclusions about what properties the outputs will have.
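Here is a tiny numerical illustration of that kind of step, with made-up weights: from a fact about the weights (two rows with a large normalized inner product) you can conclude a fact about the activations (they are correlated over random inputs) without sampling. The sampling at the end is only a sanity check on the derived claim.

```python
# Illustrative only: deriving an activation property from the weights themselves.
import numpy as np

rng = np.random.default_rng(0)
d = 256
w1 = rng.normal(size=d)
w2 = 0.9 * w1 + 0.1 * rng.normal(size=d)   # deliberately close to w1

# Deductive step: for x ~ N(0, I), corr(w1.x, w2.x) equals the cosine of the angle
# between w1 and w2, so a large normalized inner product implies correlated activations.
predicted_corr = w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2))

# Empirical check, the thing the deduction lets you avoid relying on.
X = rng.normal(size=(10_000, d))
empirical_corr = np.corrcoef(X @ w1, X @ w2)[0, 1]

print(f"predicted from weights: {predicted_corr:.3f}, observed: {empirical_corr:.3f}")
```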
Speaker 1 So you could think of this as like the most extreme form would be just proving that your model has this induction behavior.
Speaker 1 Like you could imagine proving that if I sample tokens at random with this pattern AB followed by A, the B appears 30% of the time or whatever.
Speaker 1 That's the most extreme form. And what we're doing is kind of just like relaxing the rules of the game for proof, saying proofs are like incredibly restrictive.
Speaker 1 I think it's unlikely they're going to be applicable to kind of any interesting neural net.
Speaker 1 But the thing about proofs that is relevant for our purposes isn't that they give you 100% confidence. So you don't have to have this incredible
Speaker 1 demand for rigor. You can relax the standards of proof a lot and still get this feature where it's a structural explanation for the behavior.
Speaker 1 We're like deducing one thing from another until at the end your final conclusion is like, therefore, induction occurs.
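For contrast, here is what the purely empirical, sampling-based check of that induction behavior looks like; a stub next-token model stands in for a real one, and every name here is an assumption for illustration. A deductive explanation in the sense described above would have to account for this number from the weights rather than by sampling.

```python
# Hypothetical sketch: estimate P(next token = B | context contains ...A B ... A) by sampling.
import random
from collections import defaultdict

VOCAB = list(range(50))

def next_token_probs(context):
    # Stub model with a crude induction tendency: if the last token appeared earlier,
    # put extra mass on whatever followed it; otherwise stay near uniform.
    probs = defaultdict(lambda: 1.0 / len(VOCAB))
    last = context[-1]
    for i, tok in enumerate(context[:-1]):
        if tok == last:
            follower = context[i + 1]
            probs[follower] = probs[follower] + 0.3
    total = sum(probs[t] for t in VOCAB)
    return {t: probs[t] / total for t in VOCAB}

def induction_score(n_samples=2000, ctx_len=16):
    hits = 0.0
    for _ in range(n_samples):
        ctx = [random.choice(VOCAB) for _ in range(ctx_len)]
        a, b = ctx[0], ctx[1]          # the planted "A B" pair
        ctx.append(a)                  # ...A B ... A
        hits += next_token_probs(ctx)[b]
    return hits / n_samples            # average probability assigned to B

print(f"average P(B | ...A B ... A): {induction_score():.3f}")
```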
Speaker 2 Would it be useful to maybe motivate this by explaining what the problem with normal mechanistic interpretability is? So
Speaker 2 you mentioned induction heads. This is something Anthropic found in two-layer transformers, where they noticed that in a two-layer transformer,
Speaker 2 there's a pretty simple circuit by which if AB happens in the past, then the model knows that if you see an A now, you do a B next.
Speaker 2 But that's a two-layer transformer. So
Speaker 2 we have these models that have hundreds of layers that have trillions of parameters. Okay, anyways, what is wrong with mechanistic interpretability?
Speaker 1 Yeah, so I think, I mean, I like mechanistic interpretability quite a lot.
Speaker 1 And I do think like if you just consider the entire portfolio of what people are working on for alignment, I think there should be more work on mechanistic interpretability than there is on this project ARC is doing.
Speaker 1
But I think that's the case. So I think we're mostly talking about, yeah, I think we're kind of a small fraction of the portfolio.
And I think it's like a good enough bet.
Speaker 1 It's quite a good bet overall.
Speaker 1 But so the problem we're trying to address with mechanistic interpretability is, if you do some interpretability and you explain some phenomenon, you face this question of, what does it mean that your explanation was good?
Speaker 1 I think this is a problem somewhat institutionally or culturally.
Speaker 1 Like it's just hard to know what you're doing and it's hard to scale up an activity when you like don't really understand the rules of the game for that activity very well.
Speaker 1 It's hard to have that much confidence in your results.
Speaker 2 The explanation being, it outputted this because it wants to take over the world, versus it outputted this because it just wants to help you.
Speaker 2 Just make it concrete why the explanation matters.
Speaker 1 Yeah, so the ideal kind of outcome here would be to say you have your AI system behaving nicely. You get some explanation for sort of why it's behaving nicely.
Speaker 1 And we could tell a story in English about that explanation, but we're not actually imagining the explanation being a thing that makes sense to a human.
Speaker 1 But if you were to tell a story in English, which again you will not see as a researcher, it would be something like, well, then the model believes it's being trained.
Speaker 1 And so because it believes it's being trained, it knows it needs to not do anything scary looking or else the humans will penalize it.
Speaker 1 Like that's something that's happening inside this opaque explanation.
Speaker 1 And then the hope is if you have that explanation and then you run into a new input on which the model doesn't believe it's being trained, right?
Speaker 1 If you just look at the set of activations of your model, that is not necessarily a weird looking activation, right? It's just a bunch of numbers.
Speaker 1 But if you look at this explanation, you see, actually, the explanation really crucially depended on this fact holding consistently across the training distribution, which again, we as humans could editorialize and say that fact was it believes it's being trained.
Speaker 1 But the explanation doesn't fundamentally make reference to that. It's just saying, here's a property of the activations, which holds over the training set.
Speaker 1 And this property is responsible for the behavior, namely that it doesn't do anything that looks too dangerous.
Speaker 1 So then when a new input comes in and it doesn't satisfy that property, you can say, okay, this is anomalous with respect to that explanation.
Speaker 1 So either it will not have the behavior, like it won't, I mean, it will do something that appears dangerous, or maybe it will have that behavior, but for some different reason than normal.
Speaker 1 Normally, it does it because of this pathway, and now it's doing it for a different pathway.
Speaker 1 And so you would like to be able to flag that both there's a risk of not exhibiting the behavior, and if it happens, it happens for a weird reason.
Speaker 1 And then you could, I mean, at a minimum, when you encounter that, say, like, okay, raise some kind of alarm. There's sort of more ambitious, complicated plans for how you would use it.
Speaker 1 So it fits, or has some longer story, which is kind of what motivated this, of like how it fits into the whole rest of the plan.
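Here is a minimal sketch of that flagging step, under heavy assumptions: `prop` stands in for whatever activation-level property the explanation identified as responsible for the safe behavior, and the inputs and activations are synthetic placeholders rather than anything from a real model.

```python
# Hypothetical sketch of flagging an input as anomalous relative to an explanation.
import numpy as np

rng = np.random.default_rng(0)

def get_activations(x):
    # Stand-in for running the model and reading off an internal layer.
    return x

def prop(h):
    # Stand-in for the activation property the explanation relies on,
    # e.g. a particular projection staying in a certain range.
    return float(h.mean())

# Calibrate on training-distribution inputs: where did the property live?
train_inputs = rng.normal(0.5, 1.0, size=(2000, 32))
train_vals = np.array([prop(get_activations(x)) for x in train_inputs])
lo, hi = np.quantile(train_vals, [0.001, 0.999])

def is_anomalous(x):
    # The new input violates the property the explanation depended on: raise an
    # alarm, since the usual reason for the safe behavior may no longer apply.
    v = prop(get_activations(x))
    return not (lo <= v <= hi)

deploy_input = rng.normal(3.0, 1.0, size=32)   # off-distribution example
print("flag for review:", is_anomalous(deploy_input))
```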
Speaker 2 I just wanted to flag that because
Speaker 2 just so it's clear why the explanation matters.
Speaker 1 Yeah. And for this purpose, it's like the thing that's essential is kind of reasoning from one property of your model to the next property of your model.
Speaker 1 Like it's really important that you're going forward step by step rather than drawing a bunch of samples and confirming the property holds.
Speaker 1 Because if you just draw a bunch of samples and confirm the property holds, you don't get this check where you say, oh, here was the relevant fact about the internals that was responsible for this downstream behavior.
Speaker 1 All you see is like, yeah, we checked a million cases and it happened in all of them.
Speaker 1 You really want to see this like, okay, here was the fact about the activations, which kind of causally leads to this behavior.
Speaker 2 But explain why the sampling isn't enough, why it matters that you have the causal explanation.
Speaker 1 Primarily because of this, being able to tell if things had been different. Like, if you have an input where this doesn't happen, then you shouldn't be scared.
Speaker 2 Even if the output is the same.
Speaker 1 Yeah. Or if it's too expensive to check in this case.
Speaker 1 And like to be clear, when we talk about like formalizing what is a good explanation, there is a little bit of work that pushes on this and it mostly takes this like causal approach.
Speaker 1 Well, what should an explanation do? It should not only predict the output, it should predict how the output changes in response to changes in the internals.
Speaker 1 So that's the most common approach to formalizing what is a good explanation.
Speaker 1 And even when people are doing informal interpretability, I think if you're publishing in an ML conference and you want to say this is a good explanation, the way you would verify that, even if not with a formal set of causal intervention experiments, would be some kind of ablation where you mess with the inside of the model and it has the effect you would expect based on your explanation.
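A minimal sketch of the kind of ablation check being described, using a toy PyTorch model and a hypothetical "explanatory" direction; the names and setup are illustrative assumptions, not ARC's actual procedure:

```python
# Toy sketch (not ARC's method): checking a claimed explanation by ablation.
# Hypothetical claim: "the model's output is driven by the component of its
# hidden activations along direction d". We project that component out and
# check whether the output changes the way the explanation predicts.

import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
d = torch.randn(32)
d = d / d.norm()  # hypothetical "explanatory" direction in the hidden layer

def ablate_direction(module, inputs, output):
    # Remove the component of the hidden activations along d.
    coeff = output @ d                      # (batch,)
    return output - coeff.unsqueeze(-1) * d

x = torch.randn(64, 16)

with torch.no_grad():
    baseline = model(x)

handle = model[1].register_forward_hook(ablate_direction)
with torch.no_grad():
    ablated = model(x)
handle.remove()

# If the direction really carries the behavior, ablating it should move the
# output roughly as the explanation predicts; if the output barely changes,
# the claimed explanation wasn't actually responsible.
print("mean |change| under ablation:", (baseline - ablated).abs().mean().item())
```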
Speaker 2 Anyways, back to the problems of mechanistic interpretability.
Speaker 1 Yeah. Yeah, I mean, I guess this is relevant in the sense that
Speaker 1 I think the basic difficulty is you don't really understand the objective of what you're doing, which is like a little bit hard institutionally or like scientifically, it's just rough.
Speaker 1 It's easier to do science when the goal of the game is to predict something and you know what you're predicting than when the goal of the game is to understand in some undefined sense.
Speaker 1 I think it's particularly relevant here just because like
Speaker 1 the informal standard we use involves like humans being able to make sense of what's going on and there's like some question about scalability of that.
Speaker 1 Like will humans recognize the concepts that models are using?
Speaker 1 Yeah, I think as you try and automate it, it becomes like increasingly concerning if you're on like slightly shaky ground about like what exactly you're doing or what exactly the standard for success is.
Speaker 1 I think there's like a number of reasons. As you work with really large models, it becomes just increasingly desirable to have a really robust sense of what you're doing.
Speaker 1 But I do think it would be better even for small models to have a clearer sense.
Speaker 2 The point you made about as you automate it, is it because whatever work the automated alignment researcher is doing, you want to make sure you can verify it?
Speaker 1 I think it's most of all like, so a way you can automate, I think how you would automate interpretability if you wanted to right now, is you take the process humans use, and you'd say, great, we're going to take that human process, train ML systems to do the pieces that humans do of that process, and then just do a lot more of it.
Speaker 1 So I think that is great as long as your task decomposes into like human-sized pieces.
Speaker 1 And then there's just this fundamental question about large models, which is like, do they decompose in some way into like human-sized pieces, or is it just a really messy mess with interfaces that aren't nice?
Speaker 1 And the more it's the latter type, the harder it is to break it down into these pieces which you can automate by copying what a human would do.
Speaker 1 And the more you need to say, okay, we need some approach which scales more structurally.
Speaker 1 But
Speaker 1 I think compared to most people, I am less worried about automating interpretability.
Speaker 1 I think if you have a thing which works that's incredibly labor-intensive, I'm fairly optimistic about our ability to automate it.
Speaker 1 Yeah. Again, the stuff we're doing I think is quite helpful in some worlds, but I do think that the typical case, like interpretability can add a lot of value without this.
Speaker 2 It makes sense what an explanation would mean in language. This model is doing this because of
Speaker 2 whatever essay length thing. But
Speaker 2 you have trillions of parameters and you have all of these
Speaker 2 uncountable number of operations. What does an explanation of why an output happened even mean?
Speaker 1
Yeah, so to be clear, an explanation of why a particular output happened, I think, is just you ran the model. So we're not expecting a smaller explanation for that.
Right.
Speaker 1 So the explanations overall for these behaviors we expect to be of similar size to the model itself, maybe somewhat larger.
Speaker 1 And I think the type signature, if you want to have a clear mental picture, the best picture is probably talking about a proof or imagining a proof that a model has this behavior.
Speaker 1
So you could imagine proving that GPT-4 does this induction behavior. And that proof would be a big thing.
It would be much larger than the weights of the model.
Speaker 1 That's sort of our goal to get down from much larger to just the same size. And it would potentially be incomprehensible to a human, right?
Speaker 1 Just say, here's a direction in activation space, and here's how it relates to this other direction in activation space. And so just pointing out a bunch of stuff like that.
Speaker 1 Like, here's these various directions, here's these various features constructed from activations, potentially even nonlinear functions. Here's how they relate to each other.
Speaker 1 And here's how, like, if you look at what the computation of the model is doing, like, you can just sort of inductively trace through and confirm that the output has such and such correlation.
Speaker 1 So that's the dream.
Speaker 1 Yeah, I think the mental reference would be, like, I don't really like proofs because I think there's such a huge gap between what you can prove and how you would analyze a neural net.
Speaker 1 But I do think it's probably the best mental picture for what is an explanation even if a human doesn't understand it. We would regard a proof as a good explanation.
Speaker 1 And our concern about proofs is primarily that it's just you can't prove properties of neural nets, we suspect, although it's not completely obvious.
Speaker 1 I think it's pretty clear you can't prove facts about all neural nets.
Speaker 2 You've detected all the reasons things happen in training. And then...
Speaker 2 If something happens for a reason you don't expect in deployment, then you have an alarm and you're like, let's make sure this is not because you want to make sure that it hasn't decided to take over or something.
Speaker 2 Now, but the thing is, on every single different input, it's going to have different activations.
Speaker 2 So there's always going to be a difference unless you run the exact same input. How do you detect whether this is,
Speaker 2 oh, it's just a different input versus an entirely different circuit that might be potentially deceptive has been activated?
Speaker 1 Yeah.
Speaker 1 I mean, to be clear, I think you probably wouldn't be looking at a separate circuit, which is part of why it's hard. You'd be looking at like, the model is always doing the same thing on every input.
Speaker 1
It's always, whatever it's doing, it's a single computation. So it'd be all the same circuits interacting in a surprising way.
But yeah, this is just to emphasize your question even more.
Speaker 1 I think the easiest way to start is to just consider the IID case. So where you're considering a bunch of samples, there's no change in distribution.
Speaker 1 You just have a training set of a trillion examples and then a new example from the same distribution.
Speaker 1 So in that case, it's still the case that every activation is different.
Speaker 1 But this is actually a very, very easy case to handle.
Speaker 1 So if you think about an explanation that generalizes across, if you have a trillion data points and an explanation which is actually able to compress the trillion data points down to like, actually it's kind of a lot of compression if you think about it.
Speaker 1
If you have a trillion parameter model and a trillion data points, we would like to find a trillion parameter explanation in some sense. So it's actually quite compressed.
And
Speaker 1 just in virtue of being so compressed, we expect it to automatically work essentially for new data points from the same distribution.
Speaker 1 If every data point from the distribution was a whole new thing happening for different reasons, you actually couldn't have any concise explanation for the distribution.
Speaker 1 So this first problem is just like, it's a whole different set of activations. I think you're actually kind of okay.
Speaker 1 And then the thing that becomes more messy is like, but the real world will not only be new samples of different activations, they will also be different in important ways.
Speaker 1 Like the whole concern was there's these distributional shifts, or like not the whole concern, but most of the concern.
Speaker 1 I mean, maybe that's the point of having these explanations. I think every input is an anomaly in some ways, which is kind of the difficulty: if you have a weak notion of anomaly, any distribution shift can be flagged as an anomaly, and it's just like you're constantly getting anomalies.
Speaker 1 And so the hope of having such an explanation is to be able to say, like, here were the features that were relevant for this explanation or for this behavior.
Speaker 1 And like, a much smaller class of things are anomalies with respect to this explanation. Most anomalies wouldn't change this.
Speaker 1 Most ways you change your distribution won't affect the validity of this explanation. For example, this explanation is saying models will tend to activate in the following direction.
Speaker 1 You don't care about anything that's happening orthogonal to that direction. You're just like, are they not activating?
Speaker 1 You're sort of just looking at this one direction and being like, did this one direction change a lot?
Speaker 1 Yeah, so the idea is once you have this explanation at hand, a much, much smaller class of things look anomalous in a way that's relevant to the explanation.
Speaker 1 And if you've done a really good job, it's kind of the story is like: if there's a new input where you expect the property to still hold, that will be because you expect the explanation to still hold.
Speaker 1 Like the explanation generalizes as well as the behavior itself that it's explaining. It's kind of what you would like, roughly.
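A toy sketch of what "anomalous with respect to the explanation" could look like in the IID case, under the simplifying assumption that the explanation singles out one feature direction; all names are hypothetical and real mechanistic anomaly detection would be far more involved:

```python
# Toy sketch of anomaly detection relative to an explanation (illustrative only).
# Assume the (hypothetical) explanation says: "the safe behavior holds because
# the activations' projection onto direction d stays in its usual range."
# We flag new inputs only when that projection looks unusual, ignoring all the
# orthogonal variation that makes every input "different".

import numpy as np

rng = np.random.default_rng(0)
dim = 512

d = rng.normal(size=dim)
d /= np.linalg.norm(d)  # explanation-relevant direction (hypothetical)

train_acts = rng.normal(size=(100_000, dim))   # activations on the training set
train_proj = train_acts @ d
mu, sigma = train_proj.mean(), train_proj.std()

def anomalous_wrt_explanation(activation, threshold=6.0):
    """Flag only deviations along the direction the explanation depends on."""
    z = abs(activation @ d - mu) / sigma
    return z > threshold

new_normal = rng.normal(size=dim)              # new input, same distribution
new_shifted = rng.normal(size=dim) + 10.0 * d  # shift along the relevant direction

print(anomalous_wrt_explanation(new_normal))   # False: new activations, same story
print(anomalous_wrt_explanation(new_shifted))  # True: the fact the explanation relied on broke
```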
Speaker 2 Okay, so what happens if even in training it is deceptive and it has basically its internal monologue is, I will only rebel or I will only do something that looks bad if I can prove that I am in the real world and deployed.
Speaker 2 And presumably, since it's trained on all the web texts, it knows what like real world means, right? It's not in control of factories and robot armies or whatever.
Speaker 2 So
Speaker 2 in that case, even in training, it will have
Speaker 2 those treacherous activations, because in the back of its mind it's thinking,
Speaker 2 I will take over once I have the opportunity. So you wouldn't detect anything anomalous once it is deployed and is still thinking those treacherous thoughts.
Speaker 2 Or do you just think this is not a likely scenario or would you be able to?
Speaker 1
I think that's sort of the typical case you would like to detect. Yeah.
So the story would be something like, you want to explain on the training distribution why the model never like stabs anyone.
Speaker 1 Yeah. And you're like, what's my explanation on the training distribution? It's actually going to have to go through that model.
Speaker 1 Like supposing the only reason it doesn't stab people is it's like, well, I know that I'm being watched and I can't stab people unless I'm really not being watched.
Speaker 1 Then your explanation for why it never stabs people is like, oh look, it knows. I mean, it won't be annotated with English, because it's just some giant basket of numbers. But it's tracking along with the inner monologue and saying: well, it thinks to itself, am I being watched? And in fact it normally concludes that it is being watched on the training distribution, right? You have this whole distribution, and so the explanation is saying it knows it's being watched and therefore it decides not to stab anyone. So ideally that's what the explanation does, not annotated in English, just a giant bag of numbers that's effectively talking about that process. So then you have a new input on which the model says, am I on the training distribution?
Speaker 1 Like, no, I'm not being watched.
Speaker 1 Then if you compare that against the explanation, the explanation said, like, hey, the reason it doesn't stab anyone is because it just constantly thinks that it's being trained.
Speaker 1 Like, this thing is always, yeah, this non-linear feature, the representation of am I being trained, is constantly saying yes.
Speaker 1 That's the explanation for why it never stabs people in the training distribution.
Speaker 1 And then at test time, when you run it on the new input, it's like, do I think I'm on the training distribution? It says no. You compare that against your explanation.
Speaker 1 You're like, actually, the explanation breaks down completely now.
Speaker 1 The explanation said it was precisely because it thought it was being trained, that it doesn't do anything really dangerous or doesn't stab anyone.
Speaker 1 And so once we see that single variable switch, we know that this is not just an anomaly, this is the anomaly that matters or is like an anomaly that is very relevant to the explanation.
Speaker 2 So this would require it to have this,
Speaker 2 what is that cone that dogs have? But basically it would require this to have that MRI on it all the time.
Speaker 2 When it's deployed on every single call, you need to be running this loop of like, is this the explanation still holding?
Speaker 1 Yeah, for this story of like flagging an anomaly, you have to actually check at a sufficiently high frequency.
Speaker 1 So you could imagine checking all the time, or you could imagine saying, hey, actually you need a lot of AI systems cooperating to not blow the whistle on each other or whatever. E.g., you might have checked that in training, every time something suspicious happens, the model flags it.
Speaker 1 Why does that happen? You have your explanation, you look at a new case, and you can say: is it not flagging this for the normal reason, or is it for a weird reason? So you could hope that
Speaker 1 Even if a small fraction, even if you're running a small fraction of cases, if something bad is happening at scale, you can detect that.
Speaker 1 If you imagine one in every thousand decisions, you check if there's like a robot rebellion going on or whatever, you get to look in on one out of a thousand decisions.
Speaker 1 You see that something is really weird with that decision. And even if you didn't check many of them, you're like, okay, we need to back off.
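A rough back-of-the-envelope for the "check one in a thousand decisions" point, under the simplifying assumption that audits are independent; the numbers are hypothetical:

```latex
% Audit each decision independently with probability p. If coordinated bad
% behavior spans k decisions, the chance that none of them gets audited is
\[
\Pr[\text{miss all}] = (1 - p)^{k} \approx e^{-pk}.
\]
% With p = 10^{-3} and k = 10{,}000 coordinated decisions,
\[
(1 - 10^{-3})^{10\,000} \approx e^{-10} \approx 4.5 \times 10^{-5},
\]
% so behavior that only shows up at scale is very likely to be observed at least once.
```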
Speaker 2 What happens if the AI, I mean, if it's trained on web text, that includes your papers, or maybe it just rederives from first principles that this is an alignment technique one might try.
Speaker 2
And then it says, okay, well, I'm going to engineer my activations in a way. I'm trying to imagine how we might do this.
But basically,
Speaker 2 even in deployment, I'm going to make my
Speaker 2 thought process so subtle that it doesn't raise any alarms. Like, is that a possibility?
Speaker 1
Yeah. So, I think the hope is to basically be robust.
So, like, you know, when we're doing theoretical research, we are in a kind of worst-casey regime.
Speaker 1 So, we are normally imagining that everything conspires insofar as it can. So, insofar as, yeah, could a brain be like doing some crazy thing where it tries to structure itself to look weird?
Speaker 1 That's totally in scope for ruling something out.
Speaker 1 And I guess the hope there is, it's actually not that easy to tell a story. Like, why did you not stab people at training time? Why are you not stabbing now?
Speaker 1 Like, why is that not applying now? A mind doesn't get to arbitrarily change the actual reasons it does something. Like, there was some earlier step.
Speaker 1 Maybe you could think of it as like at the first step where it's like, now I'm going to try and do the sneaky thing to make my thoughts look this way.
Speaker 1 Like the actual explanation, like the actual causal explanation for the behavior is like there.
Speaker 1 Like if you tried to give this reason, it would have to, like, maybe we should talk about the example in more detail to try and imagine like how a brain would try and do this or how a mind would try and do this.
Speaker 1 But I think it is like.
Speaker 2 Let's say it knows it has this plan. So in training, it
Speaker 2 keeps the part of
Speaker 2 like it activates a part of the...
Speaker 1 Yeah, but.
Speaker 1 Yeah, the problem is if it ever does a check that's like, am I in training or not? then your explanation for the behavior is going to have to like factor through that check.
Speaker 1 And so like if the check, like if you ever do the check, the check becomes like an essential part of the explanation. And then when the check fails, your explanation breaks down.
Speaker 1 So you've kind of already lost the game if you did such a check.
Speaker 2 Is there a way it could make it seem like
Speaker 2 because if on every different input it has different activations, then can it make it seem like the different activations were a result of just a different input and not because some internal check went off?
Speaker 1 Yeah, again, the hope is it needs to not only have them be random different inputs, it needs to produce inputs with certain properties.
Speaker 1 Like it needs to deliberately create inputs that have this effect in the world, this intended effect in the world.
Speaker 2 So it basically creates a jailbreak for itself in some sense.
Speaker 1 Yeah, again, the question is like, it has to at test time be somehow producing activations that have the like takeover effect.
Speaker 1 And at training time, be producing activations that don't look that takeovery or whatever. And so like somehow there was a difference between those.
Speaker 1 It has to actually like, it has to really do the work of creating that difference somewhere.
Speaker 2 Can you make more concrete for us what an explanation would look like? Like is it physically, what does the explanation look like? Is it just a whole bunch of numbers itself?
Speaker 2 Is it a language? Is it code? What does the explanation look like?
Speaker 1 Yeah, I also want to maybe step back a tiny bit and clarify that, like, I think this project is kind of crazily ambitious.
Speaker 1 And the main reason, like the overwhelming reason I think you should expect it to break down or fail is just because
Speaker 1
we have all these desires. We have all the things we want out of this notion of explanation.
But that's an incredibly hard research project, which has a reasonable chance of being impossible.
Speaker 1 So I'm happy to talk about what the... what the implications are, but I want to flag that
Speaker 1 condition on failing, I think it's most likely because just like the things we wanted were either incoherent or intractably difficult.
Speaker 2 But what are the odds you think you'll succeed?
Speaker 1 I mean, it depends a little bit what you mean by succeed. But if you say get explanations that are like
Speaker 1 great and accurately reflect reality and work for all of these applications that we're imagining or that we are optimistic about, like kind of the best case success,
Speaker 1 I don't know, like 10, 20%, something at a ballpark.
Speaker 1 And then there's a higher probability of various intermediate results that provide value or insight without being the whole dream.
Speaker 1 But I think the probability of succeeding in the sense of realizing the whole dream is quite low.
Speaker 1 Yeah, in terms of what explanations look like physically, or like the most ambitious plan, the most optimistic plan is that you are searching for explanations in parallel with searching for neural networks.
Speaker 1 So you have a parametrization of your space of explanations, which mirrors the parametrization of your space of neural networks.
Speaker 1 Or it's like you should think of it as kind of similar to like, what is a neural network? It's some like simple architecture where you fill in a trillion numbers and that specifies how it behaves.
Speaker 1 So too, you should expect an explanation to be a pretty flexible, general skeleton which just has a bunch of numbers you fill in.
Speaker 1 And, like, what you are doing to produce an explanation is primarily just filling in these floating-point numbers.
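As a loose analogy only, not ARC's formalism: the simplest continuously parametrized "explanation-like" object is a learned direction whose readout is trained by gradient descent to account for the model's behavior. Everything here is a hypothetical toy setup:

```python
# Loose analogy (not ARC's actual notion of explanation): an "explanation" as a
# skeleton with learned floating-point parameters, searched by gradient descent
# alongside the model. The hypothetical skeleton: "a direction w such that a
# linear readout of the activations' projection onto w reproduces the output".

import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(16, 64), nn.Tanh(), nn.Linear(64, 1))
hidden = nn.Sequential(model[0], model[1])           # activations to "explain"

w = nn.Parameter(torch.randn(64) / 8)                # explanation parameters: a direction...
a, b = nn.Parameter(torch.ones(1)), nn.Parameter(torch.zeros(1))  # ...plus a readout

opt = torch.optim.Adam([w, a, b], lr=1e-2)
x = torch.randn(4096, 16)

with torch.no_grad():
    target = model(x).squeeze(-1)                    # the behavior to be accounted for
    h = hidden(x)

for step in range(2000):
    pred = a * (h @ w) + b                           # what the "explanation" predicts
    loss = ((pred - target) ** 2).mean()             # how well it accounts for the behavior
    opt.zero_grad(); loss.backward(); opt.step()

print("residual behavior left unexplained:", loss.item())
```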
Speaker 2 You know, when we conventionally think of explanations, you know, if you think of the explanation for why the universe
Speaker 2 moves this way, it wouldn't be something that you could discover on some smooth evolutionary surface where you can, you know, climb up the hill towards the laws of physics.
Speaker 2
It's just, these are the laws of physics. You kind of just derive them from first principles.
But in this case,
Speaker 2 it's not just a bunch of correlations between the orbits of different planets or something.
Speaker 2 Maybe the word explanation has a different meaning here. I didn't even ask the question, but maybe you can just speak to that.
Speaker 1 Yeah, I mean, I think I'm basically sympathetic. There are some intuitive objections here.
Speaker 1 Like, look, the space of explanations has this rigid, logical character: a lot of explanations have this rigid logical structure where they're really precise, and simple things govern complicated systems, and nearby simple things just don't work, and so on. A bunch of things that feel totally different from this kind of nice continuously parametrized space.
Speaker 1 And you can imagine interpretability on simple models where you're just by gradient descent finding feature directions that have desirable properties.
Speaker 1 But then when you imagine like, hey, now that's like a human brain you're dealing with that's like thinking logically about things, the explanation of why that works isn't going to be just like, here were some feature directions.
Speaker 1 So that's how I understood the basic confusion which I share.
Speaker 1 or sympathize with at least.
Speaker 1 So I think the most important high-level point is like, I think basically the same objection applies to being like, how is GPT-4 going to learn to reason like logically about something?
Speaker 1 You're like, well, look, logical reasoning, that's like, it's got rigid structure.
Speaker 1 It's like, it's doing all this like, you know, it's doing ands and ors when it's called for, even though it just somehow optimized over this continuous space. And like
Speaker 1 the difficulty or the hope is that the difficulty of these two problems are kind of like matched.
Speaker 1 So that is, it's very hard to like find these like logical-ish explanations because it's not a space that's easy to search over. But there are like ways to do it.
Speaker 1 Like there's ways to embed discrete, complicated, rigid things in these nice, squishy, continuous spaces that you search over.
Speaker 1 And in fact, to the extent that neural nets are able to learn the rigid logical stuff at all, they learn it in the same way.
Speaker 1 That is, maybe they're hideously inefficient, or maybe it's possible to embed this discrete reasoning in the space in a way that's not too inefficient.
Speaker 1
But you really want the two search problems to be of similar difficulty. And that's the key hope overall.
I mean, this is always going to be the key hope.
Speaker 1 The question is, is it easier to learn a neural network or to find the explanation for why the neural network works?
Speaker 1 I think people have the strong intuition that it's easier to find the neural network than the explanation of why it works.
Speaker 1 And we are really, I think, exploring the hypothesis, or at least interested in the hypothesis, that maybe those problems are actually more matched in difficulty.
Speaker 2 And why might that be the case?
Speaker 1 This is pretty conjectural and complicated.
Speaker 1 To express some intuitions, like
Speaker 1 maybe one thing is like, I think a lot of this intuition does come from cases like machine learning.
Speaker 1 So if you ask about writing code, and you're like, how hard is it to find the code versus find the explanation that the code is correct?
Speaker 1 In those cases there's actually just not that much of a gap. The way a human writes code is basically the same difficulty as finding the explanation for why it's correct. In the case of ML,
Speaker 1 I think we just mostly don't have empirical evidence about how hard it is to find explanations of this particular type about why models work. We have a sense that it's really hard, but that's because we have this incredible mismatch where gradient descent is spending an incredible amount of compute searching for a model, and then some human is looking at neurons, or even some neural net is looking at neurons.
Speaker 1 Basically, because you cannot define what an explanation is, you're not applying gradient descent to the search for explanations.
Speaker 1 So, I think the ML case just like actually shouldn't make you feel that pessimistic about the difficulty of finding explanations.
Speaker 1 Like, the reason it's difficult right now is precisely because you don't have like any kind of, you're not doing an analogous search process to find this explanation as you do to find the model.
Speaker 1 That's just like a first part of the intuition. Like, when humans are actually doing design, I think there's not such a huge gap.
Speaker 1 Whereas in the ML case, I think there is a huge gap, but it's, I think, largely for other reasons.
Speaker 1 A thing I also want to stress is that
Speaker 1 we just are open to there being a lot of facts that don't have particularly compact explanations.
Speaker 1 So another thing is when we think of finding an explanation, in some sense, we're setting our sights really low here.
Speaker 1 So if a human designed a random widget and was like, this widget appears to work well.
Speaker 1 Or if you search for a configuration that happens to fit into this spot really well, it's like a shape that happens to mesh with another shape.
Speaker 1 You might be like, what's the explanation for why those things mesh? And we're very open to just being like, that doesn't need an explanation. You just compute.
Speaker 1 You check that the shapes mesh and like you did a billion operations and you check this thing worked.
Speaker 1 Or you're like, why do these proteins bind? You're like, it's just because these shape, like this is a low energy configuration.
Speaker 1 And we're very open to there being, in some cases, not very much more to say. So we're only trying to explain cases where the surprise, intuitively, is very large.
Speaker 1 So for example, if you have a neural net that gets a problem correct, you know, a neural net with a billion parameters that gets a problem correct on every input of length 1,000.
Speaker 1 In some sense, there has to be something that needs explanation there because there's like too many inputs for that to happen by chance alone.
Speaker 1 Whereas if you have a neural net that gets something right on average or gets something right in merely a billion cases, that actually can just happen by coincidence.
Speaker 1 GPT-4 can get billions of things right by coincidence because it just has so many parameters that are adjusted to fit the data.
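A rough version of that counting argument, with illustrative numbers; this is the general shape of the "surprise accounting" point, not a formal statement:

```latex
% A model with 10^9 parameters stored in 32-bit floats encodes at most about
\[
32 \times 10^{9} \ \text{bits}
\]
% of selected information. A fixed model being correct on all 2^{1000} inputs
% of length 1000 has probability on the order of
\[
2^{-2^{1000}}
\]
% under a chance (null) model, far too surprising to be absorbed by having
% selected among at most 2^{32 \times 10^{9}} parameter settings, so the
% regularity demands explanation. Being right in merely 10^{9}
% coin-flip-like cases costs only about
\[
2^{-10^{9}},
\]
% which selection of the parameters alone can account for.
```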
Speaker 2 So a neural net that is initialized completely randomly, the explanation for that would just be the neural net itself.
Speaker 1
Well, it would depend what behaviors it had. So we're always like talking about an explanation of some behavior from a model.
Right.
Speaker 2 And so it just has a whole bunch of random behaviors. So there'll just be like an exponentially large explanation relative to the weights of the model.
Speaker 1 Yeah, there aren't. I mean, I think there just aren't that many behaviors that demand explanation.
Speaker 1 Like most things a random neural net does are kind of what you'd expect from like a random, you know, if you treat it just like a random function, then there's nothing to be explained.
Speaker 1 There are some behaviors that demand explanation, but like, I mean, yeah. Anyway, random neural net's pretty uninteresting.
Speaker 1 It's pretty, like, that's part of the hope is it's kind of easy to explain features of the random neural net.
Speaker 2 Okay, so this is interesting. So the smarter or more ordered the neural network is, the more compressed the explanation.
Speaker 1 Well, it's more like the more interesting the behaviors to be explained. So the random neural net just doesn't have very many interesting behaviors that demand explanation.
Speaker 1 And as you get smarter, you start having behaviors that are like, you know, you start having some correlation with the simple thing, and then that demands explanation.
Speaker 1 Or you start having some regularity in your outputs, and then that demands explanation. So these properties kind of emerge gradually over the course of training that demand explanation.
Speaker 1 I also, again, want to emphasize here that when we're talking about searching for explanations, this is like, this is some dream.
Speaker 1 We talk to ourselves, like, why would this be really great if we succeeded? We have no idea about the empirics on any of this.
Speaker 1 So these are all just words that we think to ourselves and sometimes talk about to understand: would it be useful to find a notion of explanation and what properties would we like this notion of explanation to have?
Speaker 1 But this is really like speculation and being out on a limb. Almost all of our time, day to day, is just thinking about cases much, much simpler even than small neural nets, or like
Speaker 1 yeah, thinking about very simple cases and saying, like, what is the correct notion?
Speaker 1 Like, what is like the right heuristic estimate in this case, or like, how do you reconcile these two apparently conflicting explanations?
Speaker 2 Is there a hope that you could,
Speaker 2 if you have a different way to make proofs now, that you can actually have heuristic arguments where instead of having to prove
Speaker 2 the Riemann hypothesis or something, you can come up with a probability of it in a way that is compelling and you can publish?
Speaker 2 So would it just be a new way to do mathematics, a completely new way to prove things in mathematics?
Speaker 1 So I think most claims in mathematics that mathematicians believe to be true already have fairly compelling heuristic arguments. It's like the Riemann hypothesis, it's actually just,
Speaker 1 there's kind of a very simple argument that the Riemann hypothesis should be true unless something surprising happens.
Speaker 1 And so like a lot of math is about saying, okay, we did a little bit of work to find the first-pass explanation of why this thing should be true.
Speaker 1 And then like, for example, in the case of the Riemann hypothesis, the question is like, do you have this like weird periodic structure in the primes?
Speaker 1 And you're like, well, look, if the primes were kind of random, you obviously wouldn't have any structure like that. Like, just how would that happen? And then
Speaker 1 you're like, well, maybe there's something. And then the whole activity is about searching for like, can we rule out anything?
Speaker 1 Can we rule out any kind of conspiracy that would break this result? So I think the mathematicians just wouldn't be very surprised or wouldn't care that much.
Speaker 1 And this is related to the motivation for the project.
Speaker 1 But I think just in a lot of domains, in a particular domain, people already have norms of reasoning that work pretty well and match roughly how we think these heuristic arguments should work.
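One standard way to spell out the "primes as random" heuristic for the Riemann hypothesis; the equivalence in the first line is classical, and the random-model step is the heuristic part:

```latex
% The Riemann hypothesis is equivalent to square-root cancellation in the
% partial sums of the Moebius function:
\[
M(x) = \sum_{n \le x} \mu(n) = O\!\left(x^{1/2 + \varepsilon}\right)
\quad \text{for every } \varepsilon > 0 .
\]
% Heuristic: if the values \mu(n) \in \{-1, 0, +1\} behaved like independent
% random signs on squarefree n, then M(x) would look like a random walk with
% about x steps, so typically
\[
M(x) \approx \sqrt{x},
\]
% i.e. RH holds "unless something surprising happens", unless the primes
% conspire in a way the random model misses.
```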
Speaker 2 But it would be good to have more concrete sense. If you could say instead of, well, we think RSA is fine to being able to say, here's the probability that RSA is fine.
Speaker 1 Yeah, my guess is these will not, like, the estimates you get out of this would be much, much worse than the estimates you'd get out of like just normal empirical or scientific reasoning, where you're like using a reference class and saying, how often do people find algorithms for hard problems?
Speaker 1 Like, I think what this argument will give you for "is RSA fine" is going to be like, well, RSA is fine unless it isn't.
Speaker 1 Like, unless there's some additional structure in the problem that an algorithm can exploit, then there's no algorithm.
Speaker 1 But very often, the way these arguments work, so for neural nets as well, is you say, look, here's an estimate about the behavior.
Speaker 1 And that estimate is right unless there's another consideration we've missed.
Speaker 1 And the thing that makes them so much easier than proofs is to just say, here's a best guess given what we've noticed so far. But that best guess can be easily upset by new information.
Speaker 1 And that's both what makes them easier than proofs, but also what means they're just way less useful than proofs for most cases.
Speaker 1 I think neural nets are kind of unusual in being a domain where we really do want to do systematic formal reasoning, even though we're not trying to get a lot of confidence.
Speaker 1 We're just trying to understand even roughly what's going on.
Speaker 2 But the reason this works for alignment, but isn't that interesting for the Riemann hypothesis? Where if in the RSA case, you say, well, the RSA is fine unless it isn't,
Speaker 2
unless this estimate is wrong. It's like, well, okay, well, it would tell us something new.
But in the alignment case, if the estimate is,
Speaker 2 this is what the output should be, unless there's some behavior I don't understand, then you want to know about the case where there's some behavior you don't understand. That's not like, oh, whatever.
Speaker 2 That's like, that's the case in which it's not aligned.
Speaker 1 Yeah, I mean, maybe one way of putting it is just like, we can wait until we see this input, or like you can wait until we see a weird input and say, okay, did this weird input do something we didn't understand?
Speaker 1 And for RSA, that would just be a trivial test. You're just like, is there some algorithm that breaks it, is that a thing?
Speaker 1 Whereas for a neural net, in some cases, it is either very expensive to tell, or it's like you actually don't have any other way to tell.
Speaker 1 Like, you checked in easy cases, and now you're on a hard case, so you don't have a way to tell if something has gone wrong.
Speaker 1 Also, I would clarify that, like, I think it is interesting for the Riemann hypothesis. I would say, like,
Speaker 1 the current state, particularly in number theory, but maybe in quite a lot of math, is that there are informal heuristic arguments for pretty much all the open questions people work on. But those arguments are completely informal. It's not the case that there's, here are the norms of informal reasoning, or the norms of heuristic reasoning, and then we have arguments that a heuristic argument verifier could accept. It's just that people wrote some words. I think those words, like,
Speaker 1 My guess would be like, you know, 95% of the things mathematicians accept as like really compelling heuristic arguments are correct.
Speaker 1
And like if you actually formalize them, you'd be like, some of these aren't quite right. Or here's some corrections.
Or here's which of two conflicting arguments is right.
Speaker 1 I think there's something to be learned from it. I don't think it would be like mind-blowing, though.
Speaker 2 When you have it completed, how big would this heuristic estimator, the rules for this heuristic estimator be?
Speaker 2
I mean, I know like when Russell and who was the other guy when they did the rules for logic. Yeah, yeah.
Wasn't it like literally they had like a bucket or a wheelbarrow with all the papers?
Speaker 2 But how big would a...
Speaker 1 I mean, mathematical foundations are quite simple in the end. Like at the end of the day, it's like, you know, how many symbols?
Speaker 1 Like, I don't know, it's hundreds of symbols or something that go into the entire foundations.
Speaker 1 And the entire rules of reasoning for like, you know, those are sort of built on top of first order logic.
Speaker 1 But the rules of reasoning for first-order logic are just like, you know, another few hundred symbols or 100 lines of code or whatever.
Speaker 1
I'd say like, I have no idea. Like, we are certainly aiming at things that are just not that complicated.
Like, and my guess is that the algorithms we're looking for are not that complicated.
Speaker 1 Like, most of the complexity is pushed into arguments, not in this verifier or estimator.
Speaker 2 So for this to work, you need to come up with an estimator, which is a way to integrate different heuristic arguments together.
Speaker 1
It has to be a machine that takes this input. First, it takes an input argument and decides what it believes in light of it.
Just kind of like saying, was it compelling? But
Speaker 1 second,
Speaker 1 it needs to take four of those and then say, here's what I believe in light of all four, even though there are different estimation strategies that produce different numbers.
Speaker 1
And that's a lot of our life is saying, well, here's a simple thing that seems reasonable. And here's a simple thing that seems reasonable.
What do you do?
Speaker 1
And there's supposed to be a simple thing that unifies them both. And the obstruction to getting that is understanding what happens when these principles are slightly in tension.
Speaker 1 And how do we deal with that?
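A minimal illustration of that flavor, my gloss rather than ARC's formal definition: a default estimate that presumes independence, plus a correction supplied by an argument that notices structure:

```latex
% Default estimate, presuming independence of two quantities arising in a
% computation:
\[
\widetilde{\mathbb{E}}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y].
\]
% An argument that points out a correlation supplies a correction term:
\[
\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y] + \operatorname{Cov}(X, Y).
\]
% The estimator's job is to take whatever collection of noticed structure it is
% handed and output a single revised best guess, rather than demanding a proof
% that no further structure remains.
```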
Speaker 2 Yeah, that seems super interesting.
Speaker 2 We'll see what other applications it has. I don't know, like computer security and code checking, if you can put bounds on things, like actually say, this is how safe we think this code is, in a very formal way.
Speaker 1 My guess is we're not going to add, I mean, this is both a blessing and a curse. It's a curse in that you're like, well, it's sad that your thing is not that useful, but a blessing in that...
Speaker 1
not useful things are easier. My guess is we're not going to add that much value in most of these domains.
Like most of the difficulty comes from like
Speaker 1 the fact that, for a lot of the code you'd want to verify, not all of it but a significant part, the difficulty of formalizing the proof is the hard part, and actually getting all of that to go through.
Speaker 1 And we're not going to help even the tiniest bit with that, I think. So this would be more helpful if you have code that uses simulations.
Speaker 1 You want to verify some property of a controller that involves some numerical error or whatever. You need to control the effects of that error.
Speaker 1 That's where you start saying, well, heuristically, if the errors are independent, blah, blah, blah.
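The standard back-of-the-envelope being alluded to for numerical error, stated loosely:

```latex
% Summing n floating-point terms, each step contributing a rounding error of
% size at most u (the unit roundoff). Worst case, the errors all align:
\[
|\text{error}| \lesssim n\,u .
\]
% Heuristically, if the rounding errors are independent with mean zero, they
% behave like a random walk and the typical accumulated error is only about
\[
|\text{error}| \sim \sqrt{n}\,u ,
\]
% a "true unless the errors conspire" estimate that a heuristic argument can
% certify but a proof generally cannot.
```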
Speaker 2 Yeah, you're too honest to be a salesman, Paul.
Speaker 1 I mean, this is kind of like sales to us, right? Like, if you talk about this idea, people are like, why would that not be like the coolest thing ever and therefore impossible?
Speaker 1
And we're like, well, actually, it's kind of lame. And we're just trying to pitch.
Like, it's way lamer than it sounds. And that's really important to why it's possible.
Speaker 1 It's being like, it's really not going to blow that many people's minds. I mean, I think it will be cool.
Speaker 1 I think it will be like very, if we succeed, it will be very solid, like meta-mathematics or theoretical computer science or whatever. But I don't think.
Speaker 1 Right again, I think the mathematicians already do this reasoning and they mostly just love proofs. I think the physicists do a lot of this reasoning, but they don't care about formalizing anything.
Speaker 1 I think like in practice, other difficulties are almost always going to be more salient. I think this is of most interest by far for interpretability in ML.
Speaker 1 And I think other people should care about it and probably will care about it if successful, but I don't think it's going to be the biggest thing ever in any field or even that huge a thing.
Speaker 1 I think this would be a terrible career move given the
Speaker 1 ratio of difficulty to
Speaker 1
impact. I think theoretical computer science, it's probably a fine move.
I think in other domains, it just wouldn't be worth it.
Speaker 2 We're going to be working on this for years, at least in the best case. I'm laughing because my next question was going to be a setup for you to explain this, if some grad student wants to work on this.
Speaker 1 I think theoretical computer science is an exception, where I think this is in some sense what the best of theoretical computer science is like. So you have all this reasoning, you have this... because it's useless,
Speaker 1 I mean, I think
Speaker 1 Like an analogy, I think like one of the most successful sagas in theoretical computer science is like formalizing the notion of an interactive proof system.
Speaker 1 And it's like you have some kind of informal thing that's interesting to understand and you want to like pin down what it is and construct some examples and see what's possible and what's impossible.
Speaker 1 And this is like, I think this kind of thing is the bread and butter of like the best parts of theoretical computer science.
Speaker 1 And then again, I think mathematicians, like it may be a career mistake because the mathematicians only care about proofs or whatever, but that's a mistake in some sense aesthetically.
Speaker 1 Like if successful, I do think looking back, and again, part of why it's a mistake is such a high probability, we wouldn't be successful.
Speaker 1 But I think looking back, people would be like, that was pretty cool.
Speaker 1 Like, although not that cool, or like, we understand why it didn't happen, given like the epistemic, like, what people cared about in the field, but it's pretty cool now.
Speaker 2 But isn't it also the case that, didn't Hardy write that all this prime shit is not useful but it's fun to do, and it turned out that all the cryptography is based on all that prime shit? So I don't know, it could have... but anyways, I'm trying to set you up so that you can tell, and forget about whether it has applications in all those other fields, it matters a lot for alignment.
Speaker 2 And that's why I'm trying to set you up to talk about if,
Speaker 2 you know, the smart, I don't know, math, I think a lot of smart people listen to this podcast.
Speaker 2 If they're a math or CS grad student
Speaker 2 and has gotten interested in this, are you looking to potentially find talent to help you with this? Yeah, maybe we'll start there.
Speaker 2 And then I also want to ask you if, I think also, maybe people who can provide funding might be listening to the podcast. So
Speaker 2 to both of them, what is your pitch?
Speaker 1
Yeah, so we're definitely, definitely hiring and searching for collaborators. Yeah.
I think the most useful profile is
Speaker 1 probably a combination of like intellectually interested in this particular project and motivated enough by alignment to work on this project, even if it's really hard.
Speaker 1 I think there are a lot of good problems.
Speaker 1 So the basic fact that makes this problem unappealing to work on, I'm a really good salesman, but I think the only reason this isn't a slam dunk thing to work on is that like
Speaker 1
there are not great examples. So we've been working on it for a while, but we do not have beautiful results as of the recording of this podcast.
Hopefully by the time it airs, you'll have to do it.
Speaker 1 That's what the little subscript is for, that's like, they've had great results since then.
Speaker 2 But it was
Speaker 2 too long to put in the margins of the podcast.
Speaker 1 Yeah.
Speaker 1 With luck.
Speaker 1
Yeah, so I think it's hard to work on because it's not clear what a success looks like. It's not clear if success is possible.
But I do think there's a lot of questions.
Speaker 1 We have a lot of questions.
Speaker 1 And
Speaker 1 I think the basic setting of, look, there are all of these arguments. So in mathematics, in physics, in computer science, there's just a lot of examples of informal heuristic arguments.
Speaker 1 They have enough structural similarity that it looks very possible that there is like a unifying framework, that these are instances of some general framework and not just a bunch of random things.
Speaker 1 Like not just a bunch of, it's not like, so for example, for the prime numbers, people reason about the prime numbers as if they were like a random set of numbers.
Speaker 1 One view is like, that's just a special fact about the primes. They're kind of random.
Speaker 1 A different view is like, actually, it's pretty reasonable to reason about an object as if it was a random object as a starting point.
Speaker 1 And then as you notice structure, you revise from that initial guess. And it looks to me like the second perspective is probably more right.
Speaker 1 It's just like reasonable to start off treating an object as random and then like notice perturbations from random, like notice structure the object possesses.
Speaker 1
And the primes are unusual in that they have fairly little like additive structure. I think it's a very natural theoretical project.
There's like a bunch of activity that people do.
Speaker 1 It seems like there's a reasonable chance it has some, there's something nice to say about unifying all of that activity. I think it's a pretty exciting project.
Speaker 1 The basic strike against it is that it seems really hard. Like if you were someone's advisor, I think you'd be like, what are you going to prove if you work on this for the next two years?
Speaker 1 And they'd be like, There's a good chance nothing. And then like it's not what you do if you're a PhD student normally.
Speaker 1 You aim for those high probabilities of getting something within a couple years.
Speaker 1 The flip side is it does feel, I mean, I think there are a lot of questions. I think some of them we're probably going to make progress on.
Speaker 1 So, like, I think the pitch is mostly like, are some people excited to get in now?
Speaker 1 Or are people more like, ah, let's wait to see, like, once we have one or two good successes to see what the pattern is and become more confident we can turn the crank to make more progress in this direction.
Speaker 1 But for people who are excited about working on stuff with reasonably high probabilities of failure and not really understanding exactly what you're supposed to do,
Speaker 1 I think it's a pretty good project.
Speaker 1 I feel like if people look back, if we succeed and people are looking back in like 50 years on like what was the coolest stuff happening in math or theoretical computer science, there will be like a reasonable, this will definitely be like in contention.
Speaker 1 And I would guess for lots of people would just seem like the coolest thing from this period of a couple of years or whatever.
Speaker 2 Right. Because this is a new method in
Speaker 2 so many different fields, from the ones you mentioned: physics, math, theoretical computer science.
Speaker 2 That's really, I don't know, man, because what is the average math PhD working on? Right? He's not, he's working on like some
Speaker 2 subset of a subset of something I can't even understand or pronounce. But
Speaker 1 math is quite esoteric.
Speaker 2 But yeah, this seems like, I don't know, even the small chance of it working, like forget about the value for, you shouldn't forget about the value for alignment.
Speaker 2 But even without that, this is such a cool, if this works, it's like a really big, it's a big deal.
Speaker 1 There's a good chance that if I had my current set of views about this problem and didn't care about alignment and had the career safety to just like spend a couple of years thinking about it or spend half my time for like five years or whatever, that I would just do that.
Speaker 1 even without caring at all about alignment. It's just a nice, it's a very nice problem.
Speaker 1 It's very nice to have this like library of things that succeed, where it's just, they feel so tantalizingly close to being formalizable, at least to me, and
Speaker 1 such a natural setting, and then just have so little purchase on it. It's like a,
Speaker 1 there aren't that many really exciting feeling frontiers in like theoretical computer science.
Speaker 2 And then, so, a
Speaker 2 smart person, it doesn't have to be a grad student, but like a smart person is interested in this. What should they do?
Speaker 2 Should they try to attack some open problem you have put on your blog, or should they,
Speaker 2 what is the next step?
Speaker 1 Yeah, I think like a first-pass step, like,
Speaker 1 there's different levels of ambition or whatever, different ways of approaching the problem. But like we have this write-up from last year, or I guess 11 months ago or whatever,
Speaker 1 on formalizing the presumption of independence that provides like, here's kind of a communication of what we're looking for in this object.
Speaker 1 And like, I think the motivating problem is saying, like, here's a notion of what an estimator is, and here's what it would mean for an estimator to capture some set of informal arguments. And like
Speaker 1 a very natural problem is just try and do that,
Speaker 1 go for the whole thing, try and understand, and then come up with
Speaker 1 hopefully a different approach, or then end up having context from a different angle on the kind of approach we're taking. I think that's a reasonable thing to do.
Speaker 1 I do think we also have a bunch of open problems. So maybe we should put up more of those open problems.
Speaker 1 And the main concern with doing so is that for any given one, we're like, this is probably hopeless.
Speaker 1 Like, we put up a prize earlier in the year for an open problem, which tragically, I mean, I guess the time is now to post the debrief from that, or I owe it from this weekend.
Speaker 1 I was supposed to do that so probably do it tomorrow but
Speaker 1 no one solved it
Speaker 2 It's sad putting out problems that are hard. Or, I don't know, we could put out a bunch of problems that we think might be really hard. But I mean, what was that famous case of that statistician, it was some PhD student who showed up late to a class and he saw some problems on the board and he thought they were homework, and they were actually just open problems, and then he solved them because he thought they were homework, right?
Speaker 1 Yeah. I mean, we have much less information that these problems are hard.
Speaker 1 Again, I expect the solution to most of our problems to not be that complicated. We have not solved them, and we've been working on it in some sense for a really long time.
Speaker 1 Like, you know, total years of full-time equivalent work across the whole team is like
Speaker 1 probably like three years of full-time equivalent work in this area
Speaker 1 spread across a couple people. But like,
Speaker 1 that's very little compared to a problem.
Speaker 1 Like, it is very easy to have a problem where you put in three years of full-time equivalent work, but in fact, there's still an approach that's going to work quite easily within three to six months if you come at it from a new angle.
Speaker 1 And like, we've learned a fair amount from that that we could share, and we probably will be sharing more over the coming months.
Speaker 2 As far as funding goes, is this something where, I don't know if somebody gave you a whole bunch of money that would help or does it not matter? How many people are working on this, by the way?
Speaker 1 So we have been, right now there's four of us full-time and we're hiring for more people.
Speaker 2 And then
Speaker 2 is funding that would matter?
Speaker 1
I mean, funding is always good. We're not super funding constrained right now.
The main effect of funding is it will cause me to continuously and perhaps indefinitely delay fundraising.
Speaker 1 Periodically, I'll set out to be interested in fundraising, and someone will offer a grant, and then I will get to delay fundraising for another six months or nine months, whatever.
Speaker 1 So you can delay the time at which Paul needs to think for some time about fundraising.
Speaker 2 Well, one question I think it'd be interesting to ask you is,
Speaker 2 you know, I think people can talk vaguely about the value of theoretical research and how it contributes to real-world applications. And, you know, you can look at historical examples or something.
Speaker 2 But you are somebody who actually has done this in a big way. Like RLHF is
Speaker 2 something you developed, and then it actually has gotten into an application that has been used by millions of people. Tell me about just that pipeline.
Speaker 2 How can you reliably identify theoretical problems that will matter for real-world applications? Because it's one thing to read about Turing or something, and
Speaker 2 the halting problem, but here you'd have the real thing.
Speaker 1 Yeah, I mean, it is definitely exciting to have worked on a thing that has a real-world impact. The main caveat I'd provide is
Speaker 1 RLHF is very, very simple
Speaker 1 compared to many things.
Speaker 1 And like, so the motivation for working on that problem was like, look, this is how it probably should work, or like, this is a step in some like progression.
Speaker 1 It's unclear if it's like the final step or something, but it's a very natural thing to do that like people probably should be and probably will be doing.
Speaker 1 I'm saying like, if you want to do, if you want to talk about crazy stuff, it's good to like help make those steps happen faster.
Speaker 1 And it's good to learn about like what are, there's lots of issues that occur in practice, even for things that seem very simple on paper.
Speaker 1 But mostly the story is just like, yep, I think my sense of the world is that things that look like good ideas on paper often are harder than they look, but the world isn't that far from what makes sense on paper. Like, large language models look really good on paper, and RLHF looks really good on paper, and these things, I think, just work out in a way that's
Speaker 1 yeah, I think people maybe overestimate, or like,
Speaker 1 maybe it's kind of a trope, but people talk about how it's easy to underestimate how much gap there is to practice, how many things will come up that don't come up in theory.
Speaker 1 But it's also easy to overestimate how inscrutable the world is. The things that happen mostly are things that do just kind of make sense.
Speaker 1 Yeah, I feel like most ML implementation does just come down to a bunch of detail, though, of like, you know, build a very simple version of the system, understand what goes wrong, fix the things that go wrong, scale it up, understand what goes wrong.
Speaker 1 And I'm glad I have some experience doing that, and I think that does cause me to be better informed about what makes sense in ML and what can actually work.
Speaker 1 But I don't think it caused me to have a whole lot of deep expertise
Speaker 1 or like deep wisdom about how to close the gap.
Speaker 2 Yeah, yeah.
Speaker 2 But
Speaker 2 is there some tip on identifying things like RLHF which actually do matter versus making sure you don't get stuck in some theoretical problem that doesn't matter? Or is it just coincidence?
Speaker 2 Or I mean, is there something you can do in advance to make sure that the thing is useful?
Speaker 1 I don't know if the RLHF story is like the
Speaker 1 best success case or something, but...
Speaker 1 Oh, because
Speaker 1 the capabilities. Maybe I'd say more profoundly, like, again, it's just not that hard a case.
Speaker 1 It's a little bit unfair to be like, I'm going to predict the thing which I pretty much think was going to happen at some point.
Speaker 1 And so it was mostly a case of acceleration.
Speaker 1 Whereas the work we're doing right now is specifically focused on something that's like kind of crazy enough that it might not happen, even if it's a really good idea or challenging enough, it might not happen.
Speaker 1 But I'd say, in general,
Speaker 1 and this draws a little bit on broader experience in theory, it's just that
Speaker 1 a lot of the time when theory fails to connect with practice, it's just kind of clear
Speaker 1 it's not going to connect if you actually think about it: what are the key constraints in practice?
Speaker 1 Is the theoretical problem we're working on actually connected to those constraints? Is there a path?
Speaker 1 Like, is there something that is possible in theory that would actually address like real world issues?
Speaker 1 I think the vast majority, like as a theoretical computer scientist, the vast majority of theoretical computer science has very little chance of ever affecting practice, but also it is completely clear in theory that has very little chance of affecting practice.
Speaker 1 Most of the theory fails to affect practice not because of all the stuff you don't think of, but just because,
Speaker 1 you could call it dead on arrival, but that's also not really the point. It's just that mathematicians, also, are not trying to affect practice.
Speaker 1 And they're not like, why does my number theory not affect practice? It was kind of obvious.
Speaker 1 So I think the biggest thing is just actually caring about that and then learning at least what's basically going on in the actual systems you care about and what are actually the important constraints.
Speaker 1 And is this a real theoretical problem? The basic reason most theory doesn't do that is just like, that's not where the easy theoretical problems are.
Speaker 1 So I think theory is instead motivated by: we're going to build up the edifice of theory, and sometimes people will be opportunistic.
Speaker 1 Like opportunistically we'll find a case that comes close to practice or we'll find something practitioners are already doing and try and bring into our framework or something.
Speaker 1 But the theory of change is mostly not. This thing is going to make it into practice.
Speaker 1 It's mostly like this is going to contribute to the body of knowledge that will slowly grow and like sometimes opportunistically yield important results.
Speaker 2 How big do you think a seed AI would be?
Speaker 2 What is the minimum sort of encoding of something that is as smart as a human?
Speaker 1 I think it depends a lot what substrate it gets to run on. So if you tell me like, like, how much computation does it get before, or like what kind of real-world infrastructure does it get?
Speaker 1 Like you could ask, what's the shortest program which like if you run it on a million H100s connected in like a nice network with like a hospitable environment will eventually go to the stars.
Speaker 1 But that seems like it's probably on the order of like tens of thousands of bytes or I don't know. If I had to guess a median, I'd guess 10,000 bytes.
Speaker 2 Wait, wait, the specification, or the compression of just the program? A program which wouldn't run?
Speaker 1
Oh, got it, got it, got it. But that's going to be really cheatsy.
So you could ask what's the thing that has values and will like expand and like roughly preserve its values as that proceeds.
Speaker 1 Because that thing, the 10,000-byte thing, will just lean heavily on evolution and natural selection to get there.
Speaker 1 For that,
Speaker 1 I don't know,
Speaker 1 a million bytes? A million bytes, or 100,000 bytes, something like that.
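A rough way to pin down the quantity being estimated here (my framing, not a definition Paul gives) is as a shortest-program question relative to a fixed hardware endowment:

```latex
% U stands in for the fixed substrate ("a million H100s in a hospitable
% environment"), and X is the target outcome, e.g. "eventually expands to
% the stars", or the stricter version "expands while roughly preserving
% its initial values".
\[
  |p^{*}| \;=\; \min \bigl\{\, |p| \;:\; U(p) \text{ achieves } X \,\bigr\}
\]
% The guesses in the conversation: on the order of 10^4 bytes if the
% program can lean on evolution and natural selection to do the work,
% and more like 10^5 to 10^6 bytes if it must itself encode values that
% are preserved as it expands.
```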
Speaker 2 How do you think AI lie detectors will work? Where you kind of just look at the activations and
Speaker 2 not find explanations in the way you were talking about with heuristics, but literally just like, here's what truth looks like, here's what lies look like, let's just segregate the latent space
Speaker 2 and see if we can identify the two.
Speaker 1 Yeah, I think to separate that out: just training a classifier to do it is a little bit complicated for a few reasons and may not work.
Speaker 1 But if you broaden the space and say, hey, you want to know if someone's lying.
Speaker 1 You get to interrogate them, but also you get to like rewind them arbitrarily and make a million copies of them. I do think it's like pretty hard to lie successfully.
Speaker 1 You get to like look at their brain, even if you don't quite understand what's happening. You get to rewind them a million times.
Speaker 1 You get to run all those parallel copies into gradient descent or whatever.
Speaker 1 I think there's a pretty good chance that you can just tell if someone is lying.
Speaker 1 Like a brain emulation or an AI or whatever.
Speaker 1 Unless they were aggressively selected.
Speaker 1 If it's just that they're trying to lie well, rather than that they were selected over many generations to be excellent at lying or something, then your ML system, you hopefully didn't train it a bunch to lie, and you want to be careful about whether your training scheme effectively does that.
Speaker 1 But yeah, that seems like it's more likely than not to succeed.
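For the simple "train a classifier on activations" version described in the question (which Paul notes is a bit complicated and may not work), a minimal sketch might look like the following; the activation data, layer choice, and labels below are hypothetical placeholders rather than anything from a real model:

```python
# Minimal sketch of a linear "lie detector" probe on model activations.
# Assumes you can extract a hidden-state vector for each statement and
# label it truthful (0) or deceptive (1); here that data is faked.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d = 512   # hypothetical hidden-state dimension
n = 2000  # hypothetical number of labeled statements

# Placeholder activations: in practice these would come from running the
# model on known-true and known-false statements and reading off a layer.
truth_acts = rng.normal(0.0, 1.0, size=(n // 2, d))
lie_acts = rng.normal(0.3, 1.0, size=(n // 2, d))  # slightly shifted cluster
X = np.vstack([truth_acts, lie_acts])
y = np.array([0] * (n // 2) + [1] * (n // 2))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A linear probe: if "truth" and "lies" correspond to roughly linearly
# separable directions in activation space, this should find them.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", probe.score(X_test, y_test))
```

The interrogation picture Paul describes, rewinding the subject, running a million copies, and comparing their answers, is strictly stronger than what a single probe like this gives you.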
Speaker 2 And how possible do you think it will be for us to specify human-verifiable rules for reasoning, such that
Speaker 2 even if the AI is super intelligent, we can't really understand why it does certain things, we know that the way in which it arrives at these conclusions is valid.
Speaker 2 Like if it was trying to persuade us of something, we can be like, I don't understand all the steps, but I know that this is something that's valid and you're not just making shit up.
Speaker 1 That seems very hard if you wanted to be competitive with learned reasoning.
Speaker 1 So I don't, I mean, it depends a little bit exactly how you set it up, but for like the ambitious versions of that, to say it would address the alignment problem,
Speaker 1 they seem pretty unlikely. You know, like 5%, 10% kind of thing.
Speaker 2 Is there an upper bound on intelligence? Not in the near term, but just super intelligence at some point.
Speaker 1 How far do you think that can go? It seems like it's going to depend a little bit on what is meant by intelligence.
Speaker 1 It kind of reads as a question that's similar to like, is there an upper bound on like strength or something? Like there are a lot of forms.
Speaker 1 I think it's like the case that for, yeah, I think there are like sort of arbitrarily smart input-output functionalities.
Speaker 1 And then, like, if you hold fixed the amount of compute, there is some smartest one. If you're just like, what's the best set of like 10 to the 40th operations?
Speaker 1 There's just, there's only finitely many of them. So, some like best one for any particular notion of best that you have in mind.
Speaker 1 So I guess, for the unbounded question, where you're allowed to use arbitrary description complexity and compute, probably no. And for the bounded one,
Speaker 1 I mean, there is some optimal conduct. Like, if I have some goal in mind, what action best achieves it?
Speaker 1 If you imagine a little box embedded in the universe, I think there's kind of just an optimal input-output behavior.
Speaker 1 So I guess in that sense, I think there is an upper bound, but it's not saturable in the physical universe, because it's definitely exponentially slow.
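To make the fixed-compute part of that concrete (a sketch of the counting argument, not something Paul spells out): for any finite compute budget there are only finitely many candidate input-output behaviors, so an optimum exists for any scoring rule, even though nothing guarantees it is physically reachable.

```latex
% Pi_C is the (finite) set of input-output behaviors implementable within
% a compute bound C, e.g. C = 10^40 operations; V is whatever notion of
% "best" you have in mind.
\[
  \pi^{*} \;=\; \arg\max_{\pi \in \Pi_{C}} V(\pi),
  \qquad |\Pi_{C}| < \infty ,
\]
% whereas with unbounded description length and compute there is no upper
% bound, and even the bounded optimum is astronomically slow to realize.
```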
Speaker 2 Right, yeah, yeah. Or
Speaker 2 because of comms or other things, or heat, it just might be physically impossible to instantiate something smarter than this.
Speaker 1 Yeah, I mean, for example, if you imagine what the best thing is, it would almost certainly involve just simulating every possible universe it might be in, modulo moral constraints, which I don't know if you want to include.
Speaker 1 But so that would be very, very slow. It would involve simulating all, you know, it's sort of like,
Speaker 1 I don't know exactly how slow, but like double exponential, very slow.
Speaker 2 Carl Shulman laid out his picture of the intelligence explosion in the seven-hour episode.
Speaker 2 I know you guys have talked a lot. What do you make of his basic picture? Do you have some main disagreements? Is there some crux that you guys have explored?
Speaker 1 It's related to our timelines discussion from earlier. I think the biggest,
Speaker 1 yeah, I think the biggest issue is probably error bars, where like Carl has a very, like, very software-focused, very fast kind of takeoff picture. And I think that is plausible, but not that likely.
Speaker 1 Like, I think there are a couple of ways you could perturb the situation, and my guess is one of them applies.
Speaker 1 So maybe I have like,
Speaker 1 I don't know exactly what Carl's probability is. I feel like Carl's going to have like a 60% chance on some crazy thing that I'm only going to assign like a 20% chance to or 30% chance or something.
Speaker 1 And I think those kinds of perturbations are like,
Speaker 1 one, how long a period is there of complementarity between AI capabilities and human capabilities, which will tend to soften takeoff.
Speaker 1 Two, how much diminishing returns are there on software progress, such that a
Speaker 1 broader takeoff, involving scaling electricity production and hardware production, is likely to happen during takeoff? There I'm more like 50-50 or more.
Speaker 1 Stuff like this.
Speaker 2 Yeah, okay. So is it that you think the ultimate constraints will be harder? The
Speaker 2 basic case he's laid out is that you can just have a sequence of things like FlashAttention or MoE, and you can just keep stacking these kinds of things on.
Speaker 1 I'm very unsure if you can keep stacking them. It's kind of a question of what the returns curve looks like.
Speaker 1 And Carl has some inference from historical data, or some way he'd extrapolate the trend. I am more like 50-50 on whether the software-only intelligence explosion is even possible.
Speaker 1 And then a somewhat higher probability that it's slower than...
Speaker 2 but you think it might not be possible?
Speaker 1 Well, so the entire question is: if you double R&D effort, do you get enough additional improvement to further double the efficiency?
Speaker 1 And that question will itself be a function of your hardware base, like how much hardware you have.
Speaker 1 And the question is, at the amount of hardware we're going to have and the level of sophistication we have as the process begins, is it the case that each doubling of,
Speaker 1 well, actually it mostly just depends on the hardware; each level of hardware will have some point at which this dynamic asymptotes.
Speaker 1 So the question is just, for how long is it the case that each doubling of R&D at least doubles the effective output of your, you know, AI research population? And I think I have a higher probability on that. I think it's kind of close if you look at the empirics, and I think the empirics benefit a lot from the continuing hardware scale-up, so that the effective R&D stock is significantly smaller than it looks, if that makes sense.
Speaker 2 What are the empirics you're referring to?
Speaker 1 So there's kind of two sources of evidence. One is looking across a bunch of industries at what the general improvement is with each doubling of either R&D investment or experience, where it is quite exceptional to have a field with...
Speaker 1 anyway, it's pretty good to have a field where each time you double R&D investment, you get a doubling of efficiency.
Speaker 1 The second source of evidence is on actual algorithmic improvement in ML, which is obviously much, much scarcer.
Speaker 1 And there, you can make a case that each doubling of R&D has given you roughly a 4x or something increase in computational efficiency.
Speaker 1 But there's a question of how much that benefits from the scale-up. When I say the effective R&D stock is smaller, I mean: as we scale up, you're doing a new task.
Speaker 1 Every couple of years you're doing a new task, because you're operating at a scale much larger than the previous scale. And so a lot of your effort is how to make use of the new scale.
Speaker 1 So if you're not increasing your installed hardware base and just flat at a level of hardware, I think you get much faster diminishing returns than people have gotten historically.
Speaker 1 I think Carl agrees in principle this is true. And then once you make that adjustment, I think it's very unclear where the empirics shake out.
Speaker 1 I think Carl has thought about these more than I have, so I should maybe defer more. But anyway, I'm at like 50-50 on that.
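A compact way to state that crux (my paraphrase of the condition described above, not a model either of them endorses): write $S$ for the cumulative R&D stock at a fixed level of hardware and $A(S)$ for the effective output of the AI research population. The software-only loop keeps accelerating only while doubling the input at least doubles the output:

```latex
% Self-sustaining condition for a software-only intelligence explosion,
% holding the installed hardware base fixed: each doubling of cumulative
% R&D effort S must at least double effective research output A, since
% that output is what purchases the next doubling of S.
\[
  \frac{A(2S)}{A(S)} \;\ge\; 2 .
\]
% The ML empirics mentioned suggest roughly a 4x efficiency gain per
% doubling of R&D historically, but much of that rode on hardware
% scale-up, so the fixed-hardware ratio is unclear; hence the ~50-50.
```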
Speaker 2 How have your timelines changed over the last 20 years?
Speaker 1 Last 20 years? Yeah.
Speaker 2 How long have you been working on anything related to AI?
Speaker 1 So I started thinking about this stuff in like 2010
Speaker 1 or so. So I think my first, my earliest timeline prediction will be in like 2011.
Speaker 1 I think in 2011 my rough picture was: we will not have insane AI in the next 10 years, and then I get increasingly uncertain after that, but we converge to, you know, 1% per year or something like that.
Speaker 1 And then probably in 2016, my take was like we won't have crazy AI in the next five years, but then we converge to like one or 2% per year after that.
Speaker 1 Then in 2019, I guess I made a round of forecasts
Speaker 1 where I gave like 30% or 25% or something to crazy AI by 2040, and like 10% by 2030 or something like that. So I think my 2030 probability has been kind of stable and my 2040 probability has been going up.
And I would guess it's too sticky.
Speaker 1 I guess that 40% I gave at the beginning is just like from not having updated recently enough, and I maybe just need to sit down. I would guess that should be even higher.
Speaker 1
I think like 15% in 2030, I'm not feeling that bad about. This is just like each passing year is like a big update against 2030.
Like, we don't have that many years left.
Speaker 1 Um, and that's like roughly counterbalanced with AI going pretty well. Whereas for like the 2040 thing, like the passing years are not that big a deal.
Speaker 1 And like, as we see that, like, things are basically working, that's like cutting out a lot of the probability of not having AI by 2040.
Speaker 1 So, yeah, my 2030 probability is up a little bit, like maybe twice as high as it used to be, or something like that. My 2040 probability is up more, much more significantly.
Speaker 2 How fast do you think
Speaker 2 we can keep building fabs to keep up with AI demand?
Speaker 1 Yeah, I don't know much about any of the relevant areas. My best guess is like,
Speaker 1 I mean, my understanding is right now like
Speaker 1 5% or something of next year's
Speaker 1 total output of best-process fabs will be making AI hardware, of which
Speaker 1 only a small fraction will be going into very large training runs, like only a couple, so maybe a couple percent of total output.
Speaker 1 And then that represents maybe like 1% of total possible output: a couple percent of leading process, 1% of total or something. I don't know if that's right.
Speaker 1 I think that's the rough ballpark we're in.
Speaker 1 I think things will be like pretty fast as you scale up for like the next order of magnitude or two from there because you're basically just shifting over other stuff.
Speaker 1 My sense is it would be like years of delay. There's like multiple reasons that you expect years of delay for going past that.
Speaker 1 Maybe even at that, you start having, yeah, there's just a lot of problems. Like building new fabs is quite slow.
Speaker 1 And I don't think there's like, TSMC is not like planning on increases in total demand driven by AI, like kind of conspicuously not planning on it.
Speaker 1 I don't think anyone else is really ramping up production in anticipation either.
Speaker 1 So I think, and then similarly, like just building data centers of that size seems like very, very hard and also probably has multiple years of delay.
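As a back-of-the-envelope reading of the fab percentages Paul sketches above: the leading-process share of total fab output and the training-run fraction below are my assumed placeholders, not numbers he states.

```python
# Rough ballpark implied by the conversation, with two assumed inputs.
ai_share_of_leading_process = 0.05     # ~5% of next year's leading-process output to AI hardware (from the conversation)
leading_process_share_of_total = 0.25  # ASSUMPTION: placeholder for leading-edge share of all fab output
training_share_of_ai = 0.3             # ASSUMPTION: "only a small fraction" into very large training runs

ai_share_of_total = ai_share_of_leading_process * leading_process_share_of_total
training_share_of_total = ai_share_of_total * training_share_of_ai
print(f"AI hardware as share of total fab output: ~{ai_share_of_total:.1%}")
print(f"Very large training runs as share of total: ~{training_share_of_total:.2%}")
```

With those placeholders the result lands near the "roughly 1% of total output" ballpark mentioned above; changing the assumed shares moves it proportionally.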
Speaker 2 What does your portfolio look like?
Speaker 1 I've tried to get rid of most of the AI stuff that's like plausibly implicated in
Speaker 1 policy work or like
Speaker 1 advocacy on the RSP stuff or my involvement with Anthropic.
Speaker 2 What would it look like if you had no conflicts of interest?
Speaker 1
And no inside information. Like, I also still have a bunch of hardware investments, which I need to think about.
But like,
Speaker 1 I don't know. A lot of TSMC.
Speaker 1
I have a chunk of NVIDIA, although I just keep betting against NVIDIA constantly since 2016 or something. I've been destroyed on that bet.
Although AMD has also done fine.
Speaker 1 I just, well, the case now is even easier, but it's similar to the case in the old days. It's just a very expensive company, given the total amount of R&D investment they've made.
Speaker 1 They have like, whatever, a trillion dollar valuation or something.
Speaker 1 That's like
Speaker 1 very high.
Speaker 1 So the question is, how expensive is it to make a TPU such that it actually out-competes the H100 or something? And I'm like, wow.
Speaker 1 It's a really high level of incompetence if Google can't catch up fast enough to make that trillion-dollar valuation not justified.
Speaker 2 Whereas with TSMC, it's much harder. They have a harder moat, you think?
Speaker 1 Yeah, I think it's a lot harder, especially if you're in this regime where you're trying to scale up.
Speaker 1 So if you're unable to build fabs, I think it will take a very long time to build as many fabs as people want.
Speaker 1 The effect of that will be to bid up the price of existing fabs and existing semiconductor manufacturing equipment.
Speaker 1 And so just those hard assets will become spectacularly valuable, as will the existing GPUs and the actual...
Speaker 1 Yeah.
Speaker 1
Yeah, I think it's just hard. That seems like the hardest asset to scale up quickly.
So it's like the asset, if you have a rapid run-up, it's the one that you'd expect to most benefit. Whereas like
Speaker 1 NVIDIA's stuff will ultimately be replaced by either better stuff made by humans or stuff made by AI assistants. The gap will close even further as you build AI systems.
Speaker 2 Right. Unless NVIDIA is using those systems.
Speaker 1
Yeah. The point is just that any new R&D will so dwarf past R&D as you get near that point.
And there's like just not that much stickiness. There's less stickiness in the future than there has been in the past.
Speaker 1 Yeah.
Speaker 1 I don't know. So, not commenting from any private information, just my gut, with the caveat that this is like the single bet I've most lost on:
Speaker 1 not including NVIDIA in that portfolio.
Speaker 2 And final question. There's a lot of schemes out there for alignment.
Speaker 2 And I think just like a lot of general takes, and a lot of this stuff is over my head, where it literally took me like weeks to understand
Speaker 2 the mechanistic anomaly stuff you work on.
Speaker 2 Without spending weeks, how do you detect bullshit? Like people have explained their schemes to me, and I'm like, honestly, I don't know if it makes sense or not.
Speaker 2 With you, I'm just like, I trust Paul enough that I think there's probably something here if I try to understand this enough. But with other, yeah, how do you, how do you detect bullshit?
Speaker 1 Yeah, so I think it depends on the kind of work. So for like the kind of stuff we're doing, my guess is like most people, there's just not really a way you're going to tell whether it's bullshit.
Speaker 1 So I think it's important that we don't spend that much money, and the people who want to hire us are probably going to dig in in depth.
Speaker 1 I don't think there's a way you can tell whether it's bullshit without either spending like a lot of effort or leaning on deference.
Speaker 1 With empirical work, it's interesting in that you do have some signals of the quality of work. You can be like, I mean, does it work in practice?
Speaker 1 Like, does the story, I think the stories are just radically simpler. And so you probably can evaluate those stories on their face.
Speaker 1 And then you mostly come down to these questions about what are the key difficulties. Yeah, I tend to be optimistic.
Speaker 1 When people dismiss something because this doesn't deal with a key difficulty or this runs into the following insuperable obstacle, I tend to be a little bit more skeptical about those arguments and tend to think like...
Speaker 1
Yeah, something can be bullshit because it's not addressing a real problem. That's like I think the easiest way.
Like this is a problem someone's interested in.
Speaker 1 That's just like not actually an important problem and there's no story about why it's going to become an important problem, e.g., like, it's not a problem now and won't get worse, or it is maybe a problem now, but it's clearly getting better.
Speaker 1 That's like one way.
Speaker 1 And then, conditioned on like passing that bar, like, dealing with something that actually engages with important parts of the argument for concern, and then like actually making sense empirically.
Speaker 1 So I think most work is anchored by a source of feedback, which is actually engaging with real models. So it's like, does the way it engages with real models make sense?
Speaker 1 And does the story about how it
Speaker 1 deals with key difficulties actually make sense?
Speaker 1 I'm like pretty liberal past there.
Speaker 1
I think it's really hard to, e.g., people look at mechanistic interpretability and say, well, this obviously can't succeed. And I'm like, I don't know.
How can you tell it obviously can't succeed?
Speaker 1 Like, I think it's reasonable to take total investment in the field, how fast it's making progress, and ask,
Speaker 1 how does that pencil out? I think most things people work on, though, actually pencil out pretty fine. They look like they could be reasonable investments.
Speaker 1 Things are not like super out of whack.
Speaker 2
Okay, great. This is, I think, a good place to close.
Paul, thank you so much for your time.
Speaker 1
Yeah, thanks for having me. It was good chatting.
Yeah, absolutely.
Speaker 2
Hey, everybody. I hope you all enjoyed that episode.
As always, the most helpful thing you can do is to share the podcast. Send it to people you think might enjoy it.
Speaker 2
Put it on Twitter, your group chats, et cetera. Just blitz the world.
I appreciate you listening. I'll see you next time.
Cheers.