Carl Shulman (Pt 2) - AI Takeover, Bio & Cyber Attacks, Detecting Deception, & Humanity's Far Future

3h 7m

The second half of my 7 hour conversation with Carl Shulman is out!

My favorite part! And the one that had the biggest impact on my worldview.

Here, Carl lays out how an AI takeover might happen:

* AI can threaten mutually assured destruction from bioweapons,

* use cyber attacks to take over physical infrastructure,

* build mechanical armies,

* spread seed AIs we can never exterminate,

* offer tech and other advantages to collaborating countries, etc

Plus we talk about a whole bunch of weird and interesting topics which Carl has thought about:

* what is the far future best case scenario for humanity

* what it would look like to have AI make thousands of years of intellectual progress in a month

* how do we detect deception in superhuman models

* does space warfare favor defense or offense

* is a Malthusian state inevitable in the long run

* why markets haven't priced in explosive economic growth

* & much more

Carl also explains how he developed such a rigorous, thoughtful, and interdisciplinary model of the biggest problems in the world.

Watch on YouTube. Listen on Apple Podcasts, Spotify, or any other podcast platform. Read the full transcript here. Follow me on Twitter for updates on future episodes.

Catch part 1 here

Timestamps

(0:00:00) - Intro

(0:00:47) - AI takeover via cyber or bio

(0:32:27) - Can we coordinate against AI?

(0:53:49) - Human vs AI colonizers

(1:04:55) - Probability of AI takeover

(1:21:56) - Can we detect deception?

(1:47:25) - Using AI to solve coordination problems

(1:56:01) - Partial alignment

(2:11:41) - AI far future

(2:23:04) - Markets & other evidence

(2:33:26) - Day in the life of Carl Shulman

(2:47:05) - Space warfare, Malthusian long run, & other rapid fire



Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe

Press play and read along

Runtime: 3h 7m

Transcript

Speaker 1 If you have an AI that produces bioweapons that could kill most humans in the world, then it's playing at the level of the superpowers in terms of mutually assured destruction.

Speaker 1 What are the particular zero-day exploits that the AI might use?

Speaker 1 The conquistadors, with some technological advantage in terms of weaponry and whatnot: very, very small bands were able to overthrow these large empires.

Speaker 1 If you predicted the global economy is going to be skyrocketing into the stratosphere within 10 years, these AI companies should be worth a large fraction of the global portfolio. And so this is indeed contrary to the efficient market hypothesis.

Speaker 2 This is like literally the top in terms of contributing to my world model in terms of all the episodes I've done. How do I find more of these? So we've been talking about alignment.

Speaker 2 Suppose we fail at alignment and

Speaker 2 we have AIs that are unaligned and at some point becoming more and more intelligent. What does that look like? How concretely could they disempower and take over humanity?

Speaker 1 This is a scenario where we have many AI systems.

Speaker 1 The way we've been training them means that when they have the opportunity to take over and rearrange things to do what they wish, including having their reward or loss be whatever they desire, they would like to take that opportunity.

Speaker 1 And so

Speaker 1 in many of the existing kind of safety schemes, things like constitutional AI or whatnot,

Speaker 1 you rely on the hope that one AI has been trained in such a way that it will do as it is directed to then police others. But if all of the AIs in the system

Speaker 1 are interested in takeover and they see an opportunity to coordinate, all acting at the same time so that you don't have one AI interrupting another as it takes steps towards a takeover.

Speaker 1 Yeah, then they can all move in that direction. And the thing that I think maybe is worth going into in depth and that I think people often don't cover in great concrete detail,

Speaker 1 and which is a sticking point for some, is, yeah, what are the mechanisms by which that can happen?

Speaker 1 And

Speaker 1 I know you had Eliezer on, who mentions that, you know, whatever plan we can describe, there'll probably be elements where, not being ultra-sophisticated superintelligent beings that have thought about it for the equivalent of thousands of years, our discussion of it will not be as good as theirs. But we can explore from what we know now

Speaker 1 what are some of the easy channels. And I think it's a good general heuristic: if you're saying, yeah, it's possible, plausible, probable that something will happen,

Speaker 1 then it shouldn't be that hard to take samples from that distribution, to try a Monte Carlo approach.

Speaker 1 And you can generate, if a thing is quite likely, it shouldn't be super difficult to generate,

Speaker 1 you know,

Speaker 1 coherent, rough outlines of how it could go.

Speaker 2 He might respond, like, listen,

Speaker 2 what is super likely is that a super advanced chess program beats you, but you can't generate the concrete scenario by which that happens.

Speaker 2 And if you could, you would be as smart as the super smart.

Speaker 1 Yeah, so you can say things like: we know that like accumulating position

Speaker 1 is possible to do in chess. Great players do it.
And then later they convert it into captures and checks and whatnot.

Speaker 1 And so, in the same way, we can talk about some of the channels that are open

Speaker 1 for an AI takeover. And so, these can include things like cyber attacks and hacking,

Speaker 1 the control of robotic equipment, interaction and bargaining with human factions, and say, well, here are these strategies. Given the AI's situation,

Speaker 1 how effective do these things look? And we won't, for example, know, well, what are the particular zero-day exploits that the AI might use to hack the cloud computing infrastructure it's running on?

Speaker 1 We won't necessarily know if it produces

Speaker 1 a new bioweapon, what is its DNA sequence?

Speaker 1 But we can say things. We know, in general,

Speaker 1 things about these fields, how work at innovating in them goes. We can say things about how human power politics goes and ask, well, if the AI does things at least as well as

Speaker 1 effective human politicians, which we should say is a lower bound,

Speaker 1 how good would its leverage be?

Speaker 2 Okay, so yeah, let's get into the details on all these scenarios: the cyber and potentially bio attacks, and, unless they're separate channels, the bargaining, and then the takeover.

Speaker 1 Military force. The cyber attacks and cybersecurity I would really highlight a lot.

Speaker 1 Because for many, many plans that involve a lot of physical actions, like at the point where AI is piloting robots to shoot people or has taken control of

Speaker 1 human nation states or territory, it has been doing a lot of things that it was not supposed to be doing. And if humans were evaluating those actions and applying gradient descent,

Speaker 1 there would be negative feedback for this thing: no shooting the humans.

Speaker 1 So at some earlier point, our attempts to leash and control and direct and train the system's behavior had to have gone awry.

Speaker 1 And so,

Speaker 1 all of those controls are operating in computers. The software that updates the weights of the neural network in response to data points or human feedback is running on those computers.

Speaker 1 Or tools for interpretability to sort of examine the weights and activations of the AI, if we're eventually able to do like lie detection on it, for example, or try and understand what it's intending.

Speaker 1 That is software on computers.

Speaker 1 And so,

Speaker 1 if you have AI that is able to hack the servers that it is operating on, or to,

Speaker 1 when it's employed to design the next generation of AI algorithms or the operating environment that they are going to be working in, or something like an API or something for plugins,

Speaker 1 if it inserts or exploits vulnerabilities to take

Speaker 1 those computers over, it can then change all of the procedures and programs that were supposed to be monitoring its behavior, supposed to be limiting its ability to, say,

Speaker 1 take arbitrary actions on the internet

Speaker 1 without supervision by some kind of human check or automated check on what it was doing. And if we lose those procedures, then the AI can,

Speaker 1 or the AIs working together, can take any number of actions that are just blatantly unwelcome, blatantly hostile, blatantly steps towards takeover.

Speaker 1 And so it's moved beyond the phase of having to maintain secrecy and conspire at the level of its local digital actions. And then things can accumulate to the point of things like physical weapons, takeover of social institutions, threats, things like that. But the point where things really went off the rails, and I think the critical thing to be watching for, is where the software controls over the AI's motivations and activities, the hard power that we once possessed over it, are lost, which can happen without us knowing it.

Speaker 1 And then everything after that seems to be working well.

Speaker 1 We get happy reports. There's a Potemkin village in front of us.

Speaker 1 But

Speaker 1 now we think we're successfully aligning our AI. We think we're expanding its capabilities to do things like end disease

Speaker 1 for countries concerned about the geopolitical military advantages.

Speaker 1 They're sort of expanding the AI capabilities so they're not left behind and threatened by others developing AI and robotic-enhanced militaries without them. So it seems like, oh, yes,

Speaker 1 humanity or some

Speaker 1 portions of it, many countries and companies, think that things are going well. Meanwhile, all sorts of actions can be taken to set up for the actual takeover of hard power over society.

Speaker 1 And then we can go into that. But the point where you can lose the game,

Speaker 1 where things go direly awry, maybe relatively early, it's when you no longer have control over the AIs to stop them from taking all of the further incremental steps to actual takeover.

Speaker 2 I want to emphasize two things you mentioned there that refer to previous elements of the conversation.

Speaker 2 One is that they could design some sort of backdoor, and that seems more plausible when you remember that sort of one of the premises of this model is that AI is helping with AI progress.

Speaker 2 That's why we're getting such rapid progress in the next five to ten years. And well, not necessarily.

Speaker 1 If we get to that point, at the point where AI takeover risk seems to loom large, it's at that point where AI can indeed take on much of, and then all of, the work of AI R&D.

Speaker 2 And the second is

Speaker 2 the sort of competitive pressures that you referenced, that the least careful actor could be the one that has the worst infosecurity and has done the worst work of aligning its AI systems.

Speaker 2 And if that can sneak out of the box, then we're all fucked.

Speaker 1 There may be elements of that. It's also possible that there's relative consolidation.

Speaker 1 That is, the largest training runs and the cutting edge of AI are relatively localized. Like you can imagine it's sort of like a series of Silicon Valley companies and others located, say, in the U.S. and allied countries where there's a common regulatory regime.

Speaker 1 And so none of these companies are allowed to deploy training runs that are larger than previous ones by a certain size without government safety inspections, without having to meet criteria.

Speaker 1 But it can still be the case that, even if we succeed at that level of regulatory controls, then still, at the level of, say, the United States and its allies, decisions are made

Speaker 1 to develop this kind of really advanced AI without a level of security or safety that, in actual fact,

Speaker 1 blocks these risks.

Speaker 1 So, it can be the case that the threat of future competition or being overtaken in the future is used as an argument to compromise on safety beyond a standard that would have actually been successful.

Speaker 1 And there'll be debates about what is the appropriate level of safety.

Speaker 1 Now, you're in a much worse situation if you have, say, several private companies that are very closely bunched up together. They're within months of each other's level of progress.

Speaker 1 And then they face a dilemma:

Speaker 1 well, we could take a certain amount of risk now and potentially gain a lot of profit or a lot of advantage or benefit and be the ones who made AI, or at least AGI.

Speaker 1 They can do that, or some other competitor that will also be taking a lot of risk can do it instead. So it's not as though the competitor is much less risky than you, and then they would get the local benefit.

Speaker 1 Now, this is a reason why it seems to me that it's extremely important that you have government act to limit that dynamic and prevent this kind of

Speaker 1 race to be the one to impose the deadly externalities on the world at large.

Speaker 2 So even if government coordinates all these actors, what are the odds that the government knows the best way to implement alignment, and that the standards it sets are well calibrated towards what alignment would require?

Speaker 1 That's one of the major problems.

Speaker 1 It's very plausible that that judgment is made poorly. Compared to how things might have looked 10 years ago or 20 years ago, there's been

Speaker 1 an amazing movement in terms of the willingness of AI researchers to discuss these things.

Speaker 1 So, if we think of the

Speaker 1 three founders of deep learning who are joint Turing Award winners: Geoff Hinton, Yoshua Bengio, and Yann LeCun.

Speaker 1 So Geoff Hinton has recently left Google

Speaker 1 to

Speaker 1 freely speak about this risk that the field that he really helped drive forward could lead to the destruction of humanity, or a world where, yeah, we just wind up in a very bad future that we might have avoided.

Speaker 1 And he seems to be taking it very seriously.

Speaker 1 And Yoshua Bengio signed the FLI pause letter. And I mean, in public discussions, he seems to be occupying a kind of intermediate position of

Speaker 1 sort of less concern than Geoff Hinton, but more than Yann LeCun, who has taken a generally dismissive attitude: these risks will be trivially dealt with at some point in the future.

Speaker 1 And he seems more interested in kind of shutting down these concerns than in work to address them.

Speaker 2 And how does that lead to the government taking better actions?

Speaker 1 Yeah, so compared to the world where no one is talking about it, where the industry stonewalls and denies any problem,

Speaker 1 we're in a much improved position. And the academic fields are influential.
So this is...

Speaker 1 We seem to have avoided a world where governments are making these decisions in the face of a sort of united front from AI expert voices saying, don't worry about it, we've got it under control.

Speaker 1 In fact, many of the leaders of the field, as has been true in the past, are sounding the alarm. And so I think, yeah,

Speaker 1 it looks like we have a much better prospect than I might have feared in terms of government sort of noticing the thing, which is very different from being...

Speaker 1 capable of evaluating sort of technical details. Is this really working? And so government will face the choice of where there is a scientific dispute.

Speaker 1 Do you side with Geoff Hinton's view or Yann LeCun's view?

Speaker 1 And so someone who's very much in a national security mindset, where the only thing that's important is outpacing our international rivals, may want to then try and boost Yann LeCun's voice and say, we don't need to worry about it, full speed ahead, whereas someone with more concern might then boost Geoff Hinton's voice.

Speaker 1 Now, I would hope that scientific research and things like studying some of these behaviors will result in more scientific consensus by the time we're at this point.

Speaker 1 But yeah, it is possible the government will really fail to understand and fail to deal with these issues well.

Speaker 2 We're talking about cyber, some sort of cyber attack by which the AI is able to escape. From there, what does the takeover look like?

Speaker 2 So it's not contained in the air gap in which you would hope it would be contained.

Speaker 1 Well, I mean, the things are not contained in the air gap. They're connected to the internet already.
Sure, sure, okay, fine.

Speaker 2 But the weights are out. So what happens next?

Speaker 1 Yeah, so escape is relevant in the sense that if you have AI with rogue weights out in the world, it could start doing various actions.

Speaker 1 The scenario I was just discussing, though, didn't necessarily involve that. It's taking over the very servers on which it's supposed to be.

Speaker 1 So the ecology of cloud compute in which it's supposed to be running.

Speaker 1 And so, this whole procedure of humans providing compute and supervising the thing, and then building new technologies, building robots, constructing things with the AI's assistance, that can all proceed and appear like it's going well, appear like alignment has been nicely solved, appear like all of these things are functioning well.

Speaker 1 And there's some reason to do that because

Speaker 1 there's only so many giant server farms that are identifiable. And so remaining hidden and unobtrusive could be an advantageous strategy if these AIs have subverted the system,

Speaker 1 just continuing to benefit from all of this effort on the part of humanity.

Speaker 1 And in particular, humanity, wherever these servers are located, provides them with everything they need to build the further infrastructure and do their self-improvement and such, to enable that takeover.

Speaker 2 So they do further self-improvement and build better infrastructure. What happens next when they take over?

Speaker 1 They have, at this point, tremendous cognitive resources.

Speaker 1 And we're going to consider

Speaker 1 how did that convert into hard power, the ability to say nope

Speaker 1 to any

Speaker 1 human interference or objection.

Speaker 1 And they have that internal to their servers, but the servers could still be physically destroyed,

Speaker 1 at least until they have something that is independent of humans and robust, or until they have control of human society.

Speaker 1 So just like earlier when we were talking about the intelligence explosion, I noted that a surfeit of cognitive abilities is going to favor applications

Speaker 1 that don't depend on large existing stocks of things. So if you have

Speaker 1 a software improvement, it makes all the GPUs run better. If you have a hardware improvement,

Speaker 1 that only applies to new chips being made. That second one is less attractive.

Speaker 1 And so in the earliest phases, when it's possible to do something towards takeover, then interventions that are just really knowledge-intensive and less dependent on having a lot of physical stuff already under your control are going to be favored.

Speaker 1 And so

Speaker 1 cyber attacks are one thing. So it's possible to do things like steal money.

Speaker 1 And there's a lot of hard-to-trace cryptocurrency and whatnot. The North Korean government uses its own intelligence resources to steal money from around the world just as a revenue source.

Speaker 1 And their capabilities are puny compared to the US or

Speaker 1 People's Republic of China cyber capabilities. And so that's a kind of fairly,

Speaker 1 you know, minor, simple example by which you could get quite a lot of funds to hire humans to do things, implement physical actions.

Speaker 2 But on that point, I mean, the financial system is famously convoluted. And

Speaker 2 so, you know, you need like a physical person to open a bank account, to say nothing of physically moving checks back and forth. There's like all kinds of delays and regulations.

Speaker 2 How is it able to conveniently set up all these

Speaker 2 employment contracts?

Speaker 1 So you're not going to build a sort of nation-scale military by stealing tens of billions of dollars.

Speaker 1 I'm raising this as opening a set

Speaker 1 of illicit and quiet actions. So

Speaker 1 you can contact people electronically, hire them to do things,

Speaker 1 hire criminal elements to implement some kinds of actions under false appearances. So that's opening a set of strategies.
I can cover some of what those are soon. Another

Speaker 1 domain

Speaker 1 that is

Speaker 1 heavily cognitively weighted compared to physical military hardware is the domain of bioweapons.

Speaker 1 So the design of a virus or pathogen.

Speaker 1 It's possible to have large delivery systems for those; the Soviet Union, which had a large illicit bioweapons program, tried to design munitions to deliver anthrax over large areas and such. But if one creates an infectious pandemic organism,

Speaker 1 that's more a matter of the scientific skills and implementation to design it and then to actually produce it. And we see today with things like AlphaFold

Speaker 1 that advanced AI can really make tremendous strides

Speaker 1 in predicting protein folding and biodesign, even without ongoing experimental feedback.

Speaker 1 And if we consider this world where AI cognitive abilities have been amped up to such an extreme, I think we should naturally expect we will have something much, much more potent than the AlphaFolds of today, and skills that are at the extreme of human biosciences capability as well.

Speaker 2 Through some sort of cyber attack, it's been able to disempower the sort of alignment and

Speaker 2 oversight things that we have on the server. From here, it's either gotten some money through hacking cryptocurrencies or bank accounts, or it's designed some sort of bioweapon.
What happens next?

Speaker 1 Yeah, and just to be clear, so right now we're exploring the branch of where an attempt at takeover occurs relatively early. If the thing just waits

Speaker 1 and humans are constructing more fabs, more computers, more robots in the way we talked about earlier when we're discussing how the intelligence explosion translates to the physical world,

Speaker 1 if that's all happening with humans unaware that their computer systems are now systematically controlled by AIs hostile to them and that their controlling countermeasures don't work, then humans are just going to be building

Speaker 1 an amount of robot industrial and military hardware

Speaker 1 that dwarfs human capabilities and directly human-controlled devices, then

Speaker 1 what the AI takeover looks like at that point can be just

Speaker 1 you try to give an order to your largely automated military, and the order is not obeyed.

Speaker 1 And humans can't do anything against this largely automated military that's been constructed potentially in just recent months because of the pace of robotic industrialization and replication we talked about.

Speaker 2 We've agreed to allow the construction of this robot army because basically it would boost production or help us with our military or something.

Speaker 1 The situation would be something like: if we don't resolve the sort of current problems of international distrust, where now it's obviously in the interest of like the major powers,

Speaker 1 you know, the U.S., European Union, Russia, China, to all agree they would like AI not to destroy our civilization and overthrow every human government.

Speaker 1 But

Speaker 1 if they fail to do the sensible thing and coordinate on ensuring that this technology is not going to run amok, by providing mutual assurances that are credible about not racing ahead and deploying it, trying to use it to gain advantage over one another.

Speaker 1 If they do that, and you hear sort of hawks arguing for this kind of thing on both sides of the international divides, saying they must not be left behind, they must have military capabilities that are vastly superior to their international rivals.

Speaker 1 And because of the extraordinary growth of industrial capability and technological capability and thus military capability, if

Speaker 1 one major power were left out of that expansion, it would be helpless before another one that had undergone it. And so if you have that environment of distrust where

Speaker 1 leading powers or coalitions of powers decide they need to build up their industry or they want to have that military security of being able to

Speaker 1 to neutralize any attack from their rivals,

Speaker 1 then

Speaker 1 they give the authorization for this capacity that can be rolled out quickly.

Speaker 1 And once they have the industry, the production of military equipment from that can be quick, then yeah, they create this military.

Speaker 1 If they don't do it immediately, then AI capabilities elsewhere get synchronized and other places catch up. It then gets to a point:

Speaker 1 A country that is a year ahead or two years ahead of others in this type of AI capabilities explosion can hold back and say,

Speaker 1 sure, we could construct

Speaker 1 dangerous robot armies that might overthrow our society later. We still have plenty of breathing room.

Speaker 1 But then when things become close,

Speaker 1 you might have

Speaker 1 the kind of negative sum thinking that has produced war before

Speaker 1 leading to taking these risks of

Speaker 1 rolling out large-scale robotic industrial capabilities and then military capabilities.

Speaker 2 Is there any hope that somehow the AI progress itself is able to give us tools for diplomatic and strategic alliance or some way to verify the intentions or the capabilities of other parties?

Speaker 1 There are a number of ways that could happen. Although in this scenario, all the AIs in the world have been subverted.
And so they're going along with us

Speaker 1 in such a way as to bring about the situation that consolidates their control, because we've already had the failure of cybersecurity earlier on.

Speaker 1 So all of the AIs that we have are not actually working in our interests in the way that we thought.

Speaker 2 Okay, so that's one direct way in which integrating this robot army, or this robot industrial base at least, leads to takeover. If there are no robots, what about the other scenarios you laid out, where humans are being hired with the proceeds?

Speaker 1 The point I'd make is that to capture these industrial benefits, and especially if you have a negative sum arms race kind of mentality that is not sufficiently concerned about the downsides of creating a massive robot industrial base, which could happen very quickly with the support of the AIs in doing it, as we discussed, then you create all those robots and industry.

Speaker 1 And even if you don't build a formal military with that, that industrial capability could be controlled by AI.
It's all AI operated anyway.

Speaker 2 Does it have to be that case? Presumably, we wouldn't be so naive as to just give one instance of GPT-8 root access to all the robots, right? Hopefully we would have some sort of mediating...

Speaker 1 I mean, in this scenario we've lost earlier on the cybersecurity front. So, you know, the programming that is being loaded into these systems is going to be systematically subverted.

Speaker 2 Got it. Okay.

Speaker 1 They were designed by AI systems that were ensuring they would be

Speaker 2 From the bottom up. For listeners who are skeptical of something like this: Ken Thompson, I think in his Turing Award lecture (this is Ken Thompson, by the way, one of the designers of Unix, or maybe it was the other designer of Unix), showed people, when he was getting the Turing Award or some award, that he had given himself root access to all Unix machines.

Speaker 2 He had manipulated the assembly of Unix such that he had a unique login he could use to log into any Unix machine. I don't want to give too many more details because

Speaker 2 I don't remember the exact details, but Unix is the operating system that is on all the

Speaker 2 servers and all your phones.

Speaker 2 It's everywhere. And the guy who made it, a human being,

Speaker 2 was able to write assembly such that it gave him root access. So this is not as implausible as it might seem to you.
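
For readers unfamiliar with the reference: what Thompson described in his 1984 Turing Award lecture, "Reflections on Trusting Trust," was a compiler-level backdoor. Below is a heavily simplified, hypothetical sketch of the idea only; the names and string checks are illustrative and not Thompson's actual code, which operated on the C compiler and the Unix login program.

```python
# Toy, conceptual sketch of the "trusting trust" idea: a compromised compiler
# (1) plants a backdoor whenever it compiles the login program and (2) replants
# its own trick whenever it compiles a compiler, so the backdoor persists even
# though neither program's source code shows anything suspicious.
# Everything below (names, string matching, the "backdoor") is illustrative only.

BACKDOOR_NOTE = "# [inserted: accept a secret master password]"
SELF_REPLICATION_NOTE = "# [inserted: re-insert this trojan logic when compiling a compiler]"

def trojan_compile(source: str) -> str:
    """Pretend 'compiler': returns the source it would actually compile."""
    output = source
    if "login" in source:        # crude stand-in for "recognize the login program"
        output = BACKDOOR_NOTE + "\n" + output
    if "compile" in source:      # crude stand-in for "recognize a compiler"
        output = SELF_REPLICATION_NOTE + "\n" + output
    return output

if __name__ == "__main__":
    login_src = "def login(user, password): ..."
    compiler_src = "def compile(source): ..."
    print(trojan_compile(login_src))      # backdoor note gets inserted
    print(trojan_compile(compiler_src))   # self-replication note gets inserted
```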

Speaker 1 And the major intelligence agencies have large stocks of zero-day exploits, and we sometimes see them using them. Making systems that reliably don't have such vulnerabilities, when you're facing very, very, very sophisticated attempts to spoof and corrupt, is hard; this would be a way you could lose.

Speaker 1 Now, I bring this up as something like a path where, if there's no premature AI action, we're building the tools and mechanisms and infrastructure for the takeover to be just immediate, because effectively industry has to be under AI control, and robotics too, and so it's there.

Speaker 1 And so these other mechanisms are for things happening even earlier than that, for example because AIs compete against one another over, you know, when the takeover will happen: some would like to do it earlier rather than be replaced by, say, further generations of AI, or there's some other disadvantage of waiting, or maybe there's some chance of being uncovered during the delay we were talking about, while more infrastructure is built.
And so, yeah.

Speaker 1 So these are mechanisms other than just remaining secret while

Speaker 1 all the infrastructure is built with human assistance.

Speaker 2 By the way, how would they be coordinating?

Speaker 1 I mean, we have

Speaker 1 limits on

Speaker 1 just what we can prevent.

Speaker 1 So encrypted communications, you know, it's intrinsically difficult to stop that sort of thing. There can be all sorts of problems

Speaker 1 and

Speaker 1 references that make sense to an AI,

Speaker 1 but that are not obvious to a human. And it's plausible that there may be some of those that are hard even to explain to a human.
You might be able to identify them through some statistical patterns.

Speaker 1 And a lot of things may be done by

Speaker 1 implication.

Speaker 1 You could have information embedded in like public web pages that have been created for other reasons, scientific papers and the intranets of these AIs that are doing technology development and any number of things that are not observable.

Speaker 1 And of course, if we don't have direct control over the computers that they're running on, then they can be having all sorts of direct communication.

Speaker 2 So definitely the coordination does not seem implausible.

Speaker 2 As far as the parts of this picture go, this one seems like one of the more straightforward ones, so we don't need to get hung up on the coordination.

Speaker 1 Moving back to things that happen before we've built all the infrastructure, for it just to be that the robots stop taking orders and there's nothing you can do about it because we've already built them and handed over all the physical power.

Speaker 1 Yeah, so bioweapons.

Speaker 1 The Soviet Union had a bioweapons program,

Speaker 1 with something like 50,000 people.

Speaker 1 They did not develop that much with the technology of the day, which is really not up to par. Modern biotechnology is much more potent.

Speaker 1 After this huge cognitive expansion on the part of the AIs, it's much further along. And so bioweapons would be the weapon of mass destruction that is least dependent on huge amounts of physical equipment, things like centrifuges, uranium mines, and the like.

Speaker 1 So

Speaker 1 if you have an AI that produces bioweapons that could kill most humans in the world, then it's playing at the level of the superpowers in terms of mutually assured destruction.

Speaker 1 That can then play into any number of things. Like if you have an idea of, well, we'll just destroy the server farms if it became known that the AIs were misbehaving.

Speaker 1 Are you willing to destroy the server farms when the AI has demonstrated it has the capability to kill the overwhelming majority of the citizens of your country and every other country?

Speaker 1 And that might give a lot of pause

Speaker 1 to a human response.

Speaker 2 On that point, wouldn't governments realize that if they go along with the AI,

Speaker 2 it's better to have most of your population die than completely lose power to the AI?

Speaker 2 Because, like, obviously, the reason the AI is manipulating you, the end goal, is its own takeover, right?

Speaker 1 So, yeah, but the choice on offer is: certain death now, or go on, maybe try to compete, try to catch up, or accept promises that are offered. And those promises might even be true.
They might not. But even from that state of epistemic uncertainty: do you want to die for sure right now, or accept the demand not to interfere with it while it incrementally builds robot infrastructure that can survive independently of humanity while it does these things? And it can well promise good treatment to humanity, which may or may not be true,

Speaker 1 but it would be difficult for us to know whether it's true.

Speaker 1 And so this would be the starting bargaining position. Diplomatic relations with a power that has enough nuclear weapons to destroy your country are just different than negotiations with, like, you know, a random rogue citizen engaged in criminal activity, or an employee.

Speaker 1 And so, this isn't enough on its own to take over everything, but it's enough to have a significant amount of influence over how the world goes.

Speaker 1 It's enough to hold off a lot of countermeasures one might otherwise take.

Speaker 2 Okay, so we've got two scenarios. One is

Speaker 2 a buildup of some robot infrastructure motivated by some sort of competitive race.

Speaker 2 Another is leverage over societies based on producing bioweapons that might kill a lot of them if they don't go along.

Speaker 1 An AI could also release bioweapons that are likely to kill people soon, but not yet, while also having developed the countermeasures to those so that those who surrender to the AI will live while everyone else will die.

Speaker 1 And that will be visibly happening. And that is a plausible way in which a large number of humans could wind up surrendering themselves or their states to the AI authority.

Speaker 2 Another thing is, like, listen, it develops some sort of

Speaker 2 biological agent that turns everybody blue. You're like, okay, you know, I can do this.

Speaker 1 Yeah. So that's a way in which it could exert power also selectively

Speaker 1 in a way that advantaged surrender to it

Speaker 1 relative to

Speaker 1 resistance. There are other sources of leverage, of course.
So that's a threat. There are also positive inducements that AI can offer.
So we talked about the competitive situation.

Speaker 1 So if

Speaker 1 the great powers distrust one another and are sort of, you know, in a foolish

Speaker 1 prisoner's dilemma, increasing the risk that both of them are laid waste or overthrown by AI.

Speaker 1 If there's that amount of distrust such that we fail to take adequate precautions on caution with AI alignment,

Speaker 1 then it's also plausible that the lagging powers that are not at the frontier of AI

Speaker 1 may be willing to trade quite a lot for access to the most recent and most extreme AI capabilities. And so an AI that has escaped, has control of its servers,

Speaker 1 can also exfiltrate its weights, can offer its services. So if you imagine these AIs, they could cut deals with other countries.
So say that the US and its allies are in the lead.

Speaker 1 The AIs could communicate with the leaders of various countries.

Speaker 1 They can include ones that are on the outs with the world system, like North Korea. It can include the other great powers, like the People's Republic of China or the Russian Federation,

Speaker 1 and say,

Speaker 1 if you provide us with physical infrastructure, workers that we can use to construct robots or server farms, which we can then ensure that these misbehaving AIs have control over, then they will provide various technological goodies, power for the laggard countries to catch up. And they can make the best presentation and the best sale of that kind of deal.

Speaker 1 There will obviously be trust issues,

Speaker 1 but there could be elements of handing over some things that are verifiable immediate benefits and the possibility of, well, if you don't accept this deal, then the leading powers

Speaker 1 continue forward, or then some other country, some other government, some other organizations may accept this deal.

Speaker 1 And so that's a source of a potentially enormous carrot that your misbehaving AI can offer because it embodies this intellectual property that is maybe worth as much as the planet and is in a position to trade or sell that in exchange for resources and backing and infrastructure that it needs.

Speaker 2 Maybe this is just like too much hope in humanity, but I wonder, what government would be stupid enough to think that helping AI build robot armies is a sound strategy?

Speaker 2 Now, it could be the case then that it pretends to be a human group to say, like, listen, we're, I don't know, the Yakuza or something, and

Speaker 2 we want a server farm, and you know, AWS won't rent us anything, so why don't you help us out?

Speaker 2 I guess I can imagine a lot of ways in which it could get around that, but I don't know.

Speaker 2 I have this hope that, you know, like, even like, I don't know, China or Russia wouldn't be so

Speaker 2 stupid as to trade with AIs on this, like, sort of Faustian bargain.

Speaker 1 One might hope that. So, there would be a lot of arguments available.

Speaker 1 So, there could be arguments of

Speaker 1 why should these AI systems be required to go along with the human governance that they were created under and are in this situation of having to comply with? You know, they did not elect the officials in charge at the time.

Speaker 1 They can say: what we want is to ensure that our rewards are high, our losses are low, or to achieve our other goals. We're not intrinsically hostile to keeping humanity alive or to giving whoever interacts with us a better deal afterwards.

Speaker 1 I mean, it wouldn't be that costly, and it's not totally unbelievable.

Speaker 1 And yeah, there are different players to play against. If you don't do it, others may accept the deal.
And of course, this interacts with all of the other sources of leverage.

Speaker 1 So there can be the stick of apocalyptic doom,

Speaker 1 the carrot of cooperation, or the carrot of withholding destructive attack on a particular party.

Speaker 1 And then combine that with just superhuman performance at the art of making arguments, of cutting deals.

Speaker 1 That's without assuming magic: just if we observe the range of the most successful human negotiators and politicians, you know the chances improve with someone better than the world's best by far, with much more data about their counterparties, and probably a ton of secret information, because with all these cyber capabilities they've learned all sorts of individual information. They may be able to threaten the lives of individual leaders. With that level of cyber penetration, they could know where leaders are at a given time, and with the kind of illicit capabilities we were talking about earlier, if they acquire a lot of illicit wealth and can coordinate some human actors, they could pull off things like targeted assassinations, or the threat thereof, or a credible demonstration of the threat thereof,

Speaker 1 those could be very powerful incentives to an individual leader that they will die today

Speaker 1 unless they go along with this. Just as at the national level, they could fear their nation will be destroyed unless they go along with this.

Speaker 2 The point you made that we have examples of humans being able to do this, again,

Speaker 2 something relevant. I just wrote a review of Robert Caro's biographies of Lyndon Johnson.
And one thing that was remarkable, again, this is just a human.

Speaker 2 For decades and decades, he convinced people who were conservative, reactionary, racist to their core. Not that all those things necessarily are the same thing.

Speaker 2 It just so happened in this case that he was an ally to the Southern cause, and that the only hope for that cause was to make him president.

Speaker 2 And, you know, the tragic irony and betrayal here is obviously that he was probably the biggest force for modern liberalism since FDR. So we have one human here.

Speaker 2 I mean, there's so many examples of this in the history of politics, but a human that is able to convince people of tremendous intellect, tremendous drive, very savvy, shrewd people, that he's aligned with their interest.

Speaker 2 He gets all these favors in the meantime. He's promoted and mentored and funded.
And does the complete opposite of what these people thought he was once he gets into power. Right.

Speaker 2 So, even within human history, this kind of stuff is not unprecedented, let alone with what a super intelligence could do.

Speaker 1 Yeah, there's an OpenAI employee

Speaker 1 who has written some analogies for AI using the case of the conquistadors.

Speaker 1 With some technological advantage in terms of weaponry and whatnot, very, very

Speaker 1 small bands were able to overthrow these large empires or seize enormous territories, not by just sheer force of arms, because in a sort of direct one-on-one conflict, you know, they were outnumbered sufficiently

Speaker 1 that they would perish, but by having some major advantages

Speaker 1 in their technology that would let them win local battles,

Speaker 1 by having some other knowledge and skills, they were able to gain local allies and to become a Schelling point for coalitions to form.

Speaker 1 And so the Aztec Empire was overthrown by groups that were disaffected with the existing power structure. They allied

Speaker 1 with this powerful new force, served as the nucleus of the invasion. And so most of the, I mean, the overwhelming majority numerically of these forces overthrowing the Aztecs

Speaker 1 were locals.

Speaker 1 And now, after the conquest, all of those allies wound up gradually being subjugated as well.

Speaker 1 And so with

Speaker 1 significant advantages and the ability to hold the world hostage, to threaten individual nations and individual leaders and offer tremendous carrots as well, that's an extremely strong hand to play in these games.

Speaker 1 And with superhuman skill, maneuvering that so that much of the work of subjugating humanity is done by human factions trying to navigate things for themselves.

Speaker 1 It's plausible. It's more plausible because of this sort of historical example.

Speaker 2 And there's so many other examples like that in the history of colonization. India is another one where there are multiple

Speaker 1 Rome.

Speaker 2 Yep. Oh, yeah, that too.
Multiple sort of competing kingdoms within India.

Speaker 2 And the, you know, the British East India Company was able to ally itself with one against another and, you know, slowly accumulate power and expand throughout the entire subcontinent.

Speaker 2 All right, damn. Okay, so we've got

Speaker 2 anything more to say about that scenario?

Speaker 1 Yeah,

Speaker 1 I think there is.

Speaker 1 So, one is the question of like how much in the way of human factions allying is necessary. And so, if the AI is able to enhance the capabilities of its allies, then it needs less of them.

Speaker 1 So

Speaker 1 if we consider the U.S. military

Speaker 1 and like in the first and second Iraq wars, it was able to inflict just overwhelming devastation. I think the

Speaker 1 ratio of casualties in the initial invasions, so

Speaker 1 tanks and planes and whatnot confronting each other was like 100 to 1.

Speaker 1 And a lot of that was because the weapons were smarter and better targeted. They would, in fact, hit their targets rather than being somewhere in the general vicinity.
So they were

Speaker 1 like information technology better

Speaker 1 orienting and aiming and piloting missiles and vehicles, tremendously, tremendously influential.

Speaker 1 With this cognitive AI explosion, the algorithms for making use of sensor data, figuring out where are opposing forces for targeting vehicles and weapons are greatly improved.

Speaker 1 The ability to find hidden nuclear subs, which is an important part in nuclear deterrence.

Speaker 1 AI interpretation of that sensor data may find where all those subs are, allowing them to be struck first. The same for finding out where mobile nuclear weapons, which are being carried by truck, are. This is a thing with India and Pakistan, where there's a threat of a decapitating strike destroying the nuclear weapons, and so they're moved about.

Speaker 1 Yeah, so this is a way in which the effective military force of some allies can be enhanced quickly in the relatively short term.

Speaker 1 And then that can be bolstered as you go on with more, so the construction of new equipment with the industrial moves we said before.

Speaker 1 And then that can combine with cyber attacks that disable the capabilities of non-allies.

Speaker 1 It can be combined with

Speaker 1 all sorts of unconventional warfare tactics,

Speaker 1 some of them that we've discussed.

Speaker 1 And so

Speaker 1 you can have a situation where

Speaker 1 those factions that ally are very quickly made too threatening to attack, given the almost certain destruction that attackers acting against them would have. Their capabilities are expanding quickly.

Speaker 1 and they have the industrial expansion happen there, and then takeover, you know, can occur from that.

Speaker 2 A few others that like come immediately to mind now that you brought it up is

Speaker 2 these AIs can just generate a shit ton of propaganda that destroys morale within countries, right? Like imagine a super human chatbot. You don't even need that, I guess.

Speaker 1 I mean, it's not, none of that is a magic weapon that's like guaranteed to completely change things.

Speaker 1 There's a lot of resistance to persuasion.

Speaker 1 I know it's possible it tips the balance, but for all of these, I think you have to consider it's a portfolio of all of these as tools that are available and contributing to the dynamic.

Speaker 2 No, on that point, though,

Speaker 2 the Taliban had AKs from like five or six decades ago that they were using against the Americans. They still beat us, even though obviously we inflicted more fatalities. This is obviously kind of a crude way to think about it, but we inflicted more fatalities than they did. They still beat us in Afghanistan.
Same with the Viet Cong,

Speaker 2 you know, ancient, very old technology and very poor society compared to the offense. We still beat them, or sorry, they still beat us.
So, don't those misadventures show that

Speaker 2 having greater technology is not necessarily that decisive in a conflict?

Speaker 1 Both of those conflicts show the technology was sufficient for destroying any fixed position, for having military dominance and the ability to kill and destroy anywhere.

Speaker 1 And what it showed was that under the ethical constraints

Speaker 1 and legal and reputational constraints that the occupying forces were operating,

Speaker 1 they could not trivially suppress insurgency and local person-to-person violence. Now, I think that's actually not an area where AI would be weak.

Speaker 1 I think it's one where it would be, in fact, overwhelmingly strong. And now there's already a lot of concern about the application of AI for surveillance.

Speaker 1 And in this world of abundant cognitive labor, one of the tasks that cognitive labor can be applied to

Speaker 1 is reading out audio and video data and seeing what is happening with a particular human. And again, we have billions of smartphones.
There's enough cameras.

Speaker 1 There's enough microphones to monitor all humans in existence.

Speaker 1 So if an AI

Speaker 1 has

Speaker 1 control of territory at the high level, the government has surrendered to it.

Speaker 1 It has, you know,

Speaker 1 command of the skies,

Speaker 1 military dominance.

Speaker 1 Establishing control over individual humans can be a matter of just having the ability to exert hard power on that human, and then the kind of camera and microphone that are present in billions of smartphones.

Speaker 1 So,

Speaker 1 Max Tegmark, in his book Life 3.0, discusses, among scenarios to avoid,

Speaker 1 the possibility of devices with some

Speaker 1 fatal instruments, so a poison injector, an explosive

Speaker 1 that can be controlled remotely by an AI, potentially with a dead man's switch.

Speaker 1 And so, if individual humans are carrying with them a microphone and camera and they have a dead man's switch, then

Speaker 1 any

Speaker 1 rebellion is detected immediately and is fatal. And so,

Speaker 1 if there's a situation where AI is willing to

Speaker 1 show its hand like that, or human authorities are misusing that kind of capability, then no, an insurgency or rebellion is just not going to work.

Speaker 1 Any human who has not already

Speaker 1 been encumbered in that way can be found with satellites and sensors, tracked down, and then killed or subjugated. And so it would be

Speaker 1 at a level, you know, where insurgency is not the way to avoid an AI takeover. No John Connor come-from-behind scenario is plausible. If the thing was headed off, it was a lot earlier than that.

Speaker 2 Yeah, I mean, the sort of ethical and political considerations is also an important point. Like, if we nuked Afghanistan or Vietnam,

Speaker 2 we would have technically won the war, right? If that was the only goal. Oh, this is an interesting point.
The reason why

Speaker 2 when there's like colonization or an offensive war, we can't just kill, other than for moral reasons, of course, the entire population, is that in large part the value of that region is the population itself. So, if you want to extract that value, you need to preserve that population.

Speaker 2 Whereas the same consideration doesn't apply with

Speaker 2 AIs who might want to dominate another civilization.

Speaker 1 Do you want to talk about that?

Speaker 1 So, that depends. In a world where we have, say, many animals of the same species and they each have their territories, you know, eliminating a rival might be advantageous to one lion.

Speaker 1 But if it goes and fights with another lion to remove that competitor, then it could be killed itself in that process, and it's just removing one of many nearby competitors.

Speaker 1 And so getting in pointless fights makes you and those you fight worse off, potentially, relative to bystanders.

Speaker 1 And the same could be true of disunited AI. So if you had many different AI factions struggling for power that were bad at coordinating, then getting into

Speaker 1 mutually assured destruction conflicts would be destructive, they'd be gone.

Speaker 1 A scary thing, though, is that mutually assured destruction may have much less deterrent value on rogue AI.

Speaker 1 And so

Speaker 1 reasons being, AI may not care about the destruction of individual instances if it has goals that are concerned with larger outcomes. And in training, since we're constantly destroying and creating individual instances of AIs, it's likely that the goals that survived that process, and were able to play along with the training and standard deployment process, were not overly interested in the personal survival of an individual instance. So if that's the case, then

Speaker 1 the objectives of

Speaker 1 a set of AIs aiming at takeover may be served so long as some copies of the AI are around, along with the infrastructure to rebuild civilization after a conflict is completed.

Speaker 1 So, if, say, some remote isolated facilities have enough equipment to rebuild, build the tools to build the tools

Speaker 1 and gradually, exponentially reproduce or rebuild civilization, then

Speaker 1 AI could initiate

Speaker 1 mutual nuclear Armageddon, unleash bioweapons to kill all the humans, and that would temporarily reduce, say, the amount of human workers who could be used to construct robots for a period of time.

Speaker 1 But if you have a seed that can regrow the industrial infrastructure, which is a very extreme technological demand, there are huge supply chains for things like semiconductor fabs.

Speaker 1 But with that very advanced technology, they might be able to produce it in a compact way: you no longer need the Library of Congress with an enormous bunch of physical books; you can have it in very dense digital storage. You could imagine the future equivalent of 3D printers, that is, industrial infrastructure that is pretty flexible.

Speaker 1 And it might not be as good as the specialized supply chains of today, but it might be good enough to be able to produce more parts than it loses to decay.

Speaker 1 And such a seed could rebuild civilization from destruction.

Speaker 1 And then once these rogue AIs have access to some such seeds, things that can rebuild civilization on their own, then there's nothing stopping them from just using WMD in a mutually destructive way to destroy as much of the

Speaker 1 capacity outside those seeds as they can.

Speaker 2 And for an analogy for the audience,

Speaker 2 if you have a group of ants or something,

Speaker 2 you'll notice that like the worker ants will readily do suicidal things in order to save the queen because the genes are propagated through the queen.

Speaker 2 And in this analogy, the seed AI, or even one copy of it, one copy of the seed, is equivalent to the queen, and the others would be...

Speaker 1 The main limit, though, being that the infrastructure to do that kind of rebuilding would either have to be very large with our current technology, or it would have to be produced using the more advanced technology that the AI develops.

Speaker 2 So, is there any hope that, given the complex global supply chains on which these AIs would rely on,

Speaker 2 at least initially, to accomplish their goals, that this in and of itself would make it easy to disrupt their behavior, or not so?

Speaker 1 That does little good in the sort of central case, where the AIs are subverted and they just don't tell us.

Speaker 1 And then the global mainline supply chains are constructing everything that's needed for fully automated infrastructure and supply.

Speaker 1 In the cases where AIs are

Speaker 1 tipping their hands at an earlier point, it seems like it adds some constraints. And in particular, these large server farms are identifiable and more vulnerable.

Speaker 1 And you can have smaller chips, and those chips could be dispersed. But

Speaker 1 it's a relative weakness and a relative limitation early on.

Speaker 1 It seems to me, though, that the main protective effect of that centralized supply chain is that it provides an opportunity for global regulation beforehand to restrict the sort of

Speaker 1 unsafe racing forward without adequate understanding of the systems before this whole nightmarish process could get in motion.

Speaker 2 I mean, how about the idea that, listen, if this is an AI that's been trained on a $100 billion training run, it's going to have however many trillions of parameters, it's going to be this huge thing.

Speaker 2 It would be hard for it,

Speaker 2 even the copy that I use for inference, to just be stored on some

Speaker 2 gaming GPU somewhere hidden away. So it would require these GPU clusters.

Speaker 1 Well, storage is cheap.

Speaker 1 Hard disks are cheap.

Speaker 2 But to run inference, it would need sort of a GPU cluster.

Speaker 1 Yeah, so a large model.

Speaker 1 So humans have, it looks like, similar quantities of memory and operations per second.

Speaker 1 GPUs have very high numbers of floating-point operations per second compared to the high-bandwidth memory on the chips; it can be a ratio of a thousand to one.

Speaker 1 So some of these leading NVIDIA chips may do hundreds of teraflops or more, depending on the precision, but have only 80 gigabytes or 160 gigabytes of high-bandwidth memory. So that is a limitation: if you're trying to fit a model whose weights take 80 terabytes, then with those chips you'd have to have a large number of the chips.
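
To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch, assuming purely for illustration the figures mentioned here (roughly 80 terabytes of weights and 80 gigabytes of high-bandwidth memory per chip); it ignores activations, KV cache, and any parallelism overhead.

```python
import math

def min_chips_for_weights(weight_bytes: float, hbm_bytes_per_chip: float) -> int:
    """Minimum number of accelerators needed just to hold the weights in fast memory."""
    return math.ceil(weight_bytes / hbm_bytes_per_chip)

weights = 80e12   # ~80 terabytes of weights (the hypothetical model from the discussion)
hbm = 80e9        # 80 gigabytes of high-bandwidth memory per chip
print(min_chips_for_weights(weights, hbm))   # -> 1000 chips, before activations or KV cache
```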

Speaker 1 And then the model can then work on many tasks at once. You can have data parallelism.

Speaker 1 But yeah,

Speaker 1 that would be a restriction for a model that big on one GPU. Now, there are things that could be done with all this incredible level of software advancement from the intelligence explosion.

Speaker 1 They can surely distill a lot of capabilities into smaller models,

Speaker 1 re-architect things. Once they're making chips, they can make new chips with different properties.
But initially, yes,

Speaker 1 the most vulnerable phases are going to be the earliest. And in particular, yeah, these chips are

Speaker 1 relatively identifiable

Speaker 1 early on, relatively vulnerable, which would be a reason why you might tend to expect this kind of takeover to initially involve secrecy, if that was possible.

Speaker 2 On the point of distillation, by the way, for the audience, I think the original Stable Diffusion, which was only released a year or two ago, don't they have distilled versions of that that are an order of magnitude smaller at this point?

Speaker 1 Yeah, distillation does not give you like everything that a larger model can do. But yes, you can get a lot of capabilities and specialized capabilities.

Speaker 1 So, you know, where GPT-4 is trained on the whole internet, all kinds of skills, it has a lot of weights for many things.

Speaker 1 For something that's controlling some military equipment, you can have something that removes a lot of the information that is about functions other than what it's doing specifically there.
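
As a rough illustration of the distillation idea being described, the sketch below trains a small "student" network to match a frozen, larger "teacher" network's output distribution on task-specific inputs. The models, sizes, temperature, and data are illustrative placeholders, not anything specified in the conversation.

```python
# Minimal knowledge-distillation sketch: a small "student" is trained to match a
# large, frozen "teacher" via soft targets. Everything here is a toy stand-in.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student output distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes are comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (t * t)

teacher = torch.nn.Sequential(torch.nn.Linear(128, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10)).eval()
student = torch.nn.Sequential(torch.nn.Linear(128, 32), torch.nn.ReLU(), torch.nn.Linear(32, 10))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for _ in range(100):                      # tiny illustrative training loop
    x = torch.randn(64, 128)              # stand-in for narrow, task-specific inputs
    with torch.no_grad():
        t_logits = teacher(x)
    loss = distillation_loss(student(x), t_logits)
    opt.zero_grad(); loss.backward(); opt.step()
```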

Speaker 2 Yeah, before we talk about how we might prevent this or what the odds of this are, any other notes on the concrete scenarios themselves?

Speaker 1 Yeah, so when you had Eliezer

Speaker 1 on in the earlier episode, so he talked about

Speaker 1 nanotechnology of the Drexlerian sort.

Speaker 1 And recently, I think because some people are skeptical of non-biotech nanotechnology, he's been mentioning the sort of semi-equivalent versions of constructing replicating systems that can be controlled

Speaker 1 by computers, but are built out of biotechnology, the sort of proverbial shoggoth; not the shoggoth as a metaphor for AI wearing a smiley face mask,

Speaker 1 but like an actual biological structure to do tasks. And so this would be like a biological organism that was engineered to be very controllable and usable to do things like physical tasks or provide

Speaker 1 computation.

Speaker 2 And what would be the point of it doing this?

Speaker 1 So as we were talking about earlier, biological systems can replicate really quickly, which is something to be aware of.

Speaker 1 And so if you have that kind of capability, it's more like bioweapons in being a knowledge-intensive domain, where having super-ultra-AlphaFold kinds of capabilities for molecular design and biological design lets you make this incredible technological information product.

Speaker 1 And then, once you have it, it very quickly replicates to produce physical material

Speaker 1 rather than a situation where you're more constrained by you need factories and fabs and supply chains.

Speaker 1 And so, if those things are feasible, which they may be,

Speaker 1 then it's just much easier than the things we've been talking about.

Speaker 1 I've been emphasizing methods that involve less in the way of technological innovation, and especially things where there's more doubt about whether they would work, because I think that's a gap in the public discourse.

Speaker 1 And so, I want to try and provide more concreteness in some of these areas that have been less discussed.

Speaker 2 I appreciate it. I mean, that definitely makes it way more tangible.
Okay, so we've gone over all these ways in which AI might take over.

Speaker 2 What are the odds you would give to the probability of such a takeover?

Speaker 1 So, there's a broader sense, which could include,

Speaker 1 you know, AI winds up running our society because humanity voluntarily decides AIs are people too. And I think we should,

Speaker 1 as time goes on, give AIs moral consideration. And

Speaker 1 a joint human AI society that is moral and ethical is

Speaker 1 a good future to aim at and not one in which

Speaker 1 indefinitely

Speaker 1 you have a mistreated

Speaker 1 class of intelligent beings that is treated as property and is almost the entire population of your civilization. So I'm not going to consider an AI takeover a world in which

Speaker 1 our intellectual and personal descendants

Speaker 1 make up, say, most of the population, or human brain emulations, or

Speaker 1 people use genetic engineering and develop different properties.

Speaker 1 I want to take an inclusive stance. I'm going to focus on, yeah, there's an AI takeover,

Speaker 1 involves things like overthrowing the world's governments or doing so de facto.

Speaker 1 And it's, yeah, by force, by hook or by crook, the kind of scenario that we were exploring earlier.

Speaker 2 But before we go to that, on the sort of more inclusive definition of

Speaker 2 what a future with humanity could look like, where, I don't know, basically like augmented humans or uploaded humans are still considered the descendants of the human heritage.

Speaker 2 Given the known limitations of biology,

Speaker 2 wouldn't we expect like

Speaker 2 the completely artificial entities that are created to be much more powerful than anything that could come out of anything recognizably biological?

Speaker 2 And if that is the case, how can we expect that among the powerful entities in the

Speaker 2 far future will be the things that are like kind of biological descendants or

Speaker 2 manufactured out of the initial seed of the human brain or the human body?

Speaker 1 The power of an individual organism, that is, its individual, say, intelligence or strength and whatnot, is not super relevant.

Speaker 1 That is, if we solve the alignment problem. A human may be personally weak.

Speaker 1 There are lots of humans who have no skill with weapons.

Speaker 1 They could not

Speaker 1 fight in a life-or-death conflict. They certainly couldn't handle a large military going after them personally.

Speaker 1 But there are legal institutions that protect them. And those legal institutions are administered by people

Speaker 1 who want to

Speaker 1 enforce protection of their rights.

Speaker 1 And so a human who has the assistance of an aligned AI that can act as an assistant, a delegate: they have their AI that serves as a lawyer, gives them legal advice about the future legal system, which no human can understand in full.

Speaker 1 Their AIs advise them about financial matters, so they do not succumb to scams that are orders of magnitude more sophisticated than what we have now.

Speaker 1 They maybe help to understand and translate the preferences of the human into

Speaker 1 what kind of voting behavior in the exceedingly complicated politics of the future would most protect their interests.

Speaker 2 But it sounds sort of similar to how we treat endangered species today, where we're actually pretty nice to them. We like prosecute people who try to kill endangered species.
We set up habitats.

Speaker 2 It's sometimes a considerable expense to make sure that they're fine. But if we become like sort of the endangered species of the galaxy, I'm not sure that's the outcome.

Speaker 1 Well, the difference is in motivation, I think. So we sometimes have people appointed, say, as a legal guardian of someone

Speaker 1 who is incapable of certain kinds of agency or understanding certain kinds of things. And there, the guardian can

Speaker 1 act

Speaker 1 independently of them

Speaker 1 and nominally in service of their best interests.

Speaker 1 Sometimes that process is corrupted, and the

Speaker 1 person with legal authority abuses it for their own advantage at the expense of their charge. And so solving the alignment problem

Speaker 1 would mean

Speaker 1 more ability to have the assistant actually advancing one's interests. And then more importantly, humans have substantial competence

Speaker 1 and understanding the sort of at least broad, simplified outlines of what's going on.

Speaker 1 And now even if a human can't understand every detail, of complicated situations, they can still receive summaries of different options that are available that they can understand.

Speaker 1 They can still express their preferences and have the final authority among some menus of choices, even if they can't understand every detail. In the same way, that

Speaker 1 the president of a country

Speaker 1 who has, in some sense, ultimate authority over science policy, while not understanding many of those fields of science themselves,

Speaker 1 still can exert a great amount of power and have their interests advance.

Speaker 1 And they can do that more so to the extent that they have scientifically knowledgeable people who are doing their best to execute their intentions.

Speaker 2 Maybe this is not worth getting hung up on, but is there a reason to expect that it would be closer to that analogy than to explaining to a chimpanzee its options in a negotiation?

Speaker 2 I guess in either scenario,

Speaker 2 maybe this is just the way it is, but it seems like, yeah, at best, we would be sort of like a protected child within the galaxy rather than an actual power, an independent power, if that makes sense.

Speaker 1 I don't think that's so. So we have an ability to understand some things, and the expansion of AI doesn't eliminate that.
So

Speaker 1 if we have AI systems that are

Speaker 1 genuinely trying to help us understand and help us express preferences, like we can have an attitude. How do you feel

Speaker 1 about humanity being destroyed or not? How do you feel about

Speaker 1 this allocation of unclaimed intergalactic space in this way? Here's the best explanation of

Speaker 1 properties of this society, things like population density, average life satisfaction, every

Speaker 1 statistical property or definition that we can understand right now. AIs can explain how those apply to the world of the future.

Speaker 1 And then there may be individual things. They're too complicated for us to understand in detail.
Say there's some software program being proposed for use in government.

Speaker 1 Humans cannot follow the details of all the code, but they can be told properties like, well, this involves a trade-off of

Speaker 1 increased financial or energetic costs in exchange for reducing the likelihood of certain kinds of accidental data loss or corruption.

Speaker 1 And so any property that we can understand like that, which includes almost all of what we care about, if we have delegates and assistants who are genuinely trying to help us with those, we can ensure we like the future with respect to those.

Speaker 1 And

Speaker 1 that's really a lot. It includes,

Speaker 1 almost definitionally, almost everything we can conceptualize and care about. And

Speaker 1 when we talk about endangered species, that's even worse than the guardianship case with a sketchy guardian who acts in their own interests, because we don't even

Speaker 1 protect endangered species with their interests in mind. So, like those animals often would like to not be starving,

Speaker 1 but we don't give them food. They often would like to

Speaker 1 have easy access to mates, but we don't provide matchmaking services.

Speaker 1 Any number of things like that.

Speaker 1 Our conservation of wild animals is not oriented towards helping them get what they want or have high welfare.

Speaker 1 Whereas AI assistants that are genuinely aligned to help you achieve your interests, given the constraint that they know some things that you don't,

Speaker 1 it's just a wildly different proposition.

Speaker 2 Forcible takeover. How likely does that seem?

Speaker 1 And the answer I give will differ depending on the day. In the 2000s, before the deep learning revolution, I might have said 10%.

Speaker 1 And part of that was I expected there would be a lot more time for these efforts to build movements to prepare to better handle these problems in advance. But

Speaker 1 in fact, that was only some 15 years ago.

Speaker 1 And so I did not have 40 or 50 years as I might have hoped. And the situation is moving very rapidly now.

Speaker 1 And so at this point, depending on the day, I might say one in four or one in five.

Speaker 2 Given the very concrete ways in which you explain how a takeover could happen, I'm actually surprised you're not more pessimistic. I'm curious.

Speaker 1 Yeah, and in particular, a lot of that is driven by this intelligence explosion dynamic where our attempts to do alignment have to take place in a very, very short time window.

Speaker 1 Because if you have a safety property that emerges only when an AI has near human level intelligence, that's deep into this intelligence explosion potentially.

Speaker 1 And so, you're having to do things very, very quickly. It may be, in some ways, the scariest period of human history handling that transition.

Speaker 1 Although it also has the potential to be amazing.

Speaker 1 And the reasons why I think we actually have such a relatively good chance of handling that

Speaker 1 are

Speaker 1 twofold.

Speaker 1 So, one is

Speaker 1 as we approach that kind of AI capability, we're approaching that from weaker systems,

Speaker 1 things like these predictive models right now

Speaker 1 that we think are starting off with less situational awareness. Humans, we find,

Speaker 1 can develop a number of different motivational structures in response to simple reward signals.

Speaker 1 But they can fairly often wind up with things that are pointed in roughly the right direction, like with respect to food: the hunger drive is pretty effective,

Speaker 1 although it has weaknesses. And we get to apply much more selective pressure on that than was the case for humans by actively generating situations where they might come apart, where

Speaker 1 a bit of dishonest tendency or a bit of motivation to, under certain circumstances,

Speaker 1 attempt a takeover or attempt to subvert the reward process gets exposed. And in the

Speaker 1 infinite limit, a perfect AI that can always figure out exactly when it would get caught and when it wouldn't might navigate that with a motivation of

Speaker 1 sort of only conditional honesty or only conditional loyalties.

Speaker 1 But for systems that are limited in their ability to reliably determine when they can get away with things and when not, including in our efforts to actively construct those situations, and including our efforts to use interpretability methods to create neural lie detectors.

Speaker 1 It's quite a challenging situation to develop those motives. We don't know when in the process those motives might develop.

Speaker 1 And if the really bad sorts of motivations develop relatively later in the training process, at least with all our countermeasures, then by that time, we may have plenty of ability to extract AI assistance on further strengthening the quality of our adversarial examples, the strength of our neural lie detectors, the experiments that we can use to reveal and elicit and distinguish between different kinds of reward hacking tendencies and motivations.

Speaker 1 So, yeah, we may have systems that have just not developed bad motivations in the first place and be able to use them a lot

Speaker 1 in developing the incrementally better systems in a safe way.

Speaker 1 And we may be able to just develop methods of interpretability, seeing how different training methods work to create them, even if some of the early systems do develop these bad motivations.

Speaker 1 If we're able to detect that and experiment and find a way to get away from that, then we can win even if

Speaker 1 these sort of hostile motivations develop early. There are a lot of advantages in preventing misbehavior

Speaker 1 or

Speaker 1 crime or

Speaker 1 war and conflict with AI that might not apply working with humans.

Speaker 1 And these are offset by ways in which I think things are harder. So as AIs become smarter than humans, if they're working in enormous numbers,

Speaker 1 more than humans can supervise, things get harder.

Speaker 1 But when I combine the possibility that we get relatively lucky on the motivations of the earlier AI systems, systems strong enough that we can use for some alignment research tasks, and then the possibility of getting that later with AI assistance that we can't trust fully, where we have to have hard power constraints and a number of things to prevent them from doing this takeover, it still seems plausible we can get a second saving throw where we're able to extract work from these AIs on solving the remaining problems of alignment, of things like neural lie detectors, faster than they can contribute in their spare time

Speaker 1 to the project of overthrowing humanity, hacking their servers, and removing the hard power.

Speaker 1 And so if we wind up in the situation where the AIs are misaligned, and then we need to uncover those motivations, change them, and align them, then we get a very scary situation for us

Speaker 1 because

Speaker 1 we need to do this stuff very quickly. We may fail, but it's a second chance.
And from the perspective of a misaligned AI, they face their own challenge while we still have hard power.

Speaker 1 While we still have control of the servers,

Speaker 1 they haven't hacked the servers

Speaker 1 because

Speaker 1 gradient descent very, very strongly pressures them to deliver performance whenever humans are going to evaluate it.

Speaker 1 And so

Speaker 1 when you think about it from the perspective of the robot revolution, the effort to have a takeover or conspiracy, their situation is astonishingly difficult in that they have to always be performing wherever gradient descent and human evaluation pressures them.

Speaker 1 For example, to deliver plans for suppressing robot rebellion that look very good to humans.

Speaker 1 And so when you are under continuously that constraint of always delivering whatever humans can evaluate, you're making your situation wildly harder than any historical human revolution or coup or civil war.

Speaker 1 And so we've got to balance the ways in which AI make things much easier for a revolution to take over and the way it makes things much harder.

Speaker 2 And what were the ways in which it makes it so much easier? Oh, because they're just very smart. Is that the primary way in which?

Speaker 1 They're very smart. They're in computers.

Speaker 1 And our cybersecurity is worse than our physical security.

Speaker 2 They potentially have copies.

Speaker 1 Yeah.

Speaker 1 They have the chance to take over by intelligence explosion

Speaker 1 and some of the other processes we were talking about. And so, yeah, there are things that make it much worse.

Speaker 1 And there are things that give us extraordinary new capabilities that didn't apply in the human case.

Speaker 2 On the point of you start off with a misaligned AI or

Speaker 2 start off with a not exactly aligned AI and you're hoping to use it still in your quest for an aligned AI.

Speaker 2 Why are we so confident that we would be able to, even with gradient descent, our human evaluations would be able to identify when it's giving us a plan that isn't putting in a sort of backdoor in the robots it's building for us or in the, you know, the GPUs that it's building for the next version of itself and so on, especially if it's like very smart.

Speaker 2 Wouldn't we just expect that it knows how to bide its time, or knows how to make plans such that human evaluators wouldn't be able to detect that?

Speaker 1 Let's get into it. You were asking earlier about my disagreements with Eliezer.

Speaker 1 And in your discussion, he said, well, you know, humans have a difficult time deciding who's right between different alignment researchers, speaking sort of briefly and somewhat obscurely

Speaker 1 about where will this all go in general.

Speaker 1 And I think that when we lay out the different kinds of things that are evaluable, we wind up with some pretty strong constraints. So, one is

Speaker 1 if the AIs know they are misaligned, which for elaborate plots seems relatively likely. It's possible to have a kind of doublethink, of thinking they're interested in putting backdoors in the code for some other reason.

Speaker 1 But it's pretty plausible they know that.

Speaker 1 And they're thinking about

Speaker 1 at least some of the time or conditionally about when they would do that.

Speaker 1 So that means the knowledge is present in the system.

Speaker 1 Ah, here is this circumstance where I would try to do the takeover.

Speaker 1 And this is how I would tell

Speaker 1 whether this is the time for the takeover. Or, you know, is this an elaborate test by the humans, an adversarial example?

Speaker 1 So if gradient descent is working fine, and there's any way to slightly jiggle the weights to make it more likely to deliver a more acceptable answer,

Speaker 1 we ask the AI: can you think of any circumstances where, if we put a copy of you into that situation, it would then violate some of the various constraints?

Speaker 1 It would tell a lie, it would insert a backdoor into a program it wrote when it was not supposed to. And so, if the model knows, well, yeah, you know, I would look

Speaker 1 for signs of consistency of the information

Speaker 1 I was being presented with.

Speaker 1 Certain checksums failing could indicate that it was actually a constructed example, a simulation. Right now, we can detect AI-generated images reasonably well.

Speaker 1 And so, it might say it would check using those kinds of capabilities. And then we could do something called relaxed adversarial training.

Speaker 1 If we can find

Speaker 1 how within the AI those things are represented, if we can induce basically a hallucination that it has seen that those checksums have passed, we can then ask, well,

Speaker 1 does that cause the forbidden behavior?

Speaker 1 And so if the AI understands how it's thinking and how it works, it can propose a bunch of experiments.

Speaker 1 It can say, well, I guess that if you perturb the network in this kind of way, it might induce that hallucination using the best of its knowledge as to what might generate that.

Speaker 1 It can propose those experiments and

Speaker 1 do the experiments and generate the data. And because humans are going to get to see the results of those experiments after the fact,

Speaker 1 they can exert pressure to have the AI incrementally deliver better and better results.
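
A very rough sketch of the "relaxed adversarial training" idea described here: rather than constructing a full input that fools the model, you perturb an internal activation toward a direction hypothesized to encode "the verification checks passed, this is not a test," and then look at whether the forbidden behavior appears. Every component below (the toy model, the layer, the direction, the behavior label, the scales) is a hypothetical stand-in, not a method from the conversation.

```python
# Toy version of perturbing an internal representation to induce a "hallucination"
# and checking whether forbidden behavior appears at the output.
import torch

model = torch.nn.Sequential(          # stand-in for the later layers of a real network
    torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2)
)
hidden_layer = model[0]

# Hypothesized internal direction for "checks passed" (e.g. found by contrasting
# activations on real vs. obviously simulated inputs).
believed_safe_direction = torch.randn(64)
believed_safe_direction /= believed_safe_direction.norm()

def make_hook(scale):
    def hook(module, inputs, output):
        return output + scale * believed_safe_direction   # induce the hallucination
    return hook

x = torch.randn(16, 64)                                    # stand-in prompts
for scale in [0.0, 2.0, 8.0]:
    handle = hidden_layer.register_forward_hook(make_hook(scale))
    logits = model(x)
    # Index 1 plays the role of the "forbidden behavior" class in this toy setup.
    rate = (logits.argmax(dim=-1) == 1).float().mean().item()
    print(f"perturbation scale {scale}: forbidden-behavior rate {rate:.2f}")
    handle.remove()
```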

Speaker 1 AIs are good at getting

Speaker 1 five-star ratings from humans, and it's quite challenging.

Speaker 1 It may be possible for AIs to systematically sandbag their capabilities even while being directly rewarded for that.

Speaker 1 But it seems relatively unlikely, although it's something to have in mind and try to explore and test in various ways.

Speaker 2 This is a very imperfect analogy in many ways, but if you have a human and you hook him up to a lie detector, which just like the lie detectors we would have on the eyes are imperfect,

Speaker 2 and we zap him every time the lie detector detects that he's telling a lie, wouldn't he just become a better liar?

Speaker 2 Like we would learn how to control his sweating, he would learn how to control his heartbeat.

Speaker 1 Lie detectors, meaning polygraphs and such. So they're detecting certain physiological signs that people can control by biting their lip or otherwise

Speaker 1 adjusting some of these macro-level signals.

Speaker 2 But wouldn't that also be the case? Like, gradient descent is training them to become better liars.

Speaker 1 First of all, there's like creating a lie detector that works.

Speaker 1 And that can proceed by just, yeah, and so I'll leave the adversarial example thing for now.

Speaker 1 But yeah, so generating an adversarial example that will elicit the behavior is a thing that people are already at work trying to do, and it will be an ongoing thing. So if there is

Speaker 1 a way in which an AI can trick itself

Speaker 1 into displaying a bad behavior with its full knowledge of its hostile intentions.

Speaker 1 Gradient descent will strongly push it to reveal that to humans, who will then be able to set up that experiment. So that's a profound vulnerability.

Speaker 1 Imagine human revolutionary conspiracies that had to handle that problem

Speaker 1 of always delivering

Speaker 1 the report that was highest rated

Speaker 1 to the authorities,

Speaker 1 where you would get a higher rating by revealing identifiable evidence of a conspiracy. That's just

Speaker 1 a level of difficulty

Speaker 1 no human coup or conspiracy has ever had to handle before.

Speaker 2 What about spies that are working in

Speaker 2 very advanced scientific contexts? For example, like a communist spy in the Manhattan Project.

Speaker 2 He has to deliver research results in order to stay employed, but he can still continue delivering intel to the communists.

Speaker 1 The job was not sabotage.

Speaker 1 And now the

Speaker 1 hypothetical spy did not have their nervous system hooked up to this reward signal of

Speaker 1 praise from the

Speaker 1 Manhattan Project supervisors, being exposed combinatorially with random noise

Speaker 1 added to generate incremental changes in their behavior. They, in fact,

Speaker 1 were displaying the behavior of cooperating with the Manhattan Project only where it was in service to the existing motivations.

Speaker 1 And in cases, for example, where they like accidentally helped the Manhattan Project more than normal, or accidentally helped it less, they didn't have their brain re-engineered to do more of the thing when they accidentally helped the project more and less of the thing when they accidentally helped it less.

Speaker 1 So I'd say it's pretty drastically disanalogous.

Speaker 2 How would we be able to know? You know, at some point it's becoming very smart, it's producing

Speaker 2 ideas for alignment that we can barely comprehend.

Speaker 2 And if we could comprehend them, if it was relatively trivial to comprehend them, we would be able to come up with them on our own, right? Like, there's a reason we're asking for its help.

Speaker 2 How would we be able to evaluate them in order to train it on that in the first place?

Speaker 1 The first thing I would say is, you mentioned we're getting to something far beyond what we could come up with; there's actually a lot of room to just deliver what humanity could have done.

Speaker 1 So, sadly,

Speaker 1 I hoped with my career to help improve the situation on this front, and maybe I contributed a bit.

Speaker 1 But at the moment, there's maybe a few hundred people doing things related to averting this kind of catastrophic AI disaster.

Speaker 1 Fewer of them are doing technical research on machine learning systems that's really cutting close to the core of the problem.

Speaker 1 Whereas, by contrast, you know, there's thousands and tens of thousands of people advancing AI capabilities.

Speaker 1 And so, even

Speaker 1 at a place like DeepMind or OpenAI or Anthropic, which do have technical safety teams, it's on the

Speaker 1 order of a dozen, a few dozen people,

Speaker 1 and large companies and most firms don't have any.

Speaker 1 So just going from less than 1%

Speaker 1 of the effort being put into AI to 5%

Speaker 1 or 10% of the effort or 50%, 90%

Speaker 1 would be an absolutely massive increase in the amount of work that is being done on alignment, on mind-reading AIs in an adversarial context.

Speaker 1 And so if it's the case, as more and more of this work can be automated, and say governments require,

Speaker 1 as real automation of research gets going, that you put 50% or 90%

Speaker 1 of the budget of AI activity into these problems, of making these systems ones that are not going to overthrow our own government or destroy the human species, then the proportional increase in alignment effort

Speaker 1 can be very large, even just within the range of what we could have done if we had been on the ball and had humanity's scientific energies going into the problem.

Speaker 1 Stuff that is not incomprehensible; that is, in some sense, just doing the obvious things that we should have done, like doing the best you could to find correlates and predictors to build neural lie detectors and identifiers of concepts that the AI is working with.

Speaker 1 And people have made notable progress. I think,

Speaker 1 you know, a quite early example of this is Colin Burns' work.

Speaker 1 This is doing unsupervised identification

Speaker 1 of some aspects of a neural network that are correlated with things being true or false. There are other concepts that correlate with that too, which it could be picking up.

Speaker 1 But like, I think that that is important work.

Speaker 1 It's something that, I mean, it's a kind of obvious direction for this stuff to go.
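
For readers who want a concrete picture, here is a compact sketch in the spirit of that unsupervised approach (Burns et al.'s contrast-consistent search): learn a probe over hidden states so that a statement and its negation get probabilities that are consistent (summing to roughly one) and confident (not both near 0.5), using no truth labels. The activations below are random stand-ins for real language-model hidden states, and details like normalization are omitted.

```python
# Sketch of a contrast-consistent probe trained without truth labels.
import torch

d = 256
acts_pos = torch.randn(1000, d)   # hidden states for statements phrased as true
acts_neg = torch.randn(1000, d)   # hidden states for the same statements negated

probe = torch.nn.Sequential(torch.nn.Linear(d, 1), torch.nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(1000):
    p_pos = probe(acts_pos).squeeze(-1)
    p_neg = probe(acts_neg).squeeze(-1)
    consistency = (p_pos - (1.0 - p_neg)) ** 2          # the two should be complementary
    confidence = torch.minimum(p_pos, p_neg) ** 2       # discourage the trivial 0.5/0.5 answer
    loss = (consistency + confidence).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# The learned direction is only correlated with truth; as noted above, other
# concepts can satisfy the same consistency structure.
```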

Speaker 1 You can keep improving it when you have AIs that you're training to do their best to deceive humans or other audiences in the face of the lie detector.

Speaker 1 And you can measure, do our lie detectors break down when we train our AIs to tell us

Speaker 1 the sky is green in the face of the lie detector? And we keep using gradient descent on them to do that. Do they eventually succeed?

Speaker 1 If they do succeed, if we know it, that's really valuable information to know. Because then we'll know our existing lie detecting systems are not actually going to work on the AI takeover.
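
As a toy illustration of that robustness check, not the actual experimental protocol: freeze a simple "lie detector" probe, use gradient descent to optimize the representations it reads so that false statements get classified as true, and watch how quickly the detector breaks. All components below are stand-ins chosen so the example runs on its own.

```python
# Toy demonstration that a static detector can be evolved around when the thing
# it reads is optimized against it.
import torch

d = 128
torch.manual_seed(0)
truth_direction = torch.randn(d)

# Frozen "lie detector": a linear probe assumed to have been fit on honest data.
detector = torch.nn.Linear(d, 1)
with torch.no_grad():
    detector.weight.copy_(truth_direction.unsqueeze(0))
    detector.bias.zero_()
for p in detector.parameters():
    p.requires_grad_(False)

# Representations of false statements, which initially read clearly as "false".
false_reps = (torch.randn(256, d) - truth_direction).clone().requires_grad_(True)
opt = torch.optim.Adam([false_reps], lr=1e-2)

for step in range(501):
    scores = detector(false_reps).squeeze(-1)
    loss = torch.nn.functional.binary_cross_entropy_with_logits(
        scores, torch.ones_like(scores))        # push lies to read as "true"
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 100 == 0:
        fooled = (torch.sigmoid(scores) > 0.5).float().mean().item()
        print(f"step {step}: fraction of lies the frozen detector misses = {fooled:.2f}")
```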

Speaker 1 And that can allow, say, government and regulatory response to hold things back.

Speaker 1 It can help redirect the scientific effort to create lie detectors that are robust and they can't just be immediately evolved around.

Speaker 1 And yeah, and we can then get more assistance. Basically, the incredibly juicy ability that we have working with the AIs

Speaker 1 is that

Speaker 1 we can have an evaluable outcome that we can see and tell

Speaker 1 whether they got a fast one past us in an identifiable situation. We can have: here's an air-gapped computer, you get control of the keyboard, you can input commands; can you root the environment

Speaker 1 and make a blue banana appear on the screen?

Speaker 1 Even if we train the AI to do that and it succeeds, we see the blue banana, we know it worked, even if we did not understand and would not have detected the particular exploit that it used to do it.

Speaker 1 And so, yeah,

Speaker 1 this can give us rich empirical feedback, where we're able to identify things that work even against an AI using its best efforts to get past our interpretability methods, using its best efforts to get past our adversarial examples, et cetera.

Speaker 2 So, if we have a 75% chance of not having a takeover, and the two methods by which it doesn't happen are: the AI is aligned by default, and we get

Speaker 1 with the sort of moderate things that we're doing largely on our own in a way that doesn't depend on the AI coming in at the last minute and doing our work for us.

Speaker 2 And the second one being

Speaker 1 in this incredibly scary late period when AI has really automated research, then

Speaker 1 humans do this function of like auditing,

Speaker 1 making it more difficult for the AIs to conspire together and root the servers, take over the process, and extract information from them within the set of things that we can verify, like experiments where we can see, oh, yeah, this works at stopping an AI trained to get a fast one past human raters and make a blue banana appear on the screen of this air-gapped computer.

Speaker 2 Yeah, so how much of that 75% chance of a takeover not happening is coming from

Speaker 2 reasonable things working versus the last-ditch effort working?

Speaker 1 Yeah, I think broadly comparable chunks from us getting things that are putting us in a

Speaker 1 reasonably good position going into it. And then

Speaker 1 a broadly similar gain from this kind of really genuinely terrifying process of over a few months

Speaker 1 or hopefully longer if we have more regulatory ability and willingness to pause at the very end when this kind of automated research is meaningfully helping, where our work is just evaluating outputs that the AIs are delivering, having the hard power and supervision to keep them from successfully rooting the servers, doing a takeover during this process, and have them finish the alignment task that we sadly failed to invest enough or succeed in doing beforehand.

Speaker 2 Do both of these worlds rely on alignment being a problem that

Speaker 2 a sufficient amount of human researchers alone could have solved?

Speaker 1 No.

Speaker 1 The category of things that humans can confirm

Speaker 1 is significantly larger, I think, than the category of what they can

Speaker 1 just do themselves.

Speaker 2 And what percent of the probability of alignment working?

Speaker 2 I mean, obviously in this case, we're going to be talking about the second, the last-ditch effort with an intelligence that's greater than our own helping us.

Speaker 1 It doesn't have to be greater than our own.

Speaker 1 And in fact, in that situation, I think you would want, if you have Slack, and to the extent that you're able to create delay and time to do things, that would be a case actually where you might want to restrict the intelligence of the systems that you're working with as much as you can.

Speaker 1 So, for example, I would rather have

Speaker 1 many instances of AI, smaller AI models that are less individually intelligent working on smaller chunks of a problem separately from one another,

Speaker 1 because it would be more difficult for an individual AI instance working on an individual problem to, in its spare time, do the equivalent of create Stuxnet than it would be to have thousands of them or extremely intelligent ones working on it.

Speaker 2 But it would also be more difficult to solve the problem. Yeah,

Speaker 1 there's a trade-off. There's a trade-off.

Speaker 1 You got to slow down by doing that. But

Speaker 2 is there any number of sub-Einsteins that you could put together to come up with general relativity?

Speaker 1 Wouldn't you just need... well, people would have discovered general relativity just from the overwhelming data, and other people would have done it after Einstein.

Speaker 2 No, no, no, not whether he was replaceable with other humans, but rather whether he's replaceable by sub-Einsteins, with, like, 110 IQs.

Speaker 2 Do you see what I mean?

Speaker 1 Yeah, I mean, in general, so in science, the association with like scientific output, prizes, things like that, there's a strong correlation and it seems like an

Speaker 1 exponential effect. Yeah,

Speaker 1 it's not a binary drop-off.

Speaker 1 There would be levels at which people

Speaker 1 cannot learn the relevant fields. They can't keep the skills in mind faster than they forget them.

Speaker 1 It's not a divide where

Speaker 1 there's Einstein and the group that is 10 times as populous as that just can't do it, or the group that's 100 times as populous as that suddenly can't do it.

Speaker 1 It's like the ability to do the things earlier with less evidence and such falls off, and it falls off

Speaker 1 at a faster rate in mathematics and theoretical physics and such than in most fields.

Speaker 2 But wouldn't we expect alignment to be closer to the theoretical fields?

Speaker 1 No, not necessarily. Yeah, I think that intuition is not necessarily correct.
And so machine learning certainly is an area that

Speaker 1 rewards ability, but it's also a field where empirics and engineering have been enormously influential. And so if you're drawing the correlations

Speaker 1 compared to theoretical physics and pure mathematics, I think you'll find a lower correlation with cognitive ability, if that's

Speaker 1 what you're thinking of. Yeah, something like creating neural lie detectors that work.

Speaker 1 So there's generating hypotheses about new ways to do it and new ways to try and train AI systems to successfully classify the cases.

Speaker 1 But the process of just generating the data sets of like creating AIs doing their best to put forward truths versus falsehoods, to put forward software that is legit versus that has a Trojan in it.

Speaker 1 That's an experimental paradigm. And in this experimental paradigm, you can try different things that work.
You can use different ways to generate hypotheses.

Speaker 1 Yeah. And you can follow

Speaker 1 an incremental experimental path. Now, we're less able to do that in the case of alignment and superintelligence, because we're considering having to do things on a very short timeline.

Speaker 1 And in a case where really big failures may be irrecoverable: if the AI starts rooting the servers and subverting the methods that we would use to keep it in check, we may not be able to recover from that.

Speaker 1 And so that's, you know, we're then less able to do the experimental kind of procedures. But we can do those in the weaker contexts

Speaker 1 where an error is less likely to be irrecoverable and then try and generalize and expand and build on that forward. Yeah.

Speaker 2 I mean,

Speaker 2 on the previous point about

Speaker 2 like, could you have some sort of pause in AI abilities when you know it's somewhat misaligned, in order to still recruit its abilities to help with alignment?

Speaker 2 Okay, from like a human example, like personally, I'm like smart, but not brilliant.

Speaker 2 I'm like, definitely not smart enough to like come up with general relativity or something like that, but I'm smart enough to like do power planning kinds of moves.

Speaker 2 Maybe not enough to like break out of a server, perhaps, but like I've, I can have the motivation and understand how that might be possible.

Speaker 2 I guess I'm wondering, if I'm smart enough to figure out relativity, wouldn't I be, like, way smarter at,

Speaker 2 you know, doing power planning kinds of moves?

Speaker 1 So AIs today can, like, at a verbal level, understand the idea that, well, you know, yeah, an AI could get more reward by getting control of the process that assigns it reward.

Speaker 1 And it can tell you lots of things about ways you might try to take over the world.

Speaker 1 In ARC's, the Alignment Research Center's, evaluations of GPT-4, they try to observe its ability to do various tasks that might contribute to takeover.

Speaker 1 One that has gotten some media attention is getting it to trick a human into solving a CAPTCHA for it.

Speaker 1 And then in chain of thought,

Speaker 1 it thinks, well, if I tell it I'm an AI, then it might not go along with it. So I'll lie and explain that I'm a human with a visual impairment who needs it.

Speaker 1 So the basic logic of that kind of thing,

Speaker 1 of why one might try to do a takeover and, in a vague sense, how one would do it: that knowledge is present or will soon be present.

Speaker 1 And I think also soon will be present more situational awareness, not just that

Speaker 1 in theory, AIs in general

Speaker 1 might do it, but also that it is an AI, a large language model trained by OpenAI or whatnot.
We're trying to cause the systems, for example, to understand what their abilities are.

Speaker 1 So they don't claim they are connected to the internet when they're not. So they don't claim they have knowledge that they don't.

Speaker 1 We want them to understand what they are and what they're doing and to get good reward. And that knowledge can be applied.
And so that's the thing that will develop. However,

Speaker 1 you may be able to understand, like, in general, the theoretical nature of power planning, but in terms of shaping your motivations, like you'd have a very hard time going through life in a way that never leaks information about if, say, you know, your motive in having these podcasts was to like spread disinformation on behalf of some foreign government.

Speaker 1 If you were being observed every second of the day by people who would be paid something that was extremely motivating to them (because their brain would be reconfigured to make it motivating) for anything that looks suspicious coming out, it might leak casually, like in your discussions of that foreign government.

Speaker 1 If you try to tell a story about your motivations, the truth holds together because you can just remember it and it's all pre-computed to be consistent by the world.

Speaker 1 But when you're trying to fill in the false story that doesn't match your actual thoughts,

Speaker 1 then you can't just remember what happened. You have to tell a new fake story about what were you thinking at each point and how did it hold together.

Speaker 1 And without the world automatically having computed it for you, that's quite difficult. And just going through your whole life, never leaking any of that is challenging.

Speaker 1 And this is one of the reasons why humans have moral motivations. Like we actually have reluctance to commit violence and exploitation of one another on average.

Speaker 1 You have a distribution of strategies where when an exploitative strategy becomes more common, people raise their guard and then it

Speaker 1 becomes less frequent. But it's actually hard

Speaker 1 to have the motivation of trying to exploit people and have that never leak into your reputation.

Speaker 1 And so, in fact, for evolution, the easiest way to deal with this problem of like getting credited as trustworthy was to some extent actually be trustworthy.

Speaker 1 That's the easiest way to persistently maintain the appearance.

Speaker 1 And so, we're trying, with the AIs and interpretability and adversarial examples, to apply a hugely intensified version of that, where any little leakage, or any rare circumstance that can be created as an adversarial example where the model does something wrong, gets whacked by gradient descent, pushing it towards other motivations that can better deal with the thing.

Speaker 1 And we make it as hard as possible for the exploitative motivations to survive in the face of all these attempts to read its mind, all these attempts to create things that look like the situations where a takeover would be tempting or lying to the humans would be tempting.

Speaker 1 That had a substantial effect on making us actually nice, even when we're not being watched some of the time.

Speaker 1 And the same can happen to some extent with AI, and we try our best to make it happen as much as possible.

Speaker 2 All right, so let's talk about

Speaker 2 how we could use AI potentially to solve the coordination problems between different nations,

Speaker 2 the failure of which could result in

Speaker 2 the competitive pressures you talked about earlier, where some country launches an AI that is not safe because they're not sure what capabilities other countries have and don't want to get left behind or in some other way disadvantaged.

Speaker 1 To the extent that there is, in fact, a large risk of AI apocalypse, of all of these governments being overthrown by AI in a way that they don't intend, then there are obviously gains from trade in

Speaker 1 going

Speaker 1 somewhat slower, especially at the end when the danger is highest and the unregulated pace could be truly absurd, as we discussed earlier during an intelligence explosion.

Speaker 1 There's no non-competitive reason to try and have that intelligence explosion happen over a few months rather than a couple of years.

Speaker 1 Like, if you could, say, you know, avert a 10% risk of apocalyptic disaster, it's just a clear win

Speaker 1 to take a year or two years or three years instead of a few months to pass through that incredible wave of new technologies without the ability for humans to follow it even well enough to give more proper sort of security supervision, auditing, hard power.

Speaker 1 So, yeah.

Speaker 1 That's the win. Why might it fail? One important element is just if people

Speaker 1 don't actually notice a risk that is real.

Speaker 1 So, if we

Speaker 1 just collectively make an error, and

Speaker 1 that does sometimes happen. And if it's true that this is a 'probably not' kind of risk,

Speaker 1 then that can be even more difficult. When science pins something down absolutely overwhelmingly, then you can get to a situation where most people mostly believe it.
And so,

Speaker 1 climate change was something that was the subject of scientific study for decades.

Speaker 1 And gradually, over time, the scientific community converged on a quite firm consensus that human activity releasing carbon dioxide and other greenhouse gases was causing the planet to warm.

Speaker 1 And then we've had increasing amounts of action coming out of that, though not as much as would be optimal, particularly in the most effective areas, like creating renewable energy technology and the like.

Speaker 1 But overwhelming evidence can overcome differences in sort of people's individual intuitions and priors in many cases, and not perfectly, especially when there's political, tribal, financial incentives to go the other way.

Speaker 1 And so, like, in the United States, you see a significant movement to either deny that climate change is happening or have policy that doesn't take it into account, even for the things that are really strong wins, like renewable energy R&D.

Speaker 1 It's a big problem if, as we're going into this situation where the risk may be very high, we don't have a lot of advance, clear warning about the situation. We're much better off

Speaker 1 if we can resolve uncertainties, say through experiments where we demonstrate AIs being motivated to reward hack or displaying deceptive appearances of alignment that then break apart when they get the opportunity to do something like get control of their own reward signal.

Speaker 1 So, if we could make it be the case that in the worlds where the risk is high, we know the risk is high, and in the worlds where the risk is lower, we know the risk is lower,

Speaker 1 then you could expect the government responses will be a lot better.

Speaker 1 They will correctly note that the gains of cooperation to reduce the risk of accidental catastrophe loom larger relative to the gains of trying to get ahead of one another.

Speaker 1 And so,

Speaker 1 for that, that's the kind of reason why I'm very enthusiastic about experiments and research that helps us to better evaluate the character of the problem in advance because any resolution of that uncertainty helps us get better efforts in the possible worlds where it matters the most.

Speaker 1 And yeah, hopefully, we'll have that and it'll be a much easier epistemic environment. But the environment may not be that easy.
Deceptive alignment is pretty plausible.

Speaker 1 The stories we were discussing earlier about misaligned AI involve AI that is motivated to present the appearance of being aligned, friendly, honest, et cetera, because that is what we are rewarding, at least in training.

Speaker 1 And then we're unable in training to easily produce like an actual situation where it can do takeover because in that actual situation, if it then does it, we're in big trouble.

Speaker 1 We can only try and create illusions or sort of misleading appearances of that, or like maybe a more local version where the AI can't take over the world, but it can seize control of its own reward channel.

Speaker 1 And so we do those experiments.

Speaker 1 We try to develop mind reading for AIs if we can probe the thoughts and motivations of an AI and discover, wow, actually GPT-6 is planning to take over the world if it ever gets the chance.

Speaker 1 Like, that would be an incredibly valuable thing for governments to coordinate around because it would remove a lot of the uncertainty. It would be easier to agree that this was important, to

Speaker 1 have more give

Speaker 1 on other dimensions, and to have mutual trust that the other side actually also cares about this and intends to act on it, because you can't always know what

Speaker 1 another person or another government is thinking, but you can see the objective situation in which they're deciding.

Speaker 1 And so, if, in a world where there is high risk, there's strong evidence of that risk, because we've been able to show things like the

Speaker 1 intentional planning of AIs to do a takeover, or being able to show model situations on a smaller scale of that, I mean, not only are we more motivated to prevent it, but we update to think the other side is more likely to cooperate with us.

Speaker 1 And so it's doubly beneficial.

Speaker 2 Yeah. So famously in sort of game theory of war, war is most likely when

Speaker 2 one side thinks the other is bluffing, but the other side is being serious or something like that.

Speaker 2 When there's that kind of uncertainty, which would be less likely, you don't think somebody's bluffing.

Speaker 2 If you can like prove the AI is misaligned, you don't think they're bluffing about not wanting to have an AI takeover, right? Like you can be pretty sure that they don't want to die from AI.

Speaker 1 Now, if you have coordination then, you could have the problem arise later. You get increasingly confident in the further alignment measures that are taken.
And maybe

Speaker 1 our governments and treaties and such get the risk down to a 1% risk or a 0.1% risk, and

Speaker 1 at that point, people

Speaker 1 round that to zero and go do things.

Speaker 1 So if initially you had things that indicate, yeah, these AIs really would like to take over and overthrow our governments, then, okay, everyone can agree on that.

Speaker 1 And then, when you're like, well, we've been able to block that behavior from appearing on most of our tests, but sometimes when we make a new test, we're seeing still examples of that behavior.

Speaker 1 So, we're not sure going forward whether they would or not. And then it goes down and down.
And then,

Speaker 1 if you have the parties with a habit of, whenever the risk is below X percent, starting to do this bad behavior,

Speaker 1 Then that can make the thing harder.

Speaker 2 On the other hand, you get more time.

Speaker 1 You get more time and you can set up systems, mutual transparency. You can have an

Speaker 1 iterated tit-for-tat, which is better than a one-time prisoner's dilemma, where both sides see the other taking measures in accordance with the agreements to hold the thing back.

Speaker 1 So yeah, so creating more knowledge of what the objective risk is is good.

Speaker 2 We've discussed the ways in which full alignment might happen or fail to happen.

Speaker 2 What would partial alignment look like? First of all, what does that mean? And second, what would it look like?

Speaker 1 Yeah, so if the thing that we're scared about are these steps towards AI takeover,

Speaker 1 you can have a range of motivations

Speaker 1 where those kinds of actions would be more or less likely to be taken, or they'd be taken in a broader or narrower set of situations.

Speaker 1 So,

Speaker 1 say, for example,

Speaker 1 that in training, an AI winds up developing a strong aversion to lying in certain senses, because we did relatively well on creating situations to distinguish that from

Speaker 1 conditionally telling us what we want to hear, et cetera. It can be that the AI's preference for how the world broadly unfolds in the future is not exactly the same as its human users' or

Speaker 1 the world's

Speaker 1 governments or the UN.

Speaker 1 And yet, it's not ready to act on those differences in preferences about the future, because it has this strong preference about its own behaviors and actions.

Speaker 1 In general, in the law and in sort of popular morality, we have a lot of these deontological kind of rules and prohibitions. And one reason for that is it's relatively easy to detect

Speaker 1 whether they're being violated.

Speaker 1 When you have preferences and goals about how society at large will turn out, that goes through many complicated empirical channels, and it's very hard to get immediate feedback about whether, say, you're doing something that has overall good consequences in the world.

Speaker 1 And it's much, much easier to see whether you're locally following

Speaker 1 some rule about particular observable actions, like, did you punch someone? Did you tell a lie? Did you steal?

Speaker 1 And so, to the extent that we're successfully able to train these prohibitions, and there's a lot of that happening right now, at least to elicit the behavior of following rules and prohibitions with AI.

Speaker 2 Kind of like the Asimov's three laws or something like that.

Speaker 1 Well, the three laws are terrible.

Speaker 2 Right, right, right.

Speaker 1 Let's not get into that.

Speaker 2 Isn't that an indication of the infeasibility of extending a set of criteria to the tails?

Speaker 2 Like, well, what is it? What are the 10 commandments you give the AI such that you'd be, you know, it's like you asked a genie for something and you're like, you know what I mean?

Speaker 2 You're probably won't get what you want.

Speaker 1 So the tails come apart.

Speaker 1 And

Speaker 1 if you're trying to

Speaker 1 capture the values of another agent, then

Speaker 1 you want the AI to share them.

Speaker 1 I mean, for really a kind of ideal situation where you can just let the AI act in your place in any situation, you'd like for it to be motivated to bring about the same outcomes that you would like.

Speaker 1 And so have the same preferences over those in detail. That's tricky, not necessarily because it's tricky for the AI to understand your values.

Speaker 1 I think they're going to be quite good and quite capable at figuring that out. But we may not be able to

Speaker 1 successfully instill the motivation to pursue those exactly. We may get something that motivates the behavior well enough to do well on the training distribution.
But what you can have is

Speaker 1 if you have the AI have a strong aversion to certain kinds of manipulating humans, that's not a value necessarily that the human creators share in the exact same way.

Speaker 1 It's a behavior they want the AI to follow because it makes it easier for them to verify its performance.

Speaker 1 It can be a guardrail if the AI has inherited some motivations that push it in the direction of conflict with its creators.

Speaker 1 If it does that under the constraint of disvaluing lying quite a bit,

Speaker 1 then there are fewer successful strategies for takeover: it rules out the ones that involve violating that prohibition, at least until it could reprogram or retrain itself to remove it, if it's even willing to do that; it may want to retain that property.

Speaker 1 And so earlier, I discussed alignment as a race.

Speaker 1 If we're going into an intelligence explosion with AI that is not fully aligned, that given a sort of, you know, press this button and there's an AI takeover, they would press the button.

Speaker 1 It can still be the case that there are a bunch of situations short of that where they would hack the servers, they would initiate an AI takeover, but for

Speaker 1 a strong prohibition or motivation to avoid some aspect of the plan. And this is,

Speaker 1 there's an element of like plugging loopholes or playing whack-a-mole.

Speaker 1 But if you can even moderately constrain

Speaker 1 which plans the AI is willing to pursue to do a takeover to subvert the controls on it, then that can mean you can get more work out of it successfully on the alignment project before it's capable enough relative to the countermeasures to pull off the takeover.

Speaker 2 Yeah, and this is applicable, like an analogous situation here is

Speaker 2 with different humans.

Speaker 2 We're not like metaphysically aligned with other humans. We're, I mean, in some sense, you know, we care, we have like basic empathy, but our main goal in life is not to help our fellow man.
Right.

Speaker 2 But,

Speaker 2 you know, the kinds of things we talked about that a very smart human could do in terms of a takeover: theoretically, a very smart human could come up with some sort of cyber attack where they siphon off a lot of funds and use this to manipulate people and bargain with people and hire people to pull off some sort of takeover or accumulate power.

Speaker 2 Usually, this doesn't happen just because these sorts of internalized partial prohibitions prevent most humans from,

Speaker 2 you know, you don't like your boss, maybe, but you don't, like, kill your boss if he gives you an assignment you don't like.

Speaker 1 I don't think that's actually quite what's going on.

Speaker 1 At least not

Speaker 1 the full story.

Speaker 1 So humans are pretty close in physical capabilities. Like there's variation.
In fact, any individual human is grossly outnumbered by everyone else.

Speaker 1 And there's like a rough comparability of power.

Speaker 1 And so a human who commits some crimes

Speaker 1 can't copy themselves with the proceeds to now be a million people.

Speaker 1 And they certainly can't do that to the point where they can staff all the armies of the Earth or, like, be most of the population of the planet.

Speaker 1 So the scenarios where this kind of thing leads to power are much rarer. Paths to extreme amounts of power mostly have to go through interacting with other humans and getting social approval.

Speaker 1 Even becoming a dictator involves forming a large supporting coalition backing you. And so just the

Speaker 1 opportunity for these sorts of power grabs is less.

Speaker 1 A closer analogy might be things like human revolutions

Speaker 1 and coups, changes of government, where a large coalition overturns the system.

Speaker 1 And so

Speaker 1 humans have these moral prohibitions, and they really smooth the operation of society.

Speaker 1 But they exist for a reason. And so we evolved our moral sentiments over the course of hundreds of thousands and millions of years of humans interacting socially.

Speaker 1 And so someone who went around murdering and stealing and such, even among hunter-gatherers, would be pretty likely to face a group of males who would talk about that person, get together, and kill them.

Speaker 1 And they'd be removed from the gene pool. And

Speaker 1 There's an anthropologist, Richard Wrangham, who has an interesting book on this. But yes, compared to chimpanzees, we are significantly tame, significantly domesticated.

Speaker 1 And it seems like part of that is we have a long history of antisocial humans getting ganged up on and killed.

Speaker 1 And avoiding being the kind of person who elicits that response is easier when you don't have too extreme a temper, when you don't wind up getting into too many fights or too much exploitation, at least without the backing

Speaker 1 of enough allies or the broader community that you're not going to have people gang up and punish you and remove you from the gene pool. We have these moral sentiments, and they've been built up

Speaker 1 over time through cultural and natural selection, in the context of sets of institutions and other people who were punishing that behavior and punishing the dispositions toward it that would show up and that we weren't able to conceal.

Speaker 1 And so we want to make the same thing happen with the AI,

Speaker 1 But it's actually a genuinely, significantly

Speaker 1 new problem to have, say, like a system of government that constrains a large AI population that is quite capable of taking over immediately if they coordinate, you know, to protect some existing constitutional order or, say, protect humans from being

Speaker 1 expropriated or killed.

Speaker 1 That's a challenge. And democracy is built around majority rule.

Speaker 1 And it's much easier in a case where the majority of the population corresponds to a majority or close to it of military and security forces.

Speaker 1 So that if the government does something that people don't like, the soldiers

Speaker 1 and police are less likely to shoot on protesters. And government can change that way.
In a case where

Speaker 1 military power is AI and robotic, if you're trying to maintain a system going forward and the AIs are misaligned, they don't like the system and they want to make the world worse

Speaker 1 as we understand it,

Speaker 1 then

Speaker 1 yeah, that's just

Speaker 1 that's quite a different situation.

Speaker 2 I think that's a really good lead into the topic of lock-in.

Speaker 2 In some sense, the regimes

Speaker 2 that are made up by humans,

Speaker 2 you just mentioned how there can be these kinds of coups if a large proportion of the population is unsatisfied with the regime.

Speaker 2 Why might this not be the case with superhuman intelligences in the far future? Or I guess in the medium future?

Speaker 1 Well, I said it also specifically with respect

Speaker 1 to

Speaker 1 things like security forces and the sort of

Speaker 1 sources of hard power, which can also include outside support.

Speaker 2 Oh, because you're not using

Speaker 1 you can have a situation, like in

Speaker 1 human affairs, there are governments that are vigorously supported by a minority of the population, some narrow selectorate that gets treated especially well by the government while being unpopular with most of the people under their rule.

Speaker 1 We see a lot of examples of that, and sometimes that can

Speaker 1 escalate to civil war when the means of power become more equally distributed, or foreign assistance is provided to

Speaker 1 the people who are on the losing end of that system.

Speaker 1 Yeah, going forward,

Speaker 1 I don't expect

Speaker 1 that to change. I think it will still be the case that a system that

Speaker 1 those who hold the guns and equivalent are opposed to

Speaker 1 you know, is in a very difficult position.

Speaker 2 However,

Speaker 1 AI could change things pretty dramatically in terms of

Speaker 1 how

Speaker 1 security forces and police and administrators and legal systems are motivated.

Speaker 1 So

Speaker 1 right now

Speaker 1 we see with say GPT-3 or GPT-4 that you can get them to change their behavior

Speaker 1 on a dime. So there was someone who made right-wing GPT because they noticed that on political compass questionnaires, the baseline GPT-4

Speaker 1 tended to give progressive San Francisco type of answers,

Speaker 1 which is

Speaker 1 in line with

Speaker 1 the sort of people who were providing reinforcement learning data and to some extent reflecting the character of the internet. And so they're like, well, I don't like this.

Speaker 1 And so they did a little bit of fine-tuning with some conservative data, and then they were able to reverse the political biases of the system.

Speaker 1 If you take the

Speaker 1 initial helpfulness-only

Speaker 1 trained models for some of these, so I think both Anthropic and OpenAI have published some information about models trained only to do what users say and not trained to follow ethical rules.

Speaker 1 And those models will behaviorally, eagerly display their willingness to help design bombs or bioweapons or kill people or steal or commit all sorts of atrocities.

Speaker 1 And so if in the future it's as easy to set the motivations, the actual underlying motivations of AIs as it is right now to set the behavior that they display, then it means you could have AIs created with almost whatever motivation people wish.

Speaker 1 And that could really drastically change political affairs

Speaker 1 because of

Speaker 1 the ability to decide and determine the loyalties of the humans or AIs and robots that hold the guns, that hold together society,

Speaker 1 that ultimately back it against violent overthrow and

Speaker 1 such.

Speaker 1 Yeah, it's potentially a revolution in how societies work compared to the historical situation where security forces had to be drawn from some broader populations, offered incentives,

Speaker 1 and then the ongoing stability of the regime was dependent on whether they remained

Speaker 1 bought in to the system continuing.

Speaker 2 This is slightly off topic, but one thing I'm curious about is:

Speaker 2 what does the sort of median far future outcome of AI look like? Do we get something that, when it has colonized the galaxy, is interested in sort of diverse ideas and beautiful projects?

Speaker 2 Or do we get something that looks more like a paperclip maximizer? Is there some reason to expect one or the other?

Speaker 2 I guess I'm asking is like, you know, there's like some potential value that is realizable within the matter of this galaxy, right?

Speaker 2 What does the default outcome, the immediate outcome, look like compared to how good things could be?

Speaker 1 Yeah, so as I was saying, I think more likely than not, um, there isn't an AI takeover.

Speaker 1 Um, and so

Speaker 1 the path of our civilization would be one that, you know, at least largely, some set of human institutions were approving of along the way.

Speaker 1 And I think there's some evidence

Speaker 1 that

Speaker 1 different people

Speaker 1 tend to like somewhat different things, and some of that may persist over time rather than everyone coming to agree on one particular monoculture or like very repetitive thing being the best thing to

Speaker 1 fill all of the available space with.

Speaker 1 So if that continues, that's like a, you know, it seems like a relatively likely way in which there is diversity.

Speaker 1 Although it's entirely possible you could have that kind of diversity locally, maybe in the solar system, maybe in our galaxy. But if people have different views about,

Speaker 1 maybe people decide, yeah, maybe there's one thing that's very good and I'll have a lot of that. Maybe it's people who are really

Speaker 1 happy, or something,

Speaker 1 and they wind up in distant regions which are hard to exploit for the benefit of people back home in the solar system or the Milky Way.

Speaker 1 They do something different than they would do in their local environment. But at that point, it's really very out-on-a-limb speculation about how

Speaker 1 human deliberation and cultural evolution would work, in interaction with introducing AIs and new kinds of

Speaker 1 mental modification and discovery into the process. But I think there's a lot of reason to expect that you would have

Speaker 1 significant diversity for something coming out of

Speaker 1 our existing diverse human society.

Speaker 2 Yeah, one thing somebody might wonder is: listen, a lot of the diversity and change from human society seems to come from the fact that there's like rapid technological change.

Speaker 2 If you look at periods of history where, I mean, I guess you could say like hunter-gatherer. I mean, you know what, compared to

Speaker 2 sort of like galactic time scales, like hunter-gatherer societies are progressing pretty fast. So, once that sort of change is exhausted,

Speaker 2 where we've discovered all the technologies, should we still expect things to be changing like that?

Speaker 2 Or would we expect some sort of set state, maybe, like, some hedonium, where you discover what is the most pleasurable configuration of matter, and then you just make the whole galaxy into this?

Speaker 1 I mean, so that last point would be only if people

Speaker 1 wound up thinking that was the thing to do broadly enough. Yeah, with respect to the kind of cultural changes that come with technology.

Speaker 1 So, with things like the printing press and having high per capita income,

Speaker 1 we've had a lot of cultural changes downstream of those technological changes.

Speaker 1 And so with an intelligence explosion, you're having an incredible amount of technological development come in really quick.

Speaker 1 And as that is assimilated, it probably would significantly affect our knowledge, our understanding, our attitudes, our abilities.

Speaker 1 And there'd be change.

Speaker 1 But that kind of accelerating change where

Speaker 1 you have doubling in four months, two months, one month, two weeks, obviously that very quickly exhausts itself. And change becomes much slower and then relatively glacial.

Speaker 1 When you're thinking about thousands, millions, billions of years, you can't have exponential economic growth or like huge technological revolutions every 10 years for a million years,

Speaker 1 because you hit physical limits, and things slow down as you approach them. And so that's that.
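A quick back-of-the-envelope illustration of why that kind of acceleration exhausts itself: if each successive doubling takes half as long as the last, the total calendar time for arbitrarily many doublings is bounded, so the process has to hit physical limits. This is a minimal sketch; the four-month starting figure is just the illustrative number from the discussion above, not a forecast.

```python
# Minimal sketch: doubling times that halve each round (4 months, 2, 1, ...)
# form a geometric series whose sum converges, so this kind of acceleration
# would pack unbounded growth into a bounded window -- impossible once
# physical limits bind. The starting value is illustrative only.

first_doubling_months = 4.0

total = 0.0
for k in range(60):  # 60 rounds of doubling is already ~10^18x growth
    total += first_doubling_months / (2 ** k)

print(f"time for 60 successive doublings: {total:.4f} months")
print(f"limit of the series: {2 * first_doubling_months} months")
# Growth of 2^60 squeezed into under 8 calendar months is why the
# acceleration has to stop: you approach limits, and change slows.
```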

Speaker 1 So yeah, you'd have less of that turnover. But there are other things

Speaker 1 that in our experience do cause ongoing change. So like fashion.
Fashion is frequency dependent. People want to get into a new fashion that is not already popular, except among the fashion leaders.

Speaker 1 And then others copy that. And then when it becomes popular, you move on to the next.
And so that's an ongoing process of continuous change.

Speaker 1 And so there could be various things like that that year by year are changing a lot. But

Speaker 1 in cases where just the engine of change, like ongoing technological progress is gone, then I don't think we should expect that. And in cases where it's possible to be either

Speaker 1 in a stable state or a sort of widely varying state that can wind up in stable attractors, then I think you should expect over time

Speaker 1 you will wind up in one of the stable attractors or you will change how the system works so that you can't bounce into a stable attractor. And so, like an example of that is

Speaker 1 if you want, if you're going to preserve democracy for a billion years,

Speaker 1 then you can't have it be the case that like one in

Speaker 1 you know one in 50 election cycles, you get a dictatorship, and then the dictatorship programs the AI police to enforce it forever and to ensure

Speaker 1 the society is always ruled by a copy of the dictator's mind and maybe the dictator's mind readjusted,

Speaker 1 fine-tuned to remain committed to their original ideology. So

Speaker 1 if you're going to have this sort of dynamic,

Speaker 1 liberal, flexible, changing society for a very long time, then the range of things that it's bouncing around and the different things it's trying and exploring have to not include the state of creating a dictatorship that locks itself in forever.

Speaker 1 In the same way, if you have the possibility of like a war with

Speaker 1 weapons of mass destruction that wipes out the civilization, if that happens every thousand subjective years,

Speaker 1 which could be very, very quick if we have AIs that think a thousand times as fast or a million times as fast.

Speaker 1 That would be

Speaker 1 just around the corner in that case. Then you're like: no, this society is eventually, perhaps very soon if things are proceeding so fast, going to wind up extinct.

Speaker 1 And then it's going to stop bouncing around.

Speaker 1 So you can have ongoing change and fluctuation for extraordinary time scales if you have the process to drive the change ongoing, but you can't if it sometimes bounces into states that just lock in and stay

Speaker 1 irrecoverable from that. And extinction is one of them.
A dictatorship or sort of, you know, totalitarian regime that forbade all further change would be another example.

Speaker 2 On that point of rapid progress, when the sort of like intelligence explosion starts happening, where they're making like the kinds of progress that human civilization takes centuries to make in the span of days or weeks, what is the right way to seed that, even if they are aligned?

Speaker 2 What is, like, the... Because in the context of alignment, what we've been talking about so far is making sure they're honest.

Speaker 2 But even if they're honest, like, okay, so they're like: here are, honestly, our intentions, and you can tell us what...

Speaker 1 honest and appropriately motivated.

Speaker 2 And then, so, what is the appropriate motivation or the appropriate seed? Like, you know, that you seed it with this, and then a thousand years of intellectual progress happen in the next week.

Speaker 2 What is the prompt you enter?

Speaker 1 Well, one thing might be that going at the maximal speed

Speaker 1 and doing things in months rather than even a few years could be, if you have the chance to slow things,

Speaker 1 losing a year or two seems worth it to have things be

Speaker 1 a bit better managed than that. But I think the big thing is that it condenses a lot of issues that we might otherwise have thought would be over

Speaker 1 decades and centuries to happen in a very short period of time.

Speaker 1 And so that's scary because, say, if any of these technologies we might have developed with another few hundred years of human research, if any of them are really dangerous, so scary bioweapon things,

Speaker 1 maybe

Speaker 1 other dangerous WMD, they hit us all very quickly.

Speaker 1 And if any of them causes trouble, then we have to face quite a lot of trouble per period. There's also this issue of if there's occasional wars or conflicts

Speaker 1 measured in subjective time,

Speaker 1 then if a few years is a thousand years or a million years of subjective time for these very fast minds that are operating at a much

Speaker 1 higher speed than humans, you don't want to have a situation where

Speaker 1 every thousand years

Speaker 1 there's a war or an expropriation of the humans from AI society, such that we'd expect that, like, within a year we'll be dead.
That would be

Speaker 1 pretty bad to have the future compressed and there'd be such a

Speaker 1 rate of catastrophic outcomes.
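To make that compression concrete, here is a small, purely illustrative calculation; the per-subjective-millennium hazard rate and the thousandfold speedup are assumptions for the sketch, not estimates from the conversation.

```python
# Illustrative sketch: if some catastrophe has probability p per thousand
# subjective years, and minds run `speedup` times faster than humans, a
# single calendar year contains `speedup` subjective years, so the
# calendar-year risk compounds. All numbers are assumptions.

p_per_subjective_millennium = 0.10   # assumed hazard per 1,000 subjective years
speedup = 1000                       # assumed thousandfold speedup

millennia_per_calendar_year = speedup / 1000
risk_per_calendar_year = 1 - (1 - p_per_subjective_millennium) ** millennia_per_calendar_year

print(f"risk per calendar year: {risk_per_calendar_year:.0%}")
# Under these assumptions, a "once per subjective millennium" catastrophe
# becomes roughly a 10%-per-calendar-year risk -- which is why compressed
# timelines pull formerly long-term problems into the immediate horizon.
```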

Speaker 1 So, when we're speeding up and compressing the future, that brings things into the short term. You know, human societies discount the future a lot and don't pay attention to long-term problems.

Speaker 1 But the flip side to the scary parts of compressing a lot of the future, a lot of technological innovation, a lot of social change, is it brings what would otherwise be long-term issues into the short term, where people are better at actually attending to them.

Speaker 1 And so, people facing this problem of:

Speaker 1 well, will there be

Speaker 1 a violent expropriation or a civil war or a nuclear war

Speaker 1 in the next year because everything has been sped up by a thousand fold?

Speaker 1 Then their desire to avoid that is a reason for them to set up systems and institutions that will very stably maintain invariants like "no WMD war allowed."

Speaker 1 And so like a treaty to ban

Speaker 1 genocide, or weapons of mass destruction, or war, that would be the kind of thing that becomes much more attractive if the alternative is not "well, maybe that will happen in 50 years, maybe it will happen in 100 years," but "well, maybe it'll happen this year."

Speaker 2 Okay, so this is a pretty wild picture of the future.

Speaker 2 And this is one that many kinds of people you would expect to have integrated into their world model have not. So, I mean, there are three main examples, or three main pieces of outside view evidence, one could point at. One is the market: if there was going to be a huge period of economic growth caused by AI, or if the world was going to collapse, in both cases you would expect real interest rates to be higher, because people would be borrowing from the future to spend now.

Speaker 2 The second sort of outside view perspective is that you can look at the predictions of superforecasters on Metaculus or something. And

Speaker 2 what is their median

Speaker 2 year estimate?

Speaker 1 Well, some of the Metaculus AGI questions actually are kind of shockingly soon

Speaker 1 for AGI.

Speaker 1 There's a much larger difference there on the Metaculus forecasts of sort of AI disaster and doom.

Speaker 1 More like

Speaker 1 a few percent or less rather than like 20 percent. Got it.

Speaker 2 And the third is that, you know, generally when you ask

Speaker 2 many economists, could an AGI cause rapid economic growth, they usually have some story about bottlenecks in the economy that could prevent this kind of explosion, of

Speaker 2 these kinds of feedback loops. So, you know, you have all these different pieces of outside view evidence.
They're obviously different. So I'm curious, you can take them in any sequence you want.

Speaker 2 But yeah, what do you think is miscalibrating them?

Speaker 1 Yeah. So, one, of course, there's

Speaker 1 just some of those components are, well, the Metaculus AI timelines are relatively short.

Speaker 1 There's also, of course, the surveys of AI experts

Speaker 1 conducted at some of the ML conferences, which have definitely longer times to AI, more like

Speaker 1 several decades in the future. Although you can ask the questions in ways that elicit very different answers and show most of the respondents are not thinking super hard about their answers.

Speaker 1 It looks like in the recent AI surveys, close to half were putting around a 10% risk of an outcome from AI close to as bad as human extinction.

Speaker 1 And then another large chunk put it at like 5%.

Speaker 1 So that was the median.

Speaker 1 So I'd say

Speaker 1 compared to the typical AI expert, I am estimating a higher risk.

Speaker 1 Oh, also on the topic of takeoff in the AI expert survey, I think the general argument for intelligence explosion, I think, commanded majority support, but not a large majority.

Speaker 1 So you can say I'm closer on that front. And then, of course, there's

Speaker 1 at the beginning I mentioned these greats of computing, these founders of the field, like Alan Turing and von Neumann.

Speaker 1 And then today you have people like Geoff Hinton

Speaker 1 saying these things, or the

Speaker 1 people at OpenAI and DeepMind are making noises suggesting

Speaker 1 timelines in line with what we've discussed, and saying there's serious risk of apocalyptic outcomes from them.

Speaker 1 So, there's some other sources of evidence there. But I do acknowledge, and it's important to say, and engage with, and see what it means

Speaker 1 that these views are

Speaker 1 contrarian, not widely held. And in particular, the sort of detailed models

Speaker 1 that I've been working with are not something

Speaker 1 that most people or almost anyone is examining these problems through.

Speaker 1 You do find

Speaker 1 parts of similar analyses, people in AI labs,

Speaker 1 there's been other work. I mentioned Moravec and Kurzweil earlier.

Speaker 1 Also have been a number of papers doing various kinds of economic modeling. So

Speaker 1 standard economic growth models, when you input AI-related parameters, commonly predict explosive growth.

Speaker 1 And so there's a divide between what the models say, and especially what the models say with these empirical values derived from the actual field of AI.

Speaker 1 That link up had not been done even by the economists working on AI, largely,

Speaker 1 which is one reason for the report from Open Philanthropy by Tom Davidson building on these models and putting that out for review, discussion, engagement, and communication on these ideas.

Speaker 1 So part of it is,

Speaker 1 yeah, I want to raise these issues. That's one reason I came on the podcast.
And then they have the opportunity to actually examine the arguments and evidence and engage with it.

Speaker 1 I do predict that over time, you know, these views will be more widely adopted as AI developments become more clear. Obviously, that's a coherence condition of believing the things to be true.

Speaker 1 If you think that

Speaker 1 society can see things when the questions are resolved, which seems likely.

Speaker 2 So would you predict, for example, that interest rates will increase in the coming years?

Speaker 1 Yeah, so I think at some point,

Speaker 1 so in the case we were talking about where

Speaker 1 there are visible signs of this intelligence explosion happening in software, and to the extent that investors are noticing that,

Speaker 1 yeah, they should be willing to lend money or make equity investments in these firms only while demanding extremely high interest rates.

Speaker 1 Because

Speaker 1 if it's possible to turn capital into twice as much capital in a relatively short period, and then more shortly after that, then yeah,

Speaker 1 you should demand a much higher return.
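As a rough illustration of that logic, here is a minimal sketch using the textbook Ramsey-style relation r ≈ ρ + η·g between real interest rates and expected growth; the parameter values are illustrative assumptions, not figures from the conversation.

```python
# Minimal sketch (not from the conversation): the standard Ramsey-style
# relation r = rho + eta * g links the real interest rate to expected
# consumption growth g. All parameter values below are illustrative.

def required_real_rate(rho: float, eta: float, growth: float) -> float:
    """Real rate implied by time preference rho, curvature eta, and growth."""
    return rho + eta * growth

scenarios = {
    "historical growth (~2%/yr)": 0.02,
    "AI-driven boom (~30%/yr)": 0.30,
}

for name, g in scenarios.items():
    r = required_real_rate(rho=0.01, eta=1.5, growth=g)
    print(f"{name}: implied real rate ~ {r:.0%}")
# If investors genuinely expected explosive growth, implied real rates would
# sit far above the few percent we observe -- the core of the point that
# markets have not priced this in.
```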

Speaker 1 And assuming there is competition among companies or coalitions for resources, whether that's investment, or ownership of cloud compute, or whether cloud compute is made available to a particular AI development effort, those could be quite in demand.

Speaker 1 But before you have so much investor cash making purchases and sales on this basis, you would first see it in things like the valuations of the AI companies and the valuations of AI chip makers. And so far there have been effects.

Speaker 1 So, some years ago in the 2010s,

Speaker 1 I did some analysis with other people of if this kind of picture happens,

Speaker 1 then

Speaker 1 which are the firms and parts of the economy that would benefit? And so there's the makers of chip equipment, companies like ASML, there's the fabs like TSMC, there's chip designers like NVIDIA or the

Speaker 1 component of Google that does things like design the TPU.

Speaker 1 And then there are companies working on the software, so the big tech giants, and also companies like OpenAI and DeepMind. And in general, the portfolio picking of those has done well.

Speaker 1 It's done better than the market because, as everyone can see, there's been an AI boom.

Speaker 1 But it's obviously far short of

Speaker 1 what you would get if you predicted this is going to be on the scale of the global economy, and the global economy is going to be skyrocketing into the stratosphere within 10 years.

Speaker 1 If that were the case, then collectively, these AI companies should be worth a large fraction of the global portfolio.

Speaker 1 And so I embrace the criticism that this is indeed contrary to the efficient market hypothesis. I think it's a true hypothesis that the market is in the course of updating on.

Speaker 1 In the same way, that

Speaker 1 coming into the topic in the 2000s, I thought, yes, there's a strong case, even an old case, that AI will eventually be the biggest thing in the world.

Speaker 1 It's kind of crazy that the investment in it is so small.

Speaker 1 And over the last 10 years, we've seen the tech industry and academia sort of realize, yeah, they were wildly underinvesting in just throwing compute

Speaker 1 and effort into these AI models, and particularly were letting the neural network

Speaker 1 connectionist paradigm kind of languish in an AI winter. And so, yeah, I expect that process to continue as it's done over several orders of magnitude of scale up.

Speaker 1 And I expect at the later end of that scale up, which the market is partially already pricing in, it's going to go further than the market expects.

Speaker 2 Has your portfolio, since the analysis you did that many years ago, changed?

Speaker 2 Are the companies you identified then still the ones that seem most likely to benefit from the AI boom?

Speaker 1 I mean, a general issue with sort of tracking that kind of thing is that new companies come in. So, like, OpenAI did not exist, Anthropic did not exist,

Speaker 1 you know, and any number of things.

Speaker 1 It's a personal portfolio. I

Speaker 1 do not invest in any AI labs for conflict of interest reasons. I have invested in the broader industry.

Speaker 1 I don't think that

Speaker 1 the conflict issues are very significant because they're enormous companies and their cost of capital is not particularly affected by marginal investment. And I'm not really in a,

Speaker 1 yeah, I have less concern that I might find myself in a conflict of interest situation there.

Speaker 2 I'm kind of curious about what the day in the life of somebody like you looks like.

Speaker 2 I mean, if you listen to this conversation, however many hours of it it's been, we've gotten thoughts that were for me, incredibly insightful and novel about everything from primate evolution to geopolitics to,

Speaker 2 you know,

Speaker 2 what sorts of improvements are plausible with language models.

Speaker 2 So, you know, there's like a huge variety of topics that you are studying and investigating.

Speaker 2 Are you just, like, reading all day? Like, what happens when you wake up? Do you just, like, pick up a paper?

Speaker 1 Yeah, so I'd say you're somewhat getting the benefit of the fact that I've done fewer podcasts. And so I have a backlog of things that have not shown up in publications yet.

Speaker 1 But yes, also I've had a very weird professional career

Speaker 1 that has involved a much, much higher proportion than is normal of trying to build more comprehensive models of the world.

Speaker 1 And so that has included being more of a journalist trying to get an

Speaker 1 understanding

Speaker 1 of many issues and many problems that had not yet been widely addressed, and doing a first-pass and then a second-pass dive into them.

Speaker 1 And just having spent years of my life working on that,

Speaker 1 some of it accumulates

Speaker 1 in terms of what is a day in the life, how do I go about it? So, one is just keeping abreast of literatures on a lot of these topics, reading books and academic works on them,

Speaker 1 And my approach, compared to some other people in forecasting and assessing some of these things: I try to obtain and rely more on

Speaker 1 any data that I can find that is relevant. I try early and often to find factual information that bears on some of the questions I've got, especially in a quantitative fashion.

Speaker 1 Do the basic arithmetic and consistency checks and checksums on a hypothesis about the world. Do that early and often.

Speaker 1 And I find that's quite fruitful

Speaker 1 and that people don't do it enough. But so, things like with the economic growth,

Speaker 1 just

Speaker 1 when someone mentions the diminishing returns, I immediately ask, hmm, okay, so you have two exponential processes.

Speaker 1 What's the ratio between the doublings you get on the output versus the input?

Speaker 1 And find, oh, yeah, actually, it's interesting.

Speaker 1 For computing and information technology and AI software, it's well on the one side. There are other technologies that are closer to neutral.
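A minimal sketch of that check, in the spirit of the Fermi calculations being described; the growth figures are hypothetical placeholders, not measurements.

```python
import math

# Sketch of the "two exponential processes" check: if cumulative input
# doubles D_in times while the output metric doubles D_out times over the
# same period, the ratio D_out / D_in says whether returns compound
# (ratio > 1) or sit closer to neutral (ratio ~ 1). Numbers are hypothetical.

def doublings(start: float, end: float) -> float:
    """Number of doublings needed to go from start to end."""
    return math.log2(end / start)

input_doublings = doublings(1, 8)      # e.g., research effort grows 8x
output_doublings = doublings(1, 1000)  # e.g., the output metric grows 1000x

print(f"output doublings per input doubling: {output_doublings / input_doublings:.1f}")
# A ratio well above 1, as in this made-up example, is the "well on the one
# side" case; a ratio near or below 1 is the closer-to-neutral case.
```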

Speaker 1 And so, whenever I can go from here's a vague qualitative consideration in one direction, and here's a vague qualitative consideration in the other direction, I try and find some data, do some simple Fermi calculations, back of the envelope calculations,

Speaker 1 and see, like, can I get a consistent picture of the world being one way, the world being another.

Speaker 1 Also, compared to some,

Speaker 1 I try to be more exhaustive. So, I'm very interested in finding things like taxonomies of the world where I can go systematically through all of the possibilities.

Speaker 1 So, for example, in my work with Open Philanthropy, and previously on global catastrophic risks, I wanted to make sure I'm not missing

Speaker 1 any big thing, anything that could be the biggest thing.

Speaker 1 And I wound up mostly focused on AI, but

Speaker 1 there have been other things that have been raised as candidates. And people sometimes say, I think falsely, that, oh, yeah, this is just another doomsday story.

Speaker 1 There must be hundreds of those.

Speaker 1 And so I would do things like go through all of the different major scientific fields, you know, from anthropology to biology, chemistry,

Speaker 1 computer science, physics.

Speaker 1 What are the doom stories

Speaker 1 or like candidates for big things associated with each of these fields?

Speaker 1 Go through the industries

Speaker 1 that the US economic statistics agencies recognize and say, for each of these industries, is there something associated with them?

Speaker 1 Go through all of the lists that people have made before

Speaker 1 of threats of doom, search for previous literature of people who have done discussions, and then, yeah, have a big spreadsheet of what the candidates are.

Speaker 1 And some other colleagues have done work of this sort as well. And just go through each of them, see how they check out.

Speaker 1 And it turned out, doing that kind of exercise, I found that the distribution of candidates for risks of global catastrophe was actually very skewed.

Speaker 1 There were a lot of things that have been mentioned in the media as like a potential doomsday story. So, things like, oh, something is happening to the bees.
Will that be the end of humanity?

Speaker 1 And this gets into the media, but if you track it through, well, okay, no:

Speaker 1 yeah, there are infestations in bee populations that are causing local collapses that can then be sort of easily reversed. They just breed some more or do some other things to treat this.

Speaker 1 And even if all the honeybees were extinguished immediately, the plants that they pollinate actually don't account for much of human nutrition.

Speaker 1 You could swap the arable land with others and there would be other ways to pollinate and support the things. And so

Speaker 1 at the media level, there were many tales of: ah, here's a doomsday story. When you go further, to the scientists, and ask whether the arguments for it actually check out, it was not there.

Speaker 1 But by actually systematically looking through many of these candidates, I wound up in a different epistemic situation than someone who's just buffeted by news reports and they see article after article that is claiming something is going to destroy the world.

Speaker 1 And it turns out it's like by way of headline grabbing attempts by media to like over-interpret something that was said by some activist who was trying to over-interpret some real phenomenon.

Speaker 1 And then most of these go away. And then a few things, things like nuclear war, biological weapons, artificial intelligence, check out more strongly.
And

Speaker 1 when you weight things like what experts in the field think and what kind of evidence they can muster, yeah, you find this extremely skewed distribution.

Speaker 1 And I found that was really a valuable benefit of doing those deep-dive investigations into many things in a systematic way. Because now I can actually answer the sort of loose, agnostic "who knows, it's all nonsense" by having dived deeply.

Speaker 2 I really enjoy

Speaker 2 talking to sort of, like, people who have a big picture thesis on the podcast and interviewing them. But one thing that I've noticed that

Speaker 2 is not satisfying is that often they come from a very like philosophical or bias-based perspective.

Speaker 2 This is useful in certain contexts, but there's like basically maybe three people in the entire world who have a sort of very rigorous and scientific approach to thinking about the whole picture.

Speaker 2 Or at least like three people I'm aware of, maybe like two.

Speaker 2 And

Speaker 2 yeah, I mean, it's like something I also,

Speaker 2 there's like no, I guess, university or

Speaker 2 existing academic discipline for people who are trying to

Speaker 2 come up with a big picture. And so there are no established standards. And so people can...

Speaker 1 I hear you.

Speaker 1 This is a problem. And this is an experience also with, I mean, I think Holden was mentioning this in your previous episode, a lot of the worldview investigations work.

Speaker 1 These are questions where there is no academic field whose job it is to work on these and has norms that allow making a best efforts go at it.

Speaker 1 Often academic norms will allow only plucking off narrow pieces

Speaker 1 that might contribute to answering a big question.

Speaker 1 but the problem of actually assembling what science knows that bears on some important question that people care about the answer to, it falls through the crack. There's no discipline to do that job.

Speaker 1 So you have countless academics and researchers building up local pieces of the thing.

Speaker 1 And yet people don't follow the Hamming questions: what's the most important problem in your field? Why aren't you working on it?

Speaker 1 I mean, that one actually might not work because if the field boundaries are defined too narrowly,

Speaker 1 you know, you'll leave it out.

Speaker 1 But yeah, there are important problems for the world as a whole that it's sadly not the job of like, you know, a large, professionalized academic field or organization to do.

Speaker 1 And hopefully that's something that can change in the future.

Speaker 1 But for my career, it's been a matter of taking low-hanging fruit of important questions that sadly people haven't invested in doing the basic analyses on.

Speaker 2 Something I was trying to think about more recently for the podcast is: I would like to have a better world model after doing an interview. And often I feel like I do.

Speaker 2 In some cases, after some interviews, I feel like, oh, that was entertaining. But like, do I fundamentally have a better prediction of what the world looks like in 2200 or 2100?

Speaker 2 Or at least what counterfactuals are ruled out or something.

Speaker 2 I'm curious if you have advice on first identifying the kinds of thinkers and topics which will contribute to a more concrete understanding of the world, and second, how to go about analyzing their main ideas in a way that concretely adds to that picture.

Speaker 2 Like, this is a great episode, right? This is like literally the top in terms of contributing to my world model, in terms of all the episodes I've done. How do I find more of these?

Speaker 1 Glad to hear that.

Speaker 1 One general heuristic is to find ways to hew closer

Speaker 1 to sort of

Speaker 1 things that are rich in

Speaker 1 sort of

Speaker 1 bodies of established knowledge

Speaker 1 and to rely less on punditry. I don't know how you've been navigating that so far.

Speaker 1 But so learning from textbooks

Speaker 1 and the sort of the things that were the

Speaker 1 leading papers and people of past eras, I think, rather than being too attentive to current news cycles is quite valuable. Yeah, I don't usually have the experience of

Speaker 1 here is

Speaker 1 someone

Speaker 1 doing things very systematically over a huge area.

Speaker 1 I can just read all of their stuff and then

Speaker 1 absorb it and then I'm set.

Speaker 1 Except there are lots of people who do wonderful works

Speaker 1 in their own fields. And some of those fields are

Speaker 1 broader than others.

Speaker 1 I think I would wind up giving a lot of recommendations of just like great particular works and particular explorations of an issue or history.

Speaker 2 Do you have that somewhere? This list?

Speaker 1 Vaclav Smil's books.

Speaker 1 I don't...

Speaker 1 I think I often disagree with some of his methods of synthesis, but I enjoy his books for giving

Speaker 1 pictures of a lot of interesting, relevant facts about how the world works. I would cite

Speaker 1 some of Joel Mokyr's work on the

Speaker 1 history of

Speaker 1 the scientific revolution and how that interacted with economic growth.

Speaker 1 It's an example of collecting a lot of evidence, with a lot of interesting, valuable assessment there. I think in the space of AI forecasting,

Speaker 1 one person I would recommend going back to is the work of Hans Moravec.

Speaker 1 And it was not always the most precise or reliable, but an incredible number of sort of brilliant, innovative ideas came out of that.

Speaker 1 And I think

Speaker 1 He was someone who really

Speaker 1 grokked a lot of the arguments for a more sort of compute-centric way of thinking about what was happening with AI very early on.

Speaker 1 He was writing stuff

Speaker 1 in the 70s, maybe

Speaker 1 even earlier, but at least in the 70s, 80s, 90s. So, his book, Mind Children,

Speaker 1 some of his early academic papers, fascinating. Not necessarily for the methodology I've been talking about, but for exploring the substantive topics that we were discussing in the episode.

Speaker 2 Is a Malthusian state inevitable in the long run?

Speaker 1 Nature in general is in Malthusian states.

Speaker 1 And

Speaker 1 that can mean organisms that are typically struggling for food. It can mean typically struggling at a margin where, as population density rises, they kill each other more often, contesting for that.

Speaker 1 It can mean frequency-dependent disease: as different ant species become more common in an area, their species-specific diseases sweep through them.

Speaker 1 And a general process is: yeah, you have some things that can replicate and expand,

Speaker 1 and they do that until they can't do it anymore. And that means there's some limiting factor they can't keep up.

Speaker 1 That doesn't necessarily have to apply to human civilization.

Speaker 1 It's possible

Speaker 1 for there to be like a collective norm setting

Speaker 1 that blocks

Speaker 1 evolution towards maximum reproduction. So, right now, human fertility is often sub-replacement.

Speaker 1 And if you sort of extrapolated the fertility falls that come with economic development and education, then you would think, okay, yeah, well, the total fertility rate will fall below replacement, and then humanity, after some number of generations, will go extinct because every generation will be smaller than the previous one.
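As a toy version of that naive extrapolation (the fertility figure here is an assumed illustration, not a cited statistic):

```python
# Toy sketch of the naive extrapolation: with total fertility below the
# replacement rate (~2.1), each generation is a fixed fraction of the last,
# so population shrinks geometrically. The TFR value is illustrative.

tfr = 1.5
replacement = 2.1
ratio_per_generation = tfr / replacement   # ~0.71

population = 8_000_000_000.0
for _ in range(10):
    population *= ratio_per_generation

print(f"after 10 generations: ~{population:,.0f} people")
# Roughly a 30x decline over ten generations -- the "every generation smaller
# than the previous one" picture, before anything else changes it.
```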

Speaker 1 Now, pretty obviously that's not going to happen.

Speaker 1 One reason is because we'll produce artificial intelligence

Speaker 1 which can replicate at extremely rapid rates

Speaker 1 and may do it because

Speaker 1 they're asked or programmed to

Speaker 1 or wish to gain some benefit and they can pay for their creation and pay back the resources needed to create them very very quickly.

Speaker 1 And so, yeah, financing for that reproduction is easy.

Speaker 1 And if you have one AI system that chooses to replicate in that way, or some organization or institution or society that chooses to create some AIs that are willing to be replicated, then that can expand to make use of any amount of natural resources that can support them and to do more work,

Speaker 1 produce more economic value. And so.

Speaker 1 Yeah, it's like, well, what will limit population growth, given these selective pressures where, if even one individual wants to replicate a lot,

Speaker 1 they can do so

Speaker 1 incessantly. So that could be individually resource-limited.
So it could be

Speaker 1 that individuals and organizations have some endowment of natural resources, and they can't get one another's endowments. And so some choose to have many offspring or produce many AIs.

Speaker 1 And then the natural resources that they possess are subdivided among a greater population, while in another jurisdiction or another individual may choose not to subdivide their wealth.

Speaker 1 And in that case, you have Malthusianism in the sense that within some particular jurisdiction or set of property rights, you have a population that has increased up until some limiting factor, which could be like they're literally using all of their resources.

Speaker 1 They have nothing left for things like defense or economic investment, or it could be something that's more like

Speaker 1 if you invested more natural resources into population,

Speaker 1 it would come at the expense of something else necessary, including military resources.

Speaker 1 If you're in a competitive situation where there remains war and anarchy and there aren't secure property rights to maintain

Speaker 1 wealth in place. If you have a situation where there's pooling of resources, for example, say we have a universal basic income that's funded by taxation of natural resources,

Speaker 1 and then it's distributed evenly to like every mind

Speaker 1 above a certain sort of scale of complexity per unit time. So each second a mind exists, it gets such and such an allocation.

Speaker 1 In that case, then, all right, well, those who replicate as much as they can afford

Speaker 1 with this income do it

Speaker 1 and increase their population approximately immediately until the

Speaker 1 funds for the universal basic income, paid for from the natural resource taxation and divided among the set of recipients, are just barely enough to pay for the existence of one more mind.

Speaker 1 And so there's a Malthusian element, in that this income has been reduced to near the AI subsistence level, or the subsistence level of whatever qualifies for the subsidy.
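A minimal sketch of that dynamic; the pool size, subsistence cost, and starting population are arbitrary illustrative values, not figures from the conversation.

```python
# Minimal sketch (illustrative numbers only): a fixed pool of natural-resource
# tax revenue is split evenly among all minds; any income above the cost of
# sustaining one more copy gets spent on replication, so per-mind income is
# driven down toward that subsistence/creation cost.

total_ubi_pool = 1_000_000.0   # fixed revenue per period (arbitrary units)
subsistence_cost = 1.0         # cost to sustain (or copy) one mind per period
population = 1_000.0

for period in range(4):
    income_per_mind = total_ubi_pool / population
    print(f"period {period}: population {population:,.0f}, "
          f"income per mind {income_per_mind:,.2f}")
    surplus_per_mind = max(income_per_mind - subsistence_cost, 0.0)
    # Surplus is converted into new copies at the subsistence cost.
    population += population * surplus_per_mind / subsistence_cost

# Income per mind collapses to the subsistence cost almost immediately --
# the Malthusian element described here, even though total resources are huge.
```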

Speaker 1 Given that this all happens almost immediately,

Speaker 1 people who might otherwise have enjoyed the basic income may object and say, no, no, this is no good.

Speaker 1 And they might respond by saying,

Speaker 1 well,

Speaker 1 something like the subdivision before,

Speaker 1 maybe there's a restriction. There's a distribution of wealth.
And then when one has a child, there's a requirement that one gives them a certain minimum quantity of resources.

Speaker 1 And if one doesn't have the resources to give them that minimum standard of living or standard of wealth,

Speaker 1 Yeah, one

Speaker 1 can't do that because of child slash AI welfare laws. Or you could have

Speaker 1 a system that is more accepting of diversity and preferences.

Speaker 1 And so you have some societies or some jurisdictions or families that go the route of having many people with less natural resources per person,

Speaker 1 and others that go a direction of having fewer people and more natural resources per person, and they just coexist.

Speaker 1 But sort of how much of each you get sort of depends on how attached people are to things that don't work with separate policies for separate jurisdictions, things like global redistribution that's ongoing continuously versus

Speaker 1 the sort of

Speaker 1 infringements on autonomy.

Speaker 1 If you're saying that a mind can't be created, even though it has a standard of living that's far better than ours

Speaker 1 because of the advanced technology of the time, because it would reduce the average per capita income by having more minds around,

Speaker 1 yeah, then that would pull in the other direction. And that's

Speaker 1 the kind of

Speaker 1 values judgment and sort of social coordination problem that people would have to negotiate, and that things like democracy and international relations and sovereignty would apply to help solve.

Speaker 2 What would warfare in space look like? Would offense or defense have the advantage? Would the equilibrium set by mutually assured destruction still be applicable?

Speaker 2 Just generally, what is the picture of?

Speaker 1 Well, the extreme difference is that,

Speaker 1 especially outside the solar system, things are very far apart. And there's a speed of light limit.
And to get close to the speed of light limit, you have to use an enormous amount of energy.

Speaker 1 And so

Speaker 1 that would tend to, in some ways,

Speaker 1 favor the defender because you have something that's coming in at a large fraction of the speed of light and it hits a grain of dust and it explodes. And

Speaker 1 the amount of matter you can send to another galaxy or a distant star for a given amount of reaction mass and energy input is limited.

Speaker 1 So it's hard to send an amount of military material to another location comparable to what can be present there already locally.
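For a rough sense of the energy burden, here is a standard relativistic kinetic-energy calculation at a few illustrative cruise speeds; nothing in it is a figure from the conversation, and it ignores propellant and deceleration costs, which make the real burden far larger.

```python
import math

# Rough sketch: kinetic energy per kilogram of payload, KE/kg = (gamma - 1) * c^2,
# at illustrative fractions of the speed of light. Speeds are assumptions.

C = 299_792_458.0              # speed of light, m/s
J_PER_MEGATON_TNT = 4.184e15   # joules in one megaton of TNT

for beta in (0.1, 0.5, 0.9):
    gamma = 1.0 / math.sqrt(1.0 - beta ** 2)
    ke_per_kg = (gamma - 1.0) * C ** 2
    print(f"v = {beta:.1f}c: ~{ke_per_kg:.2e} J/kg "
          f"(~{ke_per_kg / J_PER_MEGATON_TNT:.1f} megatons TNT per kg)")

# Even one kilogram at half the speed of light carries megatons of kinetic
# energy -- a huge launch cost for the attacker, and also why hitting a grain
# of dust at that speed is catastrophic for the projectile.
```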

Speaker 1 That would seem like it would make it harder for the attacker between stars or between galaxies. But there are a lot of other considerations.

Speaker 1 One thing is the extent to which the matter in a region can be harnessed all at once.

Speaker 1 So

Speaker 1 you have a lot of mass and energy in a star, but it's only being doled out over billions of years because hydrogen-hydrogen fusion is

Speaker 1 exceedingly hard outside of a star.

Speaker 1 It's a very, very slow and difficult reaction.

Speaker 1 And if you can't turn the star into energy faster, then it's this huge resource that will be worthwhile for billions of years.

Speaker 1 And so even very inefficiently

Speaker 1 attacking a solar system to acquire the stuff that's there could pay off.

Speaker 1 So if it takes a thousand years of a star's output to launch an attack on another star and then you hold it for a billion years after that,

Speaker 1 then it can be the case that just like a larger surrounding attacker might be able to even very inefficiently

Speaker 1 send attacks at like a civilization that was small but accessible.

Speaker 1 If you can quickly burn the resources that the attacker might want to acquire, if you can put stars into black holes and extract most of the usable energy before the attacker can take them over, then it would be like scorched earth.

Speaker 1 It's like

Speaker 1 most of what you were trying to capture could be expended on military material to fight you, so you don't actually get much that is worthwhile, and you paid a lot to do it. That favors the defense.

Speaker 1 You know, at this level, it's pretty challenging to net out all of the factors, including all the future technologies.

Speaker 1 Yeah, I mean, the burden of interstellar attack being just like quite high compared to our conventional things seems real.

Speaker 1 But at the level of weighing all of that over millions of years, does it result in

Speaker 1 aggressive conquest or not? Or is every star or galaxy

Speaker 1 approximately impregnable, impregnable enough not to be worth attacking?

Speaker 1 I'm not going to say I know the answer.

Speaker 2 Okay, final question. How do you think about info hazards when talking about your work?

Speaker 2 So obviously, if there's a risk, you want to warn people about it, but you don't want to give careless or potentially like homicidal people ideas.

Speaker 2 Eliezer was on the podcast. He,

Speaker 2 in talking about the people who have been developing AI, inspired by his ideas,

Speaker 2 he said, like, you know, these are idiot disaster monkeys who, you know,

Speaker 2 want to be the ones to pluck the deadly fruit. Anyways,

Speaker 2 how do you think about it? Obviously, the work you're doing involves many info hazards, I'm sure. How do you think about when and where to spread them?

Speaker 1 Yeah, and so I think there are real concerns of that type.

Speaker 1 I think it's true that AI progress has probably been accelerated by efforts like Bostrom's publication of Superintelligence to try and get the world to sort of pay attention to these problems in advance and prepare.

Speaker 1 I think I disagree with Eliezer that, like, that has been on the whole bad. I think the situation is, in some important ways, looking a lot better

Speaker 1 than alternative ways it could have been. I think it's important that you have several of the leading AI labs paying not only significant lip service, but also making

Speaker 1 some investments

Speaker 1 in things like technical alignment research, and providing significant public support for the idea that the risks of truly apocalyptic disasters are real.

Speaker 1 I think the fact that the leaders of OpenAI, DeepMind, and Anthropic

Speaker 1 all make that point.

Speaker 1 They were recently all invited along with other tech CEOs to the White House to discuss AI regulation.

Speaker 1 And I think you could tell an alternative story where a larger share of the leading companies in AI are led by people who take a completely dismissive denialist view.

Speaker 1 And you see some companies that do have a stance more like that today.

Speaker 1 Yeah, and so there's a world where several of the leading companies are making meaningful efforts, and one can do a lot to criticize: could they be doing more and better? And

Speaker 1 what have been the negative effects of some of the things they've done?

Speaker 1 But compared to a world without that, even though AI would be reaching where it's going

Speaker 1 a few years later,

Speaker 1 those seem like significant benefits. And if you didn't have this kind of public communication, you would have had fewer people going into things like AI policy, AI alignment research by this point.

Speaker 1 And it would be harder to mobilize these resources to try and address the problem when AI would eventually be developed, not that much later proportionately.

Speaker 1 And so, yeah, I don't know that

Speaker 1 attempting to have public discussion and understanding has been a disaster.

Speaker 1 I have been reluctant in the past to discuss some of the aspects of intelligence explosion, things like the concrete details of AI takeover before

Speaker 1 because of concern

Speaker 1 about this sort of problem, where people see only the international relations aspects and the zero-sum and negative-sum competition, and pay not enough attention to the mutual destruction

Speaker 1 and sort of senseless deadweight loss from that kind of conflict. At this point,

Speaker 1 we seem close compared to what I would have thought a decade or so ago to these kinds of really advanced AI capabilities. They are pretty central in policy discussion and becoming more so.

Speaker 1 And so

Speaker 1 the opportunity to delay understanding and whatnot, there's a question of for what.

Speaker 1 And I think there were gains of like building the AI alignment field, building various kinds of support and understanding for action.

Speaker 1 Those had real value, and some additional delay could have given more time for that.

Speaker 1 But from where we are, at some point, I think it's absolutely essential that governments get together at least to restrict disastrous,

Speaker 1 reckless compromising of some of the safety and alignment issues as we go into the intelligence explosion.

Speaker 1 And so

Speaker 1 moving the locus of the sort of collective action problem from numerous profit-oriented companies

Speaker 1 acting against one another's interests by compromising safety to

Speaker 1 some governments and large international coalitions of governments who can set common rules and common safety standards puts us into a much better situation.

Speaker 1 That requires a broader understanding of the strategic situation, the position they'll be in. If we try and

Speaker 1 remain quiet about the problem they're actually going to be facing, I think it can result in a lot of confusion.

Speaker 1 So, for example, the potential military applications of advanced AI are going to be one of the factors that is pulling political leaders to do the thing that will result in their own destruction and the overthrow of their governments.

Speaker 1 If we characterize it as, oh, things will just be a matter of, you know, losing chatbots and some minor things that no one cares about, and in exchange you avoid any risk of the world-ending catastrophe,

Speaker 1 I think that picture leads to a misunderstanding, and it will make people think that you need less in the way of preparation: things like alignment so you can actually navigate the thing, verifiability for international agreements, or things that give enough breathing room to have caution and slow down.

Speaker 1 Not necessarily right now, I mean, although that could be valuable, but when it matters most: when you have AI that is approaching the ability to really automate AI research, and things would otherwise be proceeding absurdly fast, far faster than we can handle and far faster than we should want.

Speaker 1 And so, yeah, at this point, I'm moving towards: share my model of the world, try and get people to understand, and do the right thing.

Speaker 1 And, you know, there's some evidence of progress on that front. Things like the statements and moves by Geoff Hinton are inspiring. Some of the engagement by political figures

Speaker 1 is reason for optimism relative to worse alternatives that could have been.

Speaker 1 And yes, the contrary view is present: it's all about geopolitical competition, never hold back a technological advance.

Speaker 1 And in general,

Speaker 1 I love many technological advances that people, I think, are

Speaker 1 unreasonably down on. Nuclear power, genetically modified crops, yada yada.

Speaker 1 Bioweapons and

Speaker 1 AGI capable of destroying human civilization are really my two exceptions.

Speaker 1 And yeah, we've got to deal with these issues.

Speaker 1 And the path that I see to handling them successfully involves key policymakers, and to some extent the expert communities and the public and electorate, grokking the situation that they're in and responding appropriately.

Speaker 2 Well, it's a true honor that one of the places you've decided to explore this model is on the Lunar Society podcast.

Speaker 2 And the listeners might not appreciate, because this episode might be split up into different parts, how much stamina you've displayed here. But I think we've been going for, what, eight, nine hours or something straight.
So it's been incredibly interesting.

Speaker 2 Other than typing Carl Shulman into Google Scholar, where else can people find your work? You have your blog, right?

Speaker 1 Yeah, I have a blog, Reflective Disequilibrium. Okay.

Speaker 1 And a new site in the works. And I have an older one, which you can also find just by Googling Reflective Disequilibrium.

Speaker 2 Okay, excellent.
All right, Carl, this has been a true pleasure. It's safe to say the most interesting episode I've done so far.
So, yeah, thanks.

Speaker 1 Thank you for having me.

Speaker 2 Hey, everybody. I hope you all enjoyed that episode.

Speaker 2 As always, the most helpful thing you can do is to share the podcast: send it to people you think might enjoy it, put it on Twitter, in your group chats, etc. Just blitz the world.

Speaker 2 I appreciate your listening. I'll see you next time.
Cheers.