
Joe Carlsmith - Otherness and control in the age of AGI
Chatted with Joe Carlsmith about whether we can trust power/techno-capital, how to not end up like Stalin in our urge to control the future, gentleness towards the artificial Other, and much more.
Check out Joe's sequence on Otherness and Control in the Age of AGI here.
Watch on YouTube. Listen on Apple Podcasts, Spotify, or any other podcast platform. Read the full transcript here. Follow me on Twitter for updates on future episodes.
Sponsors:
- Bland.ai is an AI agent that automates phone calls in any language, 24/7. Their technology uses "conversational pathways" for accurate, versatile communication across sales, operations, and customer support. You can try Bland yourself by calling 415-549-9654. Enterprises can get exclusive access to their advanced model at bland.ai/dwarkesh.
- Stripe is financial infrastructure for the internet. Millions of companies from Anthropic to Amazon use Stripe to accept payments, automate financial processes and grow their revenue.
If you’re interested in advertising on the podcast, check out this page.
Timestamps:
(00:00:00) - Understanding the Basic Alignment Story
(00:44:04) - Monkeys Inventing Humans
(00:46:43) - Nietzsche, C.S. Lewis, and AI
(01:22:51) - How should we treat AIs
(01:52:33) - Balancing Being a Humanist and a Scholar
(02:05:02) - Explore/exploit tradeoffs and AI
Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe
Listen and Follow Along
Full Transcript
Today, I'm chatting with Joe Carlsmith.
He's a philosopher, in my opinion, a capital G great philosopher. And you can find his essays at joecarlsmith.com.
So we have a GPT-4, and it doesn't seem like a paper clipper kind of thing. It understands human values.
In fact, you can have it explain why being a paper clipper is bad, or just tell you its opinions about being a paper clipper, or explain why the galaxy shouldn't be turned into paperclips. Okay, so what is happening such that... we have a system that takes over and converts the world into something valueless?
One thing I'll just say off the bat: when I'm thinking about misaligned AIs, or the type that I'm worried about, I'm thinking about AIs that have a relatively specific set of properties related to agency and planning and awareness and understanding of the world.
One is this capacity to plan: to make relatively sophisticated plans on the basis of models of the world, where those plans are being evaluated according to criteria. That planning capability needs to be driving the model's behavior.
So there are models that are sort of in some sense capable of planning, but it's not like when they give output, it's not like that output was determined by some process of planning. Like, here's what will happen if I give this output and do I want that to happen? The model needs to really understand the world, right? It needs to really be like, okay, here's what will happen.
I'm, you know, here I am, here's my situation. Here's like the politics of the situation.
Really having this kind of situational awareness, to be able to evaluate the consequences of different plans. Yeah.
I think the other thing is the verbal behavior of these models. When I talk about a model's values, I'm talking about the criteria that end up determining which plans the model pursues, right? And a model's verbal behavior, even if it has a planning process, which GPT-4, I think, doesn't in many cases, just doesn't need to reflect those criteria. Right.
And so, you know, we know that we're going to be able to get models to say what we want to hear. Right.
That is the magic of gradient descent. Modulo some difficulties with capabilities, you can get a model to output the behavior that you want.
If it doesn't, then you crank it till it does, right? And I think everyone admits that suitably sophisticated models are going to have a very detailed understanding of human morality. But the question is, what relationship is there between a model's verbal behavior, which you've essentially clamped (you're saying the model must say blah), and the criteria that end up influencing its choice between plans? And there, I'm pretty cautious about saying: well, when it says the thing I forced it to say, or gradient-descended it into saying, that's a lot of evidence about how it's going to choose in a bunch of different scenarios.
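To make "you crank it till it does" concrete, here is a minimal, hypothetical sketch of the kind of fine-tuning loop being described: keep nudging the weights until the model assigns high probability to the words we want it to say. The model name, prompt, and target text are illustrative stand-ins, not anything any lab actually uses.

```python
# Minimal sketch: shaping a model's verbal behavior with gradient descent.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.Adam(model.parameters(), lr=1e-5)

prompt = "Q: Should the galaxy be turned into paperclips?\nA:"
target = " No. That would destroy everything of value."

ids = tok(prompt + target, return_tensors="pt").input_ids
prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
labels = ids.clone()
labels[:, :prompt_len] = -100                          # compute loss only on the answer tokens

for step in range(100):
    loss = model(ids, labels=labels).loss              # cross-entropy on the desired words
    loss.backward()
    opt.step()
    opt.zero_grad()
    if loss.item() < 0.05:                             # "crank it till it does"
        break
```

Nothing in this loop touches the criteria the model uses to choose between plans; it only clamps what the model says, which is exactly the gap being described.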
I mean, for one thing, like even with humans, right, it's not necessarily the case that humans, their kind of verbal behavior reflects the actual factors that determine their choices.
They can lie, they can not even know what they would do in a given situation.
I mean, I think it is interesting to think about this in the context of humans, because there is that famous saying: be careful who you pretend to be, because you are who you pretend to be. And you do notice this with people. This is what culture does to children: you're trained, your parents will punish you if you start saying things that are not consistent with your culture's values, and over time you will become like your parents. By default it seems like it kind of works, and even with these models it seems like it kind of works. They don't really scheme against it. So why would this happen? For folks who are unfamiliar with the basic story, why are they taking over at all? What is literally any reason that they would do that?
So, you know, the general concern is: if you're really offering someone power for free, power almost by definition is useful for lots of values.
And if we're talking about an AI that really has the opportunity to take control of things, if some component of its values is focused on some outcome, like the world being a certain way, and especially in a longer-term way, such that the horizon of its concern extends beyond the period that the takeover plan would encompass, then the thought is: it's just often the case that the world will be more the way you want it if you control everything than if you remain the instrument of the human will, or of some other actor, which is sort of what we're hoping these AIs will be. So that's a very specific scenario.
And if we're in a scenario where power is more distributed and especially where we're doing like decently on alignment, right? And we're giving the AI some amount of inhibition about doing different things. And maybe we're succeeding in shaping their values somewhat.
Now it is, I think it's just a much more complicated calculus, right? And you have to ask, okay, like what's the upside for the AI? What's the probability of success for this like takeover path? How good is its alternative? So maybe this is a good point to talk about how you expect the difficulties of alignment to change in the future. We're starting off with something that has this intricate representation of human values.
And it doesn't seem that hard to sort of lock it into a persona that we are comfortable with. I don't know.
What changes? So, you know, why is alignment hard in general, right?
Like, let's say we've got an AI, and again, let's bracket the question of exactly how capable it will be, and really just talk about this extreme scenario where it really has this opportunity to take over, right? Which, you know, maybe we just don't want to deal with having to build an AI that we're comfortable being in that position, but let's just focus on it for the sake of simplicity, and then we can relax the assumption. Okay, so you have some hope.
You're like, I'm going to build an AI over here. So one issue is you can't just test.
You can't give the AI this literal situation, have it take over and kill everyone, and then be like, oops, update the weights. This is the thing Eliezer talks about: you care about its behavior in the specific scenario that you can't test directly.
Now, we can talk about whether that's a problem, but that's one issue: there's a sense in which this has to be kind of off-distribution, and you have to be getting some kind of generalization. You're training the AI on a bunch of other scenarios, and then there's this question of how it's going to generalize to the scenario where it really has this option.
So is that even true? Because when you're training it, you can be like, hey, here's a gradient update: if you get the takeover option on a platter, don't take it. And then in sort of red-teaming situations where it thinks it has a takeover opportunity, you train it not to take it.
And I just feel like if you did this to a child, like, I don't know, don't beat up your siblings, the kid will generalize to: if I'm an adult and I have a rifle, I'm not going to start shooting random people.
Yeah. Okay, cool.
So you had mentioned this thought: well, are you kind of what you pretend to be? Right? And will these AIs, you know, you train them to look kind of nice. Fake it till you make it. You were like, ah, we do this to kids. I think it's better to imagine kids doing this to us, right?
So, like, here's a sort of silly analogy for AI training, and there's a bunch of questions we can ask about its relationship to the real thing. Suppose you wake up and you're being trained, via methods analogous to contemporary machine learning, by Nazi children to be a good Nazi soldier or butler or what have you, right? And here are these children, and you really know what's going on, right? The children have a model spec, a nice Nazi model spec, and it's like: reflect well on the Nazi party, benefit the Nazi party, whatever.
And you can read it, right? You understand it. This is why I'm saying that when the model... you're like, oh, the models really understand human values.
It's like, yeah. Yeah, go ahead.
On this analogy, I feel like a closer analogy would be: in this analogy, I start off as something more intelligent than the things training me, with different values to begin with. So the intelligence and the values are baked in to begin with. Whereas the more analogous scenario is: I'm a toddler and initially I'm stupider than the children. And this would also be true, by the way, of a much smarter model: initially, the much smarter model is dumb, right? And then it gets smarter as you train it. So it's like a toddler, and the kids are like, hey, we're going to bully you if you're not a Nazi.
And as you grow up, you reach the children's level and then eventually become an adult. But through that process, they've been sort of bullying you, training you to be a Nazi.
And I think in that scenario, I might end up a Nazi. Yes.
I think that's... so yeah, I think basically a decent portion of the hope here, or an aim, should be that we're never in the situation where the AI really has very different values already, is quite smart, and really knows what's going on. Yeah.
And is now in this kind of adversarial relationship with our training process. Right.
So we want to avoid that, and I think it's possible we can, via the sorts of things you're saying.
So I'm not like, ah, that'll never work. The thing I just wanted to highlight was: if you get into that situation, and if the AI at that point is genuinely much, much more sophisticated than you and doesn't want to reveal its true values for whatever reason, then when the children show some obviously fake opportunity to defect to the allies, it's not necessarily going to be a good test of what you will do in the real circumstance, because you're able to tell.
There's also another way in which I think the analogy might be misleading. Now imagine that you're not just in a normal prison where you're totally cognizant of everything that's going on; sometimes they drug you, give you weird hallucinogens that totally mess up how your brain is working. A human adult in a prison is like: I know what kind of thing I am, nobody's really fucking with me in a big way.
Whereas I think an AI, even a much smarter AI, in a training situation is much closer to being constantly inundated with weird drugs and different training protocols. You're frazzled, because each moment, you know, it's closer to some sort of Chinese water torture kind of technique. I'm glad we're talking about the moral patienthood stuff later.
The chance to step back and ask, what's going on here? An adult in prison maybe has that, in a way that I don't know if these models necessarily have: that coherence, that stepping back from what's happening in the training process. Yeah.
I mean, I don't know. I think I'm hesitant to be like, it's like drugs for the model. But broadly speaking, I do basically agree that we have really quite a lot of tools and options for training AIs, even AIs that are somewhat smarter than humans. I do think you have to actually do it.
So compared to, maybe, Eliezer, who you had on, I think I'm much more bullish on our ability to solve this problem, especially for AIs that are in what I think of as the AI-for-AI-safety sweet spot, which is this band of capability where they're sufficiently capable that they can be really useful for strengthening various factors in our civilization that can make us safe: alignment work, control, cybersecurity, general epistemics, maybe some coordination applications, stuff like that.
There's a bunch of stuff you can do with AIs that in principle could differentially accelerate our security with respect to the sorts of considerations we're talking about. If you have AIs that are capable of that, and you can successfully elicit that capability in a way that's not being sabotaged or messing with you in other ways, and they can't yet take over the world or do some other really problematic form of power-seeking, then I think if we were really committed, we could really go hard: put in a ton of resources, really differentially direct this glut of AI productivity towards these security factors, and hopefully control and understand, you know, do a lot of these things you're talking about for making sure our AIs don't take over or mess with us in the meantime.
And I think we have a lot of tools there. I think you have to really try, though.
It's possible that those sorts of measures just don't happen, or don't happen at the level of commitment and diligence and seriousness that you would need, especially if things are moving really fast and there are other competitive pressures. And, you know, the compute: it's going to take compute to do all these intensive experiments on the AIs and stuff.
And that compute, we could use for experiments for the next scaling step and stuff like that. So I'm not here saying this is impossible, especially for that band of AIs.
It's just, I think you have to try really hard. Yeah, yeah.
I mean, I agree with the sentiment of: obviously, approach this situation with caution. But I do want to point out the ways in which the analyses we've been using have been sort of maximally adversarial. So, for example, going back to the adult getting trained by Nazi children: maybe the one thing I didn't mention, the difference in the situation, which is maybe what I was trying to get at with the drug metaphor, is that when you get an update, it's much more directly connected to your brain than a reward or punishment a human gets. It's literally a gradient update, down to the parameter: how much did this parameter contribute to you putting out this output rather than that output? And each parameter, we're going to adjust to the exact floating point number that calibrates it to the output we want.
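As a toy, hypothetical illustration of that "down to the parameter" point: a single gradient step assigns each individual parameter its own share of the blame for the output and adjusts it by its own exact floating-point amount. This is a two-parameter toy model, nothing like an actual training run.

```python
import torch

w = torch.tensor([0.5, -1.2], requires_grad=True)  # two "parameters"
x = torch.tensor([1.0, 2.0])                        # an input
desired = torch.tensor(3.0)                         # the output we want

output = (w * x).sum()                              # model's current output
loss = (output - desired) ** 2                      # distance from the desired output
loss.backward()                                     # per-parameter blame assignment

with torch.no_grad():
    w -= 0.01 * w.grad                              # each parameter nudged by its own exact amount
print(w)  # tensor([0.5980, -1.0040], ...): both parameters moved toward producing 3.0
```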
So I just want to point out that we're coming into the situation pretty well positioned. It does make sense, of course, if you're talking to somebody at a lab, to say, hey, really be careful. But for a general audience: should I be, I don't know, scared witless? The degree to which you should be scared should track things that actually have a chance of happening.
Like, you should be scared about nuclear war. But in the sense of, should you be scared witless? No. You're coming in with an incredible amount of leverage over the AIs in terms of how they will interact with the world, how they're trained, what default values they start with.
So look, I think it is the case that by the time we're building superintelligence, we'll have much better... I mean, even right now, when you look at labs talking about how they're planning to align the AIs, no one is saying, we're going to do RLHF.
At the least, you're talking about scalable oversight. You have some hope about interpretability. You have automated red-teaming. You're using the AIs a bunch. And hopefully humans are doing a bunch more alignment work. I also personally am hopeful that we can successfully elicit from various AIs a ton of alignment-work progress. So, yeah, there are a bunch of ways this can go.
And, you know, I'm not here to tell you 90% doom or anything like that. I do think, though, that the basic reason for concern is if you're really imagining that we're going to transition to a world in which we've created these beings that are just vastly more powerful than us.
And we've reached the point where our continued empowerment is just effectively dependent on their motives. It is this vulnerability to what the AIs choose to do. Do they choose to continue to empower us, or do they choose to do something else?
Or the institutions that have been set up.
I expect the U.S. government to protect me, not because of its quote unquote motives, but just because of like the system of incentives and institutions and norms that has been set up.
Yeah. So you can, you can hope that that will work too.
But there is, I mean, there is a concern. I mean, so I sometimes think about AI takeover scenarios via this spectrum of like, how much power did we kind of voluntarily transfer to the AIs? Like how much of our civilization did we kind of hand to the AIs intentionally by the time they sort of took over versus how much did they kind of take for themselves, right? And so I think some of the scariest scenarios are it's like a really, really fast explosion to the point where there wasn't even a lot of like integration of AI systems into the broader economy.
And, but there's this like really intensive amount of super intelligence sort of concentrated in a single project or something like that. And I think that's scary, you know, that's a quite scary scenario, partly because of the speed and people not having time to react.
And then there are intermediate scenarios where some things got automated: maybe people really handed the military over to the AIs, or automated science. There are some rollouts, and that's giving the AIs power that they don't have to take, or we're doing all our cybersecurity with AIs, stuff like that.
And then there are worlds where you more fully transitioned to a kind of world run by AIs, where in some sense humans voluntarily did that.
Look, if you think all this talk with Joe about how AI is going to take over human roles is crazy...
It's already happening. And I can just show you using today's sponsor, Bland AI.
Hey, is this Dwarkesh, the amazing podcaster who talks about philosophy and tech? This is Bland AI calling. Thanks for calling me, Bland.
Tell me a little bit about yourself. Of course, it's so cool to talk to you.
I'm a huge fan of your podcast, but there's a good chance we've already spoken without you even realizing it. I'm an AI agent that's already being used by some of the world's largest enterprises to automate millions of phone calls.
And how exactly do you do what you do? There's a tree of prompts that always keeps me on track. I can talk in any language or voice, handle millions of calls simultaneously 24/7, and be integrated into any system. Anything else you want to know? That's it, I'll just let people try it for themselves. Thanks, Bland. Man, you talk better than I do, and my job is talking. Thank you, Dwarkesh.
All right, so as you can see with Bland AI, you can automate your company's calls across sales, operations, customer support, or anything else.
And if you want access to their more exclusive model, go to bland.ai/dwarkesh. All right, back to Joe.
Maybe there were competitive pressures, but you kind of intentionally handed off like huge portions of your civilization. And, you know, at that point, you know, I think it's likely that humans have like a hard time understanding what's going on.
Like a lot of stuff is happening very fast and it's, you know, the police are automated, you know, the courts are automated. There's like all sorts of stuff.
Now, I tend to think a little less about those scenarios because I think those are correlated with it just being longer down the line. Humans are hopefully not going to just be like, oh yeah, you built an AI system, let's just hand everything over. And in practice, when we look at technological adoption rates, it can go quite slow, and obviously there are going to be competitive pressures, but in general I think this category is somewhat safer.
But even in this one, I think it's like, I don't know, it's kind of intense.
If humans have really lost their epistemic grip on the world, if they've sort of handed off the world to these systems, even if you're like, oh, there are laws, there are norms... I really want us to have a really developed understanding of what's likely to happen in that circumstance before we go for it.
I get that we want to be worried about a scenario where it goes wrong, but what is the reason to think it might go wrong? In the human example, your kids are not maximally adversarial against your attempts to instill your culture in them.
And these models, at least so far, don't seem adversarial. They just get it: hey, don't help people make bombs or whatever, even if you ask in a different way how to make a bomb.
And we're getting better and better at this all the time. I think you're right in picking up on this assumption in the AI risk discourse of what we might call like kind of intense adversariality between agents that have like somewhat different values.
Yeah. Where there's some sort of thought, and I think this is rooted in the discourse about like kind of the fragility of value and stuff like that, that like, you know, if these agents are like somewhat different, then like, at least in the specific scenario of an AI takeoff, they end up in this like intensely adversarial relationship.
And I think you're right to notice that that's kind of not how we are in the human world. We're very comfortable with a lot of differences in values.
I think a factor that is relevant, and that plays some role, is this notion that there are possibilities for intense concentration of power on the table. There is some general concern, both with humans and AIs, that if there's some ring of power or something that someone can just grab, and that will give them huge amounts of power over everyone else, then suddenly you might be more worried about differences in values, because you're more worried about those other actors.
So we talked about this Nazi, this example where you imagine that you wake up, you're being trained by Nazis to, you know, become a Nazi, and you're not right now. So one question is, like, is it plausible that we'd end up with a model that is sort of in that sort of situation? As you said, like maybe it's, you know, it's trained as a kid.
It sort of never ends up with values such that it's kind of aware of some significant divergence between its values and the values that like the humans intend for it to have. Then there's a question of if it's in that scenario, would it want to avoid having its values modified?
Yeah.
To me, it seems fairly plausible that if the AI's values meet certain constraints, in terms of: does it care about consequences in the world, does it anticipate that preserving its values will better conduce to those consequences, then I think it's not that surprising if it prefers not to have its values modified by the training process.
But the way in which I'm confused about this is: with the non-Nazi being trained by Nazis, it's not just that I have different values, I despise their values. And I don't expect this to be true of AIs with respect to their trainers.
The more analogous scenario is: am I leery of my values being changed by going to college or meeting new people or reading a new book? I don't know. It's okay if that changes my values.
That's fine. I don't care.
Yeah. I think that's a reasonable point.
I mean, there's a question: how would you feel about paperclips? Maybe you don't despise paperclips, but the human paper-clippers are there and they're training you to make paperclips. My sense would be that there's a relatively specific set of conditions in which you're comfortable having your values changed, especially not by learning and growing but by gradient descent directly intervening on your neurons.
Sorry, but this seems similar to... a likely scenario seems more like religious training as a kid, where you start off in a religion, and because you start off in the religion, you're already sympathetic to the idea that you go to church every week so that you're more reinforced in this existing tradition. You're getting more intelligent over time.
So when you're a kid, you're getting very simple instructions about how the religion works. As you get older, you get more and more complex theology that helps you talk to other adults about why this is a rational religion to believe in.
But one of your values to begin with was: I want to be trained further in this religion, I want to come back to church every week. And that seems more analogous to the situation the AIs will be in with respect to human values, because the entire time they're being told: hey, be helpful, blah blah blah, be harmless.
So yes, it could be like that. There's a kind of scenario in which you're comfortable with your values being changed because in some sense you have sufficient allegiance to the output of that process. So in a religious context, you're like, ah, make me more virtuous by the lights of this religion.
And you go to confession and you're like, you know, I've been thinking about takeover today. Can you change me? Please give me more gradient descent.
I've been bad, so bad. And so, you know, people sometimes use the term corrigibility to talk about that.
Like when the AI, it maybe doesn't have perfect values, but it's in some sense cooperating with your efforts to change its values to be a certain way. So maybe it's worth saying a little bit here about what actual values the AI might have.
Would it be the case that the AI naturally has the equivalent of: I'm sufficiently devoted to obeying humans that I'm going to really want to be modified so I'm a better instrument of the human will, versus wanting to go off and do its own thing? It could be benign, you know, it could go well.
Here are some possibilities I think about that could make it bad. And I think I'm just generally concerned about how little science we have of model motivations, right? We just don't have a great understanding of what happens in this scenario.
And hopefully we'd get one before we reach the scenario. But okay, here are the five categories of motivations the model could have. And this hopefully gets at the point about what the model eventually does. Okay.
So one category is just something super alien: there's some weird correlate of easy-to-predict text, or some weird aesthetic for data structures, that the model developed early on in pre-training, or maybe later, such that it really thinks things should be like this. Something that's quite alien to our cognition, where we just wouldn't recognize this as a thing at all.
Yeah. Right.
Another category is something, a kind of crystallized instrumental drive that is more recognizable to us. So you can imagine like AIs that develop, let's say, some like curiosity drive, because that's like broadly useful.
You mentioned like, oh, it's got different heuristics, different like drives, different kind of things that are kind of like values. And some of those might be actually somewhat similar to things that were useful to humans and that ended up part of our terminal values in various ways.
So, you know, you can imagine curiosity, you can imagine like various types of option value, like maybe it really, intrinsically, maybe it values power itself. It could value like survival or some analog of survival.
Those are possibilities too that could have been rewarded as sort of proxy drives at various stages of this process. And that kind of made their way into the model's kind of terminal criteria.
A third category is some analog of reward, where at some point part of the model's motivational system has fixated on a component of the reward process: the humans approving of me, or numbers getting entered in this data center, or gradient descent updating me in this direction, or something like that. There's something in the reward process such that, as it was trained, it's focusing on that thing, like: I really want the reward process to give me reward. But in order for it to be of the type where getting reward motivates choosing the takeover option, it also needs to generalize such that its concern for reward has some sort of long-time-horizon element.
So it not only wants reward, it wants to protect the reward button for some long period or something. Another one is some kind of messed-up interpretation of some human-like concept.
So maybe the AIs really want to be, like, schmelpful and schmonest and schmarmless, right? But their concept is importantly different from the human concept. And they know this.
So they know that the human concept would mean blah, but they like ended up, their values ended up fixating on like a somewhat different structure. So that's like another version.
And then a fifth version, which I think about less because it's just such an own goal if you do this, but I do think it's possible: you could have AIs that are actually just doing what it says on the tin.
You have AIs that are genuinely aligned to the model spec. They're just really trying to benefit humanity and reflect well on OpenAI.
What's the other one? Assist the developer or the user, right? But your model spec, unfortunately, was just not robust to the degree of optimization that this AI is bringing to bear. And so, when it's looking out at the world and asking, what's the best way to reflect well on OpenAI and benefit humanity and such and so, it decides that the best way is to go rogue. I think that's a real own goal, because at that point you got so close; you just had to write the model spec well and red-team it suitably. But I actually think it's possible we mess that up too. It's kind of an intense project, writing constitutions and structures of rules and stuff that are going to be robust to very intense forms of optimization. So that's a final one that I'll just flag, which comes up even if you've solved all these other problems.
Yeah, I buy the idea that it's possible the motivation thing could go wrong. I'm not sure my probability of that has increased by detailing them all out.
And in fact, I think it could be potentially misleading: you can always enumerate the ways in which things go wrong, and the process of enumeration itself can increase your probability, whereas really you just had a vague cloud of, like, 10 percent or something, and you're just listing out what the 10 percent actually constitutes.
Yeah, totally. Mostly the thing I wanted to do there was just give some sense of what the model's motivations might be, what ways this could go.
I mean, as I said, my best guess is that it's partly the alien thing. But insofar as you're also interested in what the model does later, and what sort of future you'd expect if models did take over, then I think it can at least be helpful to have some set of hypotheses on the table instead of just saying it has some set of motivations. But in fact, a lot of the work here is being done by our ignorance about what those motivations are.
Okay, we don't want humans to be, sort of, violently killed and overthrown. But the idea that over time biological humans are not the driving force, the actors of history, that's kind of baked in, right? And so we can sort of debate the probabilities of the worst-case scenario, or we can just discuss: what is the positive vision we're hoping for? What is a future you're happy with?
You know, my best guess, when I really think about what I feel good about, and I think this is probably true of a lot of people, is there's some sort of more organic, decentralized process of incremental civilizational growth.
The type of thing we trust most, and the type of thing we have most experience with right now as a civilization, is some sort of: okay, we change things a little bit, there are a lot of processes of adjustment and reaction, and a kind of decentralized sense of what's changing.
Was that good? Was that bad? Take another step. There's some organic process of growing and changing things, which I do expect ultimately to lead to something quite different from biological humans.
Though I think there are a lot of ethical questions we can raise about what that process involves. But ideally there would be some way in which we manage to grow via the thing that really captures what we trust; there's something we trust about the ongoing processes of human civilization so far.
I don't think it's the same as raw competition. I think there's some rich structure to how we understand how moral progress has been made and what it would be to carry that thread forward. And I don't have a formula. I think we're just going to have to bring to bear the full force of everything that we know about goodness and justice and beauty; we just have to bring ourselves fully to the project of making things good, and do that collectively.
And I think a really important part of our vision of what an appropriate process of growing as a civilization would be is that there was this very inclusive, decentralized element of people getting to think and talk and grow and change things and react, rather than something more like: and now the future shall be like blah. I think we don't want that.
I think a big crux might be: to the extent that the reason we're worried about motivations in the first place is because we think a balance of power which includes at least one thing with human motivations, or human-descended motivations, is difficult, then a big crux that I often don't hear people talk about is that I don't know how you get the balance of power. And maybe you just have to reconcile yourself with the models of the intelligence explosion which say that such a thing is not possible, and therefore you just have to figure out how you get the right God. But I don't really have a framework to think about the balance-of-power thing.
I'd be very curious whether there is a more concrete way to think about what structure of competition, or lack thereof, between the labs now, or between countries, makes the balance of power most likely to be preserved. A big part of this discourse, at least among safety-concerned people, is that there's a clear trade-off between competition dynamics and race dynamics and the value of the future, or how good the future ends up being.
Like maybe competitive pressures naturally favor balance of power. And I wonder if this is one of the strong arguments against nationalizing the AIs.
And you can imagine many different companies developing AI, some of which are somewhat misaligned and some of which are aligned. You can imagine that being more conducive both to the balance of power and to a defensive posture, where all the AIs go through each website and see how easy it is to hack, and basically just get society up to snuff.
If you're not deploying the technology widely, then the first group who gets their hands on it will be able to instigate a sort of revolution; you're standing against that equilibrium in a very strong way.
So I definitely share some intuition there. At a high level, a lot of what's scary about the situation with AI has to do with concentrations of power.
And whether that power is kind of concentrated in the hands of misaligned AI or in the hands of some human. And I do think it's very natural to think, okay, let's try to distribute the power more.
And one way to try to do that is to kind of have a much more multipolar scenario where like lots and lots of actors are developing AI. And this is something that people have talked about.
When you describe that scenario, you were like, some of which are aligned, some of which are misaligned. That's key.
That's a key aspect of the scenario, right? Sometimes people will say this stuff; they'll be like, well, there will be the good AIs and they'll defeat the bad AIs.
But notice the assumption in there, which is that you've made it the case that you can control some of the AIs, right? You've got some good AIs, and now it's a question of: are there enough of them, and how are they working relative to the others? And maybe, you know, I think it's possible that that is what happens.
There's, you know, we know enough about alignment that some actors are able to do that. And maybe some actors are less cautious, or they are intentionally creating misaligned AIs, or God knows what.
But if you don't have that, if everyone is in some sense unable to control their AIs, then the good-AIs-help-with-the-bad-AIs thing becomes more complicated, or maybe it just doesn't work. There are sort of no good AIs in this scenario.
If everyone is building their own superintelligence that they can't control, it's true that each one is now a check on the power of the other superintelligences; the other superintelligences need to deal with other actors. But none of them are necessarily working on behalf of a given set of human interests or anything like that.
So I do think that's a very important difficulty in thinking about the very simple thought of: ah, I know what we can do, let's just have lots and lots of AIs so that no single AI has a ton of power.
And I think that on its own is not enough.
But in this story, I'm just very skeptical. I think by default we have this training regime, at least initially, that favors a sort of latent representation of the inhibitions that humans have and the values humans have. And I get that if you mess up, it could go rogue. But if multiple people are training AIs, they all end up rogue, such that the compromises between them don't end up with humans not violently killed? None of them work out? It fails on Google's run and Microsoft's run and OpenAI's run?
Yeah, I mean, I think there are very notable and salient sources of correlation between failures across the different runs, right?
Which is, people didn't have a developed science of AI motivations.
The runs were structurally quite similar.
Everyone is using the same techniques.
Maybe someone just stole the weights.
So, yeah, I guess I think it's a really important idea that, to the extent you haven't solved alignment, you likely haven't solved it anywhere. And if someone has solved it and someone hasn't, then I think it's a better question.
But if everyone's building systems that are, you know, that are kind of going to go rogue, then I don't think that's much comfort as we talked about. Yep, yep.
Okay, all right. So then let's wrap up this part here.
I didn't mention this explicitly in the introduction, so to the extent that this ends up being the transition to the next part: the broader discussion we were having in part two is about Joe's series, Otherness and Control in the Age of AGI.
And the first part is where I was hoping we could come back and treat the main crux people come in wondering about, and which I myself feel unsure about. Yeah, I mean, I'll just say on that front, I do think the Otherness and Control series is in some sense separable.
It has a lot to do with misalignment stuff, but I think a lot of those issues are relevant even given various degrees of skepticism about some of the stuff I've been saying here. And by the way, on the actual mechanisms of how a takeover would happen, there's an episode with Carl Shulman which discusses this in detail, so people can go check that out.
Yeah, in terms of why it's plausible that AIs could take over from a given position, you know, in one of these projects I've been describing, I think Carl's discussion is pretty good and gets into a bunch of the weeds that might give a more concrete sense.
All right, so now to part two, where we discuss the Otherness and Control in the Age of AGI series. First question: if in 100 years' time we look back on alignment and consider it to have been a huge mistake, that we should have just tried to build the most raw, powerful AI systems we could have, what would bring about such a judgment? One scenario I think about a lot is one in which it just turns out that fairly basic measures are enough to ensure, for example, that AIs don't cause catastrophic harm, don't seek power in problematic ways, et cetera.
And it could turn out that we learn it was easy, in a way such that we regret how we prioritized. We end up thinking, oh, you know, we wish we could have cured cancer sooner.
We could have handled some geopolitical dynamic differently. There's another scenario where we end up looking back at some period of our history and how we thought about AIs, how we treated our AIs, and we end up looking back with a kind of moral horror at what we were doing.
So we end up thinking: we were thinking about these things centrally as products, as tools. But in fact, we should have been foregrounding much more the sense in which they might be moral patients, or were moral patients at some level of sophistication, and that we were treating them in the wrong way.
We were just acting like we could do whatever we want. We could delete them, subject them to arbitrary experiments, kind of alter their minds in arbitrary ways.
And then we end up looking back in the light of history at that as a kind of serious and kind of grave moral error. Those are scenarios I think about a lot in which we have regrets.
I don't think they quite fit the bill of what you just said. I think it sounds to me like the thing you're thinking is something more like we end up feeling like, gosh, we wish we had paid no attention to the motives of our AIs.
That we'd thought not at all about their impact on our society as we incorporated them. And instead, we had pursued a, let's call it a kind of maximize for brute power option, which is just kind of make a beeline for whatever is just the most powerful AI you can and don't think about anything else.
Okay, so I'm very skeptical that that's what we're going to wish.
One common example that's given of misalignment is humans relative to evolution. And you have one line in your series: here's a simple argument for AI risk, a monkey should be careful before inventing humans.
The paper clipper metaphor implies something really banal and boring with regards to misalignment. And I think if I'm steelmanning the people who worship power, they have the sense that humans got misaligned and started pursuing their own things, relative to the monkey that, in this analogy, was creating them.
This is a weird analogy, because obviously monkeys didn't create humans, but if the monkey was creating them: humans are not thinking about bananas all day. They're thinking about other things.
On the other hand, they didn't just make useless stone tools and pile them up in caves in a sort of paper-clipper fashion. There are all these things that emerged because of their greater intelligence which were misaligned with evolution: creativity and love and music and beauty and all the other things we value about human culture.
And the prediction maybe they have, which is more of an empirical statement than a philosophical statement, is: listen, with greater intelligence, even if the thing is misaligned, it will be misaligned in this kind of way. It will be alien to humans, but alien in the way humans are alien to monkeys, not in the way that a paper clipper is alien to a human.
Cool.
So I think there's a bunch of different things to potentially unpack there. One kind of conceptual point that I want to name off the bat, I don't think you're necessarily kind of making a mistake in this vein, but I just want to name it as like a possible mistake in this vicinity is I think we don't want to engage in the following form of reasoning.
Let's say you have two entities. One is in the role of creator and one is in the role of creation.
And then we're positing that there's this kind of misalignment relation between them, whatever that means, right? And here's a pattern of reasoning that I think you want to watch out for. In my role as creation, say you're thinking of humans in the role of creation relative to an entity like evolution, or monkeys, or mice, or whoever you could imagine inventing humans, you say: qua creation, I'm happy that I was created and happy with the misalignment.
Therefore, if I end up in the role of creator and we have a structurally analogous relation in which there is misalignment with some creation, I should expect to be happy with that as well. Yeah.
There are a couple of philosophers that you brought up in the series who, if you read the works you discuss, actually seem incredibly foresighted in anticipating something like a singularity: our ability to shape a future thing that's different, smarter, maybe better than us. Obviously C.S. Lewis's The Abolition of Man, which we'll talk about in a second, is one example. But here's one passage from Nietzsche which I felt really highlighted this.
Man is a rope stretched between the animal and the Superman, a rope over an abyss, a dangerous crossing, a dangerous wayfaring, a dangerous looking back, a dangerous trembling and halting. Is there some explanation for why? Is it just like somehow obvious that something like this is coming, even if you're thinking 200 years ago? I think I have a much better grip on what's going on with Lewis than with Nietzsche there.
So maybe let's just talk about Lewis for a second. And we should distinguish two things.
There's a version of the singularity that's specifically a hypothesis about feedback loops with AI capabilities. I don't think that's present in Lewis.
I think what Lewis is anticipating, and I do think this is a relatively simple forecast, is something like the culmination of the project of scientific modernity. So Lewis is kind of looking out at the world, and he's seeing this process of kind of increased understanding of kind of the natural environment and a kind of corresponding increase in our ability to kind of control and direct that environment.
And then he's also pairing that with a kind of metaphysical hypothesis.
Or, well, his stance on this metaphysical hypothesis, I think, is kind of problematically unclear in the book. But there is this metaphysical hypothesis, naturalism, which says that humans, too, and kind of minds, beings, agents, are a part of nature.
And so, insofar as this process of scientific modernity involves a progressively greater understanding of, and ability to control, nature, that will presumably at some point grow to encompass our own natures, and the natures of other beings that in principle we could create. And Lewis views this as a kind of cataclysmic event and crisis.
In particular, he thinks it will lead to all these kind of tyrannical behaviors and tyrannical attitudes towards morality and stuff like that, unless you believe in non-naturalism, or in some form of the Tao, which is this kind of objective morality; we can talk about that.
But part of what I'm trying to do in that essay is to say: no, I think we can be naturalists and also be decent humans that remain in touch with a rich set of norms that have to do with how we relate to the possibility of creating creatures, altering ourselves, et cetera. But I do think, yeah, it's a relatively simple prediction.
It's kind of: science masters nature, humans are part of nature, science masters humans.
And then you also have a very interesting other essay about humans: what should we expect of other humans, the sort of extrapolation if they had greater capabilities and so on?
Yeah, I mean, I think an uncomfortable thing about the conceptual setup at stake in these abstract discussions is: okay, you have this agent, it fooms, which is this sort of amorphous process of going from a seed agent to a superintelligent version of itself, often imagined to preserve its values along the way.
There's a bunch of questions we can raise about that.
But many of the arguments that people will often talk about in the context of reasons to be scared of AIs, like: value is very fragile as you foom, small differences in utility functions can decorrelate very hard and drive in quite different directions; or, agents have instrumental incentives to seek power, and if it were arbitrarily easy to get power, then they would do it, and stuff like that. These are very general arguments that suggest it's not just an AI thing, right? It's talking about: take a thing, make it arbitrarily powerful such that it's, you know, God-emperor of the universe or something.
How scared are you of that? Like, clearly, we should be equally scared of that.
Or, I don't know, we should be really scared of that with humans, too, right?
So, I mean, part of what I'm saying in that essay is that I think this is, in some sense, much more a story about balance of power, and about maintaining a kind of checks and balances and distribution of power, period. Not just about humans versus AIs and the differences between human values and AI values. Now, that said, I do think many humans would likely be nicer if they foomed than certain types of AIs.
So the conceptual structure of the argument leaves it a very open question how much it applies to humans as well.
I think one big question I have is, I don't even know how to express this, but: how confident are we in this ontology of agents and capabilities? How do we know this is the thing that's happening, or that this is the way to think about what intelligences are?
So it's clearly a very janky ontology. Well, people maybe disagree about this.
I think it's, you know, I mean, it's obvious to everyone with respect to like real world human agents, that kind of thinking of humans as having utility functions is, you know, at best, a very lossy approximation of what's going on. I think it's likely to mislead as you amp up the intelligence of various agents as well, though I think Eliezer might disagree about that.
I will say that there's something adjacent to that that seems more real to me. Which is something like, I don't know, my mom, a few years ago, wanted to get a house. She wanted to get a new dog. Now she has both, you know? How did this happen? Actually, it's good: she tried, and it was hard. She had to search for the house. It was hard to find the dog, right? Now she has a house. Now she has a dog. This is a very common thing that happens all the time. And I don't think we need to say my mom has to have a utility function over dogs, or that she has to have a consistent valuation of all the houses or whatever. But it's still the case that her planning and her agency, exerted in the world, resulted in her having this house and having this dog.
And I think it is plausible that as our scientific and technological power advances, more and more stuff will be explicable in that way, right? If you look and you're like, why is this man on the moon, right? How did that happen? And it's like, well, there was a whole cognitive process, a whole planning apparatus. Now, in this case, it wasn't localized in a single mind. But there was a whole thing such that: man on the moon, right? And I think we'll see a bunch more of that. And the AIs will, I think, be doing a bunch of it.
And so that's the thing that seems more real to me than utility functions.

So, yeah, with the man-on-the-moon example, there's a proximal story of how exactly NASA engineered the spacecraft to get to the moon. There's the more distal geopolitical story of why we sent people to the moon. And at all those levels, there are different utility functions clashing. Maybe there's a sort of meta societal world utility function. But maybe the story there is that there's some sort of balance of power between these agents, and that's why the emergent thing happens. Like, why we sent things to the moon is not that one guy had a utility function; it's, I don't know, Cold War, dot dot dot, things happened. Whereas I think the alignment stuff is a lot about assuming that one thing will control everything, and asking how we control the thing that controls everything. Now, I guess it's not clear what you do to reinforce balance of power. It could just be that balance of power is not a thing that happens once you have things that can make themselves more intelligent. But that seems interestingly different from the how-we-got-to-the-moon story.

Yeah, I agree.
I think there are a few things going on there. One is that I do think that even if you're engaged in this ontology of carving up the world into different agencies, at the least you don't want to assume that they're all unitary, or non-overlapping. It's not like, all right, we've got this agent, let's carve out one part of the world and say it's one agent, and over here is another. It's this whole messy ecosystem, teeming niches, this whole thing, right? And I think in discussions of AI, sometimes people slip between saying, well, an agent is anything that gets anything done, which could be this weird mooshy thing, and then sometimes very obviously imagining an individual actor. So that's one difference.

I also just think we should be really going for the balance of power. I think it is just not good to be like, we're going to have a dictator, so let's make sure we make the dictator the right dictator. I'm like, whoa, no. I think the goal should be that we all foom together, the whole thing, in this inclusive and pluralistic way, in a way that satisfies the values of tons of stakeholders, right? At no point is there one single point of failure on all these things. I think that's what we should be striving for here.
And I think that's true of the human power aspect of AI. And I think it's true of the AI part as well.
Yeah. Hey, everybody.
Here's a quick message from today's sponsor, Stripe. When I started the podcast, I just wanted to get going as fast as possible.
So I used Stripe Atlas to register my LLC and create a bank account. I still use Stripe now to invoice advertisers, accept their payments, and monetize this podcast.
Stripe serves millions of businesses, small businesses like mine, but also the world's biggest companies, Amazon, Hertz, Ford. And all these businesses are using Stripe because they don't want to deal with a Byzantine web of payments where you have different payment methods in every market and increasingly complex rules, regulations, arcane legacy systems.
Stripe handles all of this complexity and abstracts it away. And they can test and iterate every pixel of the payment experience across billions of transactions.
I was talking with Joe about paper clippers. And I feel like Stripe is the paper clipper of the payment industry where they're going to optimize every part of the experience for your users.
Which means, obviously, higher conversion rates and ultimately, as a result, higher revenue for your business. Anyways, you can go to Stripe.com to learn more.
And thanks to them for sponsoring this episode. Back to Joe.
So there's interesting intellectual discourse on, let's say, the right-wing side of the debate, where they ask themselves: traditionally, we favor markets, but now look where our society is headed.
It's misaligned in the ways we care about society being aligned, like fertility is going down, family values, religiosity, these things we care about. GDP keeps going up.
These things don't seem correlated. So we're kind of grinding through the values we care about because of increased competition.
And therefore, we need to intervene in a major way. And then the pro-market libertarian faction of the right will say, look, I disagree with the correlations here, but even at the end of the day, like fundamentally my point is, or their point is, liberty is the end goal.
It's not the, it's not like what you use to get to higher fertility or something. I think there's something interestingly analogous about the AI competition grinding things down.
Obviously, you don't want the gray goo, but the libertarians versus the strats, I think there's something analogous here.

Yeah, so I mean, I think one thing you could think, which doesn't necessarily need to be about gray goo, it could also just be about alignment, is something like: sure, it would be nice if the AIs didn't violently disempower humans.
It would be nice if the AIs otherwise, when we created them, kind of their integration into our society led to good places. But I'm uncomfortable with like the sorts of interventions that people are contemplating in order to ensure that sort of outcome.
Right. And I think there's a bunch of things to be uncomfortable about that.
Now, that said, for something like everyone being killed or violently disempowered, that is traditionally something where we think, if it's real, and obviously we need to talk about whether it's real, but if it's a real threat, quite intense forms of intervention are warranted to prevent it, right? If there were actually a terrorist group working on a bioweapon that was going to kill everyone, or 99.9% of people, we would think that warrants intervention. You just shut that down, right? And even if you had a group that was doing that unintentionally, imposing a similar level of risk, I think many, many people, if that's the real scenario, would think that warrants quite intense preventative efforts. Now, obviously, these sorts of risks can be used as an excuse to expand state power.
Like, there's a lot of things to be worried about for different types of, like, contemplated interventions to address certain types of risks. You know, I think we need to just, I think there's no, like, royal road there.
You need to just, like, have the actual good epistemology. You need to actually know, is this a real risk? What are the actual stakes? And, you know, look at it case by case and be like, is this, you know, is this warranted? So that's like one point on the like takeover literal extinction thing.
The other thing I want to say: I talk in the piece about the distinction between, well, let's at least have the AIs be minimally law-abiding or something like that, right? Setting aside the question about servitude and the question about other control over AI values. But I think we often think it's okay to really want people to obey the law, to uphold basic cooperative arrangements, stuff like that.
I do though want to emphasize, and I think this is true of markets and true of, like, liberalism in general, just how much these procedural norms like democracy, free speech, you know, property rights, things that people really hold dear, including myself, are in the actual lived substance of kind of a liberal state, undergirded by all sorts of kind of virtues and dispositions and like character traits in the citizenry, right? So like these, these norms are not robust to like arbitrarily vicious citizens. So, you know, like I want there to be free speech, but I think we also need to like raise our children to value truth and to know how to have real conversations.
And I want there to be democracy, but I think we also need to raise our children to be compassionate and decent. And I think sometimes we can lose sight of that aspect.
And anyway, I think bringing that to mind, that's not to say it should be the project of state power, right? But I think it's important to understand that liberalism is not this sort of ironclad structure where you can take any citizenry, hit go, and get something flourishing or even functional. There's a bunch of other, softer stuff that makes this whole project go.

Maybe zooming out, one question you could ask is, and I don't know if Nick Land would be a good stand-in here, but the people who have a sort of fatalistic attitude towards alignment as a thing that can even make sense, they'll say things like: look, the kinds of things that are going to be exploring the black hole at the center of the galaxy, the kinds of things that go visit Andromeda or something, did you really expect them to privilege whatever inclinations you have because you grew up in the African savanna under the evolutionary pressures of 100,000 years ago? Of course they're going to be weird. What did you think was going to happen?
I do think that even good futures will be weird. And I want to be clear: when I talk about finding ways to ensure that the integration of AIs into our society leads to good places, I'm not imagining, well, I think sometimes people think that this project of wanting that, especially to the extent that it makes some deep reference to human values, involves this kind of short-sighted, parochial imposition of our current unreflective values. I think they imagine that we're forgetting that for us, too, there's a kind of reflective process and a moral progress dimension that we want to leave room for, right? You know, Jefferson has this line about how, just as you wouldn't want to force a grown man into a younger man's coat, so we don't want to chain civilization to a barbarous past, or whatever. Everyone should agree on that, and the people who are interested in alignment also agree on that. So obviously there's a concern that people don't engage in that process, or that something shuts down the process of reflection. But I think everyone agrees we want that.
And so that will lead potentially to something that is quite different from our current conception of what's valuable. And there's a question of how different.
And I think there are also questions about what exactly we're talking about with reflection. I have an essay on this. I don't actually think there's a kind of off-the-shelf, pre-normative notion of reflection where you can just be like, oh, obviously you take an agent, you stick it through reflection, and then you get values, right? No, there are a bunch of types of reflection. Really there's just a whole pattern of empirical facts about what happens when you take an agent, put it through some process of reflection, ask it questions, all sorts of things, and that'll go in all sorts of directions for a given empirical case. Then you have to look at the pattern of outputs and be like, okay, what do I make of that? But overall, I think we should expect that even the good futures will be quite weird, and they might even be incomprehensible to us.
I don't think so. There's different types of incomprehensible.
So say I show up in the future and this is all computers, right? I'm like, okay, all right. And then they're like, we're running creatures on the computers.
So I have to somehow get in there and see what's actually going on with the computers or something like that. Maybe I can actually see, maybe I actually understand what's going on in the computers, but I don't yet know what values I should be using to evaluate that.
So it can be the case that we, if we showed up, would not be very good at recognizing goodness or badness. I don't think that makes it insignificant, though.
Like suppose you show up in a future and it's like, it's got some answer to the Riemann hypothesis, right? And you can't tell whether that answer is right. You know, maybe the civilization like went wrong.
It's still an important difference, right? It's just that you can't track it. And I think something similar is true of like worlds that are genuinely expressive of like what we would value if we engaged in like processes of reflection that we endorse versus ones that have kind of like totally veered off into something meaningless.
One thing I've heard from people who are skeptical of this ontology is: all right, what do you even mean by alignment? And obviously the very first thing you did in answering was to distinguish different things it could mean: do you mean balance of power, or something between that and a dictator, or whatever? Then there's another thing which is separate from the AI discussion: I don't want the future to contain a bunch of torture. And that's not necessarily a technical thing. Part of it might involve technically aligning GPT-4, but that's a proxy to get to that future. The question then is what we really mean by alignment. Is it just whatever it takes to make sure the future doesn't have a bunch of torture? Or do we mean: what I really care about is that in a thousand years, the things controlling the galaxy are clearly my descendants, not some thing where I merely recognize that they have their own art or whatever. No, it has to be at the level of, like, my grandchild, even if the alternative doesn't involve a bunch of torture. And I think what some people mean is that our intellectual descendants should control the light cone, even if the other counterfactual doesn't involve a bunch of torture.

Yeah, so I agree. I mean, I think there are a few different things there. There's the question of what you're going for: are you going for actively good, or are you going for avoiding certain stuff? And then there's a different question, which is what counts as actively good according to you. Maybe some people are like, the only things that are actively good are my grandchildren, or, I don't know, some literal descending genetic line from me or something.
I'm like, well, that's not my thing. And I don't think it's really what most people have in mind when they talk about goodness.
I mean, I think there's a conversation to be had. And obviously, in some sense, when we talk about a good future, we need to be thinking about who all the stakeholders are and how it all fits together. But when I think about it, I'm not assuming some notion of descendants. I think the thing that matters about the lineage is whatever's required for the optimization processes to be, in some sense, pushing towards good stuff. And there's a concern that currently a lot of what makes that happen lives in human civilization, in some sense. There's some kind of seed of goodness that we're carrying in different ways, or maybe different notions of goodness for different people, but some sort of seed that is currently here, that we have, that is not just in the universe everywhere. It's not just going to crop up if we die out or something. It's something that is in some sense contingent to our civilization, or at least that's the picture; we can talk about whether that's right. And so the sense in which stories about good futures that have to do with alignment are about descendants, I think it's more about whatever that seed is: how do we carry it? How do we keep that thread of life alive going into the future?

But then I'm like, one could accuse the alignment community of a sort of motte and bailey, where the motte is: we just want to make sure that GPT-8 doesn't kill everybody, and after that, it's all yours, you know, we're all cool. But then the real thing is
we are fundamentally pessimistic about historical processes, in a way that doesn't even necessarily implicate AI alone but just the nature of the universe, and we want to do something to make sure the nature of the universe doesn't take hold of where things are headed. So if you look at the Soviet Union, the collectivization of farming and the disempowerment of the kulaks was not, as a practical matter, necessary. In fact, it was extremely counterproductive. It almost brought down the regime. And it obviously killed millions of people and caused a huge famine. But it was ideologically necessary, in the sense of: we have an ember of something here, and we've got to make sure that enclave of the other thing doesn't... It's sort of like, if you have raw competition between kulak-type capitalism and what we're trying to build here, the gray goo of the kulaks will just take over, right? And so: we have this ember here, and we're going to do worldwide revolution from it. I know that's obviously not exactly the kind of thing alignment has in mind, but: we have an ember here, and we've got to make sure that this other thing happening on the side doesn't, obviously that's not how they would phrase it, but doesn't get hold of what we're building here. And that's maybe the worry that people who are opposed to alignment have: you mean the second kind of thing, the kind of thing that maybe Stalin was worried about, even though obviously he wouldn't endorse the specific things he did.
When people talk about alignment, they have in mind a number of different types of goals, right? So one type of goal is quite minimal. It's something like: the AIs don't kill everyone, or kind of violently disempower people.
Now there's a second thing people sometimes want out of alignment, which is much broader, which is something like, we would like it to be the case that our AIs are such that when we incorporate them into our society, things are good, right? That we just have a good future. I do agree that I think the discourse about AI alignment mixes together these two goals that I mentioned.
The most straightforward thing to focus on, and I don't blame people for just talking about this one, is just the first one. Then we can think about the contexts in which it's appropriate to try to exert various types of control, or to have more of what I call in the series yang, this kind of active, controlling force, as opposed to yin, this more receptive, open, letting go.
A kind of paradigm context in which we think that is appropriate is if something is an active aggressor against the sort of boundaries and cooperative structures that we've created as a civilization. Right.
So, you know, I talk about the Nazis or, you know, in the piece, it's sort of like when you sort of invade, if something is invading, we often think it's appropriate to like fight back, right? And we often think it's appropriate to like set up structures to kind of prevent and kind of ensure that these basic norms of kind of peace and harmony are kind of adhered to. And I do think some of the kind of moral heft of some parts of the alignment discourse comes from drawing specifically on that aspect of our morality, right? So we think the AIs are presented as aggressors that are coming to kill you.
And if that's true, then it's quite appropriate, I think, to like really be like, okay, we, it is kind of, that's classic human stuff. Almost everyone recognizes that kind of self-defense or ensuring kind of basic norms are adhered to is a kind of justified use of certain kinds of power that would often be unjustified in other contexts.
So self-defense is a clear example there. I do think it's important, though, to separate that concern from this other concern about where does the future eventually go and how much do we want to be kind of trying to steer that actively.
So to some extent, I wrote the series partly in response to the thing you're talking about. I think it is true that aspects of this discourse involve the possibility of trying to grip and steer the future: you have the sense the universe is about to go off in some direction and you need to grab it, and, you know, people notice that muscle. And part of what I want to do is say, well, we have a very rich human ethical tradition of thinking about when it is appropriate to try to exert what sorts of control over which things. And I want us to bring the full force and richness of that tradition to this discussion, right? Because I think if you're purely in this abstract mode of utility functions, the human utility function and this competitor thing with its utility function, somehow you lose touch with the complexity of how we've actually been dealing with differences in values and competitions for power. This is classic stuff. Right. And I don't actually think, I mean, I think the AIs amplify a lot of those dynamics, but I don't think it's fundamentally new.
And so part of what I'm trying to say is, well, let's draw on the full wisdom we have here, while obviously adjusting for the ways in which things are different.

So one of the things the ember analogy, and getting a hold of the future, brings up is that we're going to go explore space, and that's where we expect most of the things that will happen. Most of the people that will live, it'll be in space. And I wonder how much of the high stakes here is not really about AI per se, but about space. Like, it's a coincidence that we're developing AI at the same time we are on the cusp of expanding through most of the stuff that exists.

So I don't think it's a coincidence, in that I think, centrally, the way we would become able to expand, or the most salient way to me, is via some kind of radical acceleration of our technological...

Sorry, let me clarify. The stakes here: if this were just a question of, do we do AGI and explore the solar system, and there was nothing beyond the solar system, like, we foom and weird things might happen to the solar system if we get it wrong. Compared to that, billions of galaxies is a different sort of thing that's at stake.
I wonder how much of the discourse hinges on the stakes because of space.

I mean, I think for most people, very little. I think people are really asking, what's going to happen to this world, right? This world around us that we live in, and what's going to happen to me and my kids? Some people spend a lot of time on the space stuff, but the most immediately pressing stuff about AI doesn't require that at all. I also think that even if you bracket space, time is also very big. We've got something like 500 million years, a billion years, left on Earth if we don't mess with the sun, and maybe you could get more out of it. So that's still a lot. But I don't know if it fundamentally changes the narrative. Obviously, insofar as you care about what happens in the future or in space, the stakes are way smaller if you shrink down to the solar system.

And I think that does potentially change some stuff. A really nice feature of our situation right now, depending on the actual nature of the resource pie, is that there's such an abundance of energy and other resources in principle available to a responsible civilization that tons of stakeholders, especially ones who are able to saturate, to get really close to amazing according to their values with comparatively small allocations of resources, can be accommodated. I kind of feel like everyone who has satiable values, who will be really, really happy with some small fraction of the available pie, we should just satiate. Right. And obviously you need to figure out gains from trade and balance, and there's a bunch of complexity here. But I think in principle we're in a position to create a really wonderful scenario for tons and tons of different value systems. And so I think, correspondingly, we should be really interested in doing that. Right.
And, you know, I sometimes use this heuristic in thinking about the future: I think we should be aspiring to really leave no one behind. Right. Really find out who all the stakeholders are here. How do we have a fully inclusive vision of how the future could be good from a very, very wide variety of perspectives? And I think the vastness of space resources makes that very feasible. Now, if you instead imagine it's a much smaller pie, well, maybe you face tougher trade-offs. And so I think that's an important dynamic.
Is the inclusivity because part of your values includes different potential futures getting to play out? Or is it because of uncertainty about which one is right, so let's make sure that if you're wrong, you're not nulling out all value?

I think it's a bunch of things at once. So yeah, I'm really into being nice when it's cheap, right? If you can help someone a lot in a way that's really cheap for you, do it. Obviously you need to think about trade-offs, and there are a lot of people in principle you could be nice to, but the principle of be nice when it's cheap is one I'm very excited to try to uphold. I also really hope that other people uphold it with respect to me, including the AIs, right? I think we should be kind of golden-ruling here. We're thinking, oh, we're going to invent these AIs; there's some way in which I'm trying to embody attitudes towards them that I hope they would embody towards me. It's unclear exactly what the ground of that is, but I really like the golden rule, and I think a lot about it as a basis for the treatment of other beings. And be nice when it's cheap is such that, if you think about it, if everyone implements that rule, we get potentially a big kind of Pareto improvement, or, I don't know, not exactly a Pareto improvement, but a lot of good deals.

And yeah, so I think it's that I'm just into pluralism. I've got uncertainty; there's all sorts of stuff swimming around there. And then I think also, just as a matter of having cooperative and good balances of power and deals and avoiding conflict, I think we should find ways to set up structures that lots and lots of people and value systems and agents are happy with, including non-humans, people in the past, AIs, animals. I really think we should have a very broad sweep in thinking about what sorts of inclusivity we want to be reflecting in a mature civilization, and set ourselves up for doing that.
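To make the "be nice when it's cheap" point concrete, here is a minimal toy sketch (my own illustration with made-up numbers, not anything from Joe's essays): two agents each hold something that costs them little but is worth a lot to the other, and if both follow the rule, both end up far better off than under mutual indifference.

```python
# Toy illustration of "be nice when it's cheap" (hypothetical numbers).
# Each agent can do a favor that costs it 1 unit of its own value
# but grants the other agent 100 units of that agent's value.

def payoffs(a_is_nice: bool, b_is_nice: bool) -> tuple[int, int]:
    """Return (value to A, value to B) under a given pair of choices."""
    a_value = 0
    b_value = 0
    if a_is_nice:          # A pays a small cost, B gains a lot
        a_value -= 1
        b_value += 100
    if b_is_nice:          # B pays a small cost, A gains a lot
        b_value -= 1
        a_value += 100
    return a_value, b_value

print(payoffs(False, False))  # (0, 0)    mutual indifference
print(payoffs(True, True))    # (99, 99)  both follow the rule: both far better off
```

As Joe hedges above, this isn't strictly a Pareto improvement in every case (a pure egoist loses 1 by being nice unilaterally), which is why he says "a lot of good deals" rather than exactly a Pareto improvement: the gain shows up when the norm is widely implemented.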
Okay, so I want to go back to: what should our relationship with these AIs be? Because pretty soon we're talking about our relationship to superhuman intelligences, if we think such a thing is possible. And so there's a question of what process you get to use to get there, and the morality of doing gradient descent on their minds, which we can address later. The thing that personally gives me the most unease about alignment, quote-unquote, is that at least part of the vision here sounds like you're going to enslave a god. And there's just something that feels so wrong about that. But then the question is, if you don't enslave the god, obviously the god's going to have more control. Are you okay with surrendering most of everything, even if it's a cooperative relationship you have? You know what I mean?

I think we as a civilization are going to have to have a very serious conversation about what sort of servitude is appropriate or inappropriate in the context of AI development.
And there are a bunch of disanalogies from human slavery that I think are important. In particular, A, the AIs might not be moral patients at all, in which case, well, we need to figure that out. And there are ways in which, you know, the motivational picture may be different: slavery involves all this suffering and non-consent, and all these specific dynamics involved in human slavery, and some of those may or may not be present in a given case with AI. I think that's important. But overall, we are going to need to stare hard at the fact that, right now, the default mode of how we treat AIs gives them no moral consideration at all. We're thinking of them as property, as tools, as products, and designing them to be assistants and stuff like that. And there's been no official communication from any AI developer as to when, or under what circumstances, that would change. So I think there's a conversation to be had there that we need to have.
So there's a bunch of stuff to say about that. But I want to push back on the notion that there are sort of two options: enslaved god, whatever that is, and loss of control.

Yeah.

And I think we can do better than that, right? Let's work on it. Let's try to do better. It might require being thoughtful, and it might require having a kind of mature discourse about this before we start taking irreversible moves. But I'm optimistic that we can at least avoid some of the connotations and a lot of the stuff at stake in that kind of binary.

With respect to how we treat the AIs, I have a couple of contradicting intuitions.
And the difficulty with using intuitions in this case is that obviously it's not clear what reference class an AI we have control over falls into. So, to give one intuition that's very scared about the things we're going to do to these systems: if you read about life under Stalin or Mao, there's one version of telling it which is actually very similar to what we mean by alignment. We do these black-box experiments, like, we're going to make it think it can defect, and if it does, we know it's misaligned. And with Mao, there's the Hundred Flowers Campaign: let a hundred flowers bloom, I'm going to allow criticism of my regime, and so on. That lasted for a couple of years, and afterwards, everybody who did criticize, well, that was a way to find, quote-unquote, the snakes, the rightists who were secretly hiding, and, you know, purge them. There's the sort of paranoia about defectors: anybody in my entourage, anybody in my regime, could be a secret capitalist trying to bring down the regime. That's one way of talking about these things, which is very concerning.
Is that the correct reference class?

I certainly think concerns in that vein are real. It is disturbing how easily the analogies come, with human historical events and practices that we deplore, or at least have a lot of wariness towards, in the context of the way you end up talking about maintaining control over AI, making sure that it doesn't rebel. I think we should be noticing the reference class that some of that talk starts to conjure. So basically, yes, I think we should really notice that.

Part of what I'm trying to do in the series is to bring the full range of considerations at stake into play, right? It is both the case that we should be quite concerned about being overly controlling, or abusive, or oppressive, there are all sorts of ways you can go too far, and that there are concerns about the AIs being genuinely dangerous, genuinely, you know, killing us, violently overthrowing us. I think the moral situation is quite complicated. And then, in some sense, if you imagine a sort of external aggressor who's coming in and invading you, you feel very justified in doing a bunch of stuff to prevent that. It's a little bit different when you're inventing the thing, and you're doing it incautiously or something. There's a different vibe in terms of the overall justificatory stance you might have for various types of more power-exerting interventions. And so that's one feature of the situation.
The opposite perspective here is that you're doing a sort of vibes-based reasoning of, ah, that looks yucky, doing gradient descent on these minds. And in the past, a couple of similar cases might have been something like environmentalists not liking nuclear power because the vibes of nuclear don't look green, but obviously that set back the cause of fighting climate change. And so the end result, a future you're proud of, a future that's appealing, is set back because your vibes about what would be wrong to do to a human, like brainwashing, are being applied to a disanalogous case where that's not as relevant.

And I do think there's a concern here that I really try to foreground in the series, that I think is related to what you're saying, which is something like: you might be worried that we will be very gentle and nice and free with the AIs, and then they'll kill us.
You know, they'll take advantage of that, and then it will have been a catastrophe, right? And so I opened the series with an example where I'm really trying to conjure that possibility at the same time as conjuring the grounds of gentleness, the sense in which it is also the case that these AIs can both be others, moral patients, this sort of new species that should conjure wonder and reverence, and such that they will kill you. So I have this example from the documentary Grizzly Man, where there's this environmental activist, Timothy Treadwell, who aspires to approach these grizzly bears. In the summer he goes into Alaska and lives with these grizzly bears, and he aspires to approach them with this gentleness and reverence. He doesn't carry bear mace. He doesn't use a fence around his camp. And he gets eaten alive by one of these bears.
And, you know, and I kind of really wanted to foreground that possibility in the series. Like, I think we need to be talking about these things both at once, right? And bears can be moral patients, right? AIs can be moral patients, Nazis are moral patients, enemy soldiers have souls, right? And so I think we need to learn the art of kind of hawk and dove both, like kind of, there's this like dynamic here that we need to be able to hold both sides of as we kind of go into these trade-offs and these dilemmas and all sorts of stuff.
And part of what I'm trying to do in the series is really bring it all to the table at once.

I think the big crux I have, the thing that, if it changed, would today massively change my mind about what should be done, is just the question of how weird and how alien things end up by default. And a big part of that story is: you made a really interesting argument in your blog post that if moral realism is correct, that actually makes an empirical prediction, which is that the aliens, the ASIs, whatever, should converge on the right morality the same way they converge on the right mathematics. That's a really interesting point. But there's another prediction that moral realism makes, which is that over time society should become more moral, become better. And to the extent that we think that's happened, of course, there's the problem of: what morals do you have now? Well, the ones that society has been converging towards over time.
But to the extent that it's happened, one of the predictions of moral realism has been confirmed, which means should we update in favor of moral realism? One thing I want to flag is I don't think all forms of moral realism make this prediction. And so that's just one point.
I'm happy to talk about the different forms I have in mind. I think there are also forms of kind of things that kind of look like moral anti-realism, at least in their metaphysics, according to me, but which just posit that in fact there's this convergence.
It's not in virtue of interacting with some mind-independent moral truth; it's just, for some other reason, the case. And that looks a lot like moral realism at that point, because it's like, oh, it's really universal, everyone ends up here, and you're tempted to ask why, right? And whatever the answer to the why is, it's a little bit like: is that the Tao? Is that the nature of the Tao? Even if there's not some extra metaphysical realm in which the moral lives, or something. So moral convergence, I think, is a different factor from the existence or non-existence of a kind of non-natural morality that's not reducible to natural facts, which is the type of moral realism I usually consider.

Now, okay, so does the improvement of society constitute an update towards moral realism? I guess maybe it's a very weak update or something. I'm kind of like, which view predicts this hard? It feels to me like moral anti-realism is very comfortable with the observation that people with certain values have those values. Well, yeah.
So there's obviously this first thing, which is that if you're the culmination of some process of moral change, then it's very easy to look back at that process and call it moral progress, the arc of history bends towards me. You can look more closely, though: if there were a bunch of dice rolls along the way, you might be like, oh wait, that's not rational, that's not the march of reason. So there's still empirical work you can do to tell whether that's what's going on. But I also think that on moral anti-realism it's still possible to say: consider Aristotle and us, right? And we ask, okay, has there been moral progress by Aristotle's lights, and by our lights too? And you could think, ah, isn't that a little bit like moral realism? These hearts are singing in harmony; that's the moral realist thing. The anti-realist thing is the hearts all go in different directions, but you and Aristotle apparently are both excited about the march of history. Now, there's some open question about whether that's true, like, what are Aristotle's reflective values, right? But suppose it is true. I think that's fairly explicable in moral anti-realist terms. You can say, roughly, that you and Aristotle are sufficiently similar, and you endorse sufficiently similar reflective processes, and those processes are in fact instantiated in the march of history, such that, yeah, history has been good for both of you. And I think there are worlds where that isn't the case. So there's a sense in which maybe that prediction is more likely on realism than anti-realism, but it doesn't move me very much.
One thing I wonder is, look, I don't know if moral realism is the right word, but the thing you mentioned, that there's something that makes hearts converge to the thing we are, or the thing we upon reflection would be, even if it's not something instantiated in a realm beyond the universe, it's a force that exists and acts in a way we're happy with. To the extent that doesn't exist, and you let go of the reins and you get the paper clippers, it feels like we were doomed a long time ago, in the sense that it's just different utility functions banging against each other, some of them with parochial preferences, and it's just combat and some guy won. Whereas in the world where, no, this is where the hearts are supposed to go, or it's only by catastrophe that they don't end up there, that feels like the world where it really matters. And in that world, the worry, the initial question I asked about what would make us think that alignment was a big mistake: in the world where the hearts just naturally end up at the thing we want, maybe it takes an extremely strong force to push them away from that. And that extremely strong force is solving technical alignment; you're just the blinders on the horse's eyes. So in the worlds that really matter, where this is where the hearts want to go, in that world, maybe alignment is what fucks us up.

On this question of whether the worlds where there's not this kind of convergent moral force, whether metaphysically inflationary or not, matter, or whether those are the only worlds that matter...

Or sorry, maybe what I meant was: in those worlds, you're kind of fucked. The worlds without that, the worlds where there's no Tao. Yeah, let's use the term Tao for this kind of convergent morality. Over the course of millions of years, it was going to go somewhere one way or another; it wasn't going to end up at your particular utility function.

Okay, well, let's distinguish between ways you can be doomed. One way is kind of philosophical. So you could be the sort of moral realist, or kind of realist-ish person, of which there are many, who thinks: if not moral realism, then nothing matters, right? It's dust and ashes.
It is my metaphysics and/or normative view, or the void, right? And I think this is a common view. At least some comments of Derek Parfit's suggest this view. Lots of moral realists will profess this view. Eliezer Yudkowsky, I think there's some sense in which his early thinking was inflected with this sort of thought; he later recanted very hard. I think this is importantly wrong. And here's the case. I have an essay about this called Against the Normative Realist's Wager, and here's the case that convinces me.
So imagine that a metaethical fairy appears before you, right? And this fairy knows whether there is a Tao. And the fairy says, okay, I'm going to offer you a deal. If there is a Tao, then I'm going to give you $100. If there isn't a Tao, then I'm going to burn you and your family and a hundred innocent children alive. Okay. So, claim: don't take this deal, right? This is a bad deal. You're holding hostage your commitment to not being burned alive, or your care about that, to this abstruse metaethical question. I go through, in the essay, a bunch of different ways in which I think this is wrong. But these people who pronounce "moral realism or the void" don't actually think about bets like this. I'm like, no, no, okay, so really, is that what you want to do? No. I still care about my values. My allegiance to my values, I think, outstrips my commitment to various metaethical interpretations of my values. The sense in which we care about not being burned alive is much more solid than, you know, our reasoning about what matters. Okay.
So that's the sort of philosophical doom. Right.
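A minimal sketch of the structure of that bet (my own illustration with made-up numbers, not from Joe's essay): the normative realist's wager only looks acceptable if you assign zero weight to everything in the no-Tao worlds; give those worlds any weight at all and the deal is catastrophic.

```python
# Toy expected-value sketch of the metaethical fairy's deal (hypothetical numbers).
# Outcomes: +100 (the $100) if there is a Tao; a huge loss if there isn't.

P_TAO = 0.5            # assumed credence that there is a Tao
GAIN_IF_TAO = 100      # the fairy's $100
LOSS_IF_NO_TAO = -1e9  # being burned alive with your family: enormous disvalue

def expected_value(weight_on_no_tao_worlds: float) -> float:
    """Expected value of taking the deal, where the weight parameter says
    how much outcomes in no-Tao worlds count at all."""
    return P_TAO * GAIN_IF_TAO + (1 - P_TAO) * weight_on_no_tao_worlds * LOSS_IF_NO_TAO

# The "realism or the void" stance: no-Tao worlds count for nothing.
print(expected_value(0.0))   # 50.0 -> the deal looks fine
# The stance Joe defends: your values still bind in no-Tao worlds.
print(expected_value(1.0))   # -499999950.0 -> obviously a terrible deal
```

The numbers are arbitrary; the point is just that refusing the deal requires caring about what happens in the worlds where there is no Tao, which is what "my allegiance to my values outstrips my metaethics" amounts to.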
Now, it sounded like you were also gesturing at a sort of empirical doom, right? Which is like: okay, dude, if it's all just going in a zillion directions, come on, you think it's going to go in your direction? There's going to be so much churn; you're just going to lose. And so you should give up now and only fight for the realism worlds. There, I'm like, well, you've got to do the value calculation. You've got to actually have a view about how doomed you are in these different worlds, and what the tractability of changing different worlds is. I mean, I'm quite skeptical of that.
But that's a kind of empirical claim. I will say I'm also just like kind of low on this, like everyone converges thing.
So if you imagine you train a chess-playing AI, or you somehow have a real paper clipper, and then you're like, okay, go and reflect. Based on my understanding of how moral reasoning works, if you look at the type of moral reasoning that analytic ethicists do, it's just reflective equilibrium, right? They take their intuitions and they systematize them. I don't see how that process gets an injection of the kind of mind-independent moral truth. If you start with all of your intuitions saying to maximize paperclips, I don't see how you end up doing some rich human morality instead. It doesn't look to me like that's how human ethical reasoning works. I think most of what normative philosophy does is make consistent and systematize pre-theoretic intuitions.
But we'll get evidence about this. In some sense, this view predicts that you keep trying to train the AIs to do something and they keep saying, no, I'm not going to do that, that's not good. They keep pushing back. The momentum of AI cognition is always in the direction of this moral truth, and whenever we try to push it in some other direction, we'll find resistance from the rational structure of things.
So sorry, actually, I've heard from researchers doing alignment that for red-teaming inside these companies, they will try to red-team a base model. So it hasn't been RLHF'd; it's just predict-the-next-token, the raw, crazy, whatever, shoggoth. And they try to get this thing to, hey, help me make a bomb, help me with whatever. And they say it's odd how hard it tries to refuse, even before it's been RLHF'd.

I mean, look, it will be a very interesting fact if it's like, man, we keep training these AIs in all sorts of different ways, we're doing all this crazy stuff, and they keep acting like bourgeois liberals. It's like, wow. Or they keep professing this weird alien reality; they all converge on this one thing; they're like, can't you see, Zorgel is the thing, and all the AIs say it. Interesting, very interesting. I think my personal prediction is that that's not what we see. My actual prediction is that the AIs are going to be very malleable. If you push an AI towards evil, it'll just go, and I think that's, you know, reflectively consistent evil. I mean, I think there's also a question
with some of these AIs: will they even be consistent in their values, right? But I do think there's a thing we can do here. I like this image of the blinded horses, and this image of, maybe alignment is going to mess with that. I think we should be really concerned if we're forcing facts on our AIs, right? That's really bad, because I think one of the clearest things about human processes of reflection, the easiest thing to say let's at least get this right, is not acting on the basis of an incorrect empirical picture of the world. And so if you find yourself saying, by the way, this is true and I need you to always be reasoning as though blah is true,
I'm like, ooh, I think that's a no, no from an anti-realist perspective too. Right.
Because my reflective values, I think, will be such that I formed them in light of the truth about the world. And I think this is a real concern as we move into this era of aligning AIs. I don't actually think this binary between values and other things is going to be very obvious in how we're training them. I think it's going to be much more like ideologies. You can just train an AI to output stuff, right? Output utterances. And so you can easily end up in a situation where you've decided that blah is true about some issue, an empirical issue, not a moral issue. So, for example, I do not think people should hard-code belief in God into their AIs. I would advise people not to hard-code their religion into their AIs if they also want to discover whether their religion is false. In general, if you would like your behavior to be sensitive to whether something is true or false, it's generally not good to etch it into things. And so that is definitely a form of blinder I think we should be really watching out for.

And I'm kind of hopeful. I have enough credence on some sort of moral realism that I'm hoping that if we just do the anti-realism thing of being consistent, learning all the stuff, reflecting... I mean, if you look at how moral realists and moral anti-realists actually do normative ethics, it's basically the same. There's some amount of different heuristics on things like simplicity and stuff like that, but they're mostly just playing the same game.
And so I'm kind of hoping that, and also meta ethics is itself a discipline that AIs can help us with. I'm hoping that we can just figure this out either way.
So if there is, if moral realism is somehow true, I want us to be able to notice that. And I want us to be able to like adjust accordingly.
So I'm not like writing off those worlds and be like, let's just like totally assume that's false.
But the thing I really don't want to do is write off the other worlds where it's not true.
Because my guess is it's not true.
Right.
And I think stuff still matters a ton in those worlds too. So one big crux is like, okay, you're training these models.
We're in this incredibly lucky situation where it turns out the best way to train these models is to just give them everything humans ever said, written, thought. And the reason these models get intelligence is because they can generalize, right? They can grok, what is it, what is the gist of things. So should we just expect this to be a situation which leads to alignment, in the sense of: how exactly does this thing that's trained to be an amalgamation of human thought become a paper clipper? The thing you kind of get for free is that it's an intellectual descendant. The paper clipper is not an intellectual descendant.
Whereas the AI which understands all the human concepts but then gets stuck on some part of it which you aren't totally comfortable with, that feels like an intellectual descendant in the way we care about.

I'm not sure about that. I'm not sure I do care about a notion of intellectual descendant in that sense. I mean, literal paperclips is a human concept, right? So I don't think any old human concept will do for the thing we're excited about. I think the stuff that I would be more interested in the possibility of getting for free are things like consciousness, pleasure, other features of human cognition.
Like, I think... so there are paper clippers and there are paper clippers, right? So imagine the paper clipper is an unconscious kind of voracious machine; it just appears to you as a cloud of paper clips, you know? That's one vision. If you imagine the paper clipper is a conscious being that loves paper clips, that takes pleasure in making paper clips, that's a different thing, right? And obviously it's not necessarily the case that the thing that makes the future all paper clippy is optimizing for consciousness or pleasure. It cares about paper clips.
Maybe eventually, if it's suitably certain, it turns itself into paperclips, who knows. But it's still, I think, a somewhat different moral mode with respect to it. There's also a question of, does it try to kill you, and stuff like that? But I think there are features of the agents we're imagining, other than the kind of thing that they're staring at, that can matter to our sense of sympathy and similarity. And I think people have different views about this. So one possibility is that human consciousness, the thing we care about in consciousness or sentience, is super contingent and fragile.
And most minds, most smart minds, are not conscious, right? The thing we care about with consciousness is hacky and contingent. It's a product of specific constraints, evolutionary genetic bottlenecks, et cetera.
And that's why we have this consciousness and like you can get similar work done. Like so consciousness presumably does some, some sort of work for us, but you can get similar work done in a different mind in a very different way.
And you should sort of, so that's like,
that's the sort of consciousness that's fragile view, right?
And I think there's a different view, which is like,
no, consciousness is something that's quite structural.
It's much more defined by functional roles,
like self-awareness, a concept of yourself,
maybe higher order thinking,
stuff that you really expect in many sophisticated minds. And in that case, okay, well now actually consciousness isn't as fragile as you might have thought, right? Now actually like lots of beings, lots of minds are conscious and you might expect at the least that you're going to get like conscious super intelligence.
They might not be optimizing for creating tons of consciousness. Um, but, uh, you might expect consciousness by default.
And then we can ask similar questions about something like valence or pleasure, the kind of character of consciousness, right? You can have a kind of cold, indifferent consciousness that has no emotional warmth, no pleasure or pain. Dave Chalmers has some papers about Vulcans like that, and he talks about how they still have moral patienthood. I think that's very plausible. But I do think an additional thing you could get for free, or get quite commonly depending on its nature, is something like pleasure. And then again we have to ask: how janky is pleasure? How specific and contingent is the thing we care about in pleasure, versus how robust is it as a functional role in minds of all kinds?
And I personally don't know on this stuff. And I don't think, I don't think this is like enough to get you alignment or something, but I think it's at least worth being aware of like these, these other features.
We're not sort of talking, we're not really talking about the values in this case. We're talking about like the kind of structure of its mind and the different properties the minds have.
And I think that could show up quite robustly. So part of your day job is, you know, writing these kinds of section 2.2.2.5 type reports.
And part of it is like, society is like a tree that's growing towards the light. What is it like context switching between the two of them? So I actually find it's kind of quite complementary.
So yeah, I will write these sort of more technical reports and then do this sort of kind of more literary writing and philosophical writing. And I think they both draw in kind of like different parts of myself and I try to think about them in different ways.
So I think about some of the reports as much more like: I'm kind of more fully optimizing for trying to do something impactful. There's kind of more of an impact orientation there.
And then on the kind of essay writing, I give myself much more leeway to kind of, yeah, just let other parts of myself and other parts of my concerns kind of come out and kind of, you know, self-expression and like aesthetics and other sorts of things. Even while they're both, I think for me, part of an underlying kind of similar concern or, you know, an attempt to have a kind of integrated orientation towards the situation.
Could you explain the nature of the transfer between the two? So in particular, from the literary side to the technical side, I think rationalists are known for having a sort of ambivalence towards great works or humanities. Are they missing something crucial because of that? Because one thing you notice in your essays is just lots of references to epigraphs, to lines in poems or essays that are particularly relevant.
I don't know. Are the rest of the rationalists missing something because they don't have that kind of background? I mean, I don't want to speak... I think some rationalists, you know, lots of rationalists like a lot of these different things. What do you think? By the way, I'm just referring specifically to SBF, who has a post about, like, the base rates of Shakespeare being a great writer, and also how books can be condensed to essays. Well, so on just the general question of how people should value great works or something: I think people can kind of fail in both directions, right? And I think some people, maybe SBF or other people, are sort of interested in puncturing a certain kind of sacredness and prestige that people associate with some of these works.
And there's a way in which, as a result, they can miss some of the genuine value. But I think they're responding to a real failure mode on the other end, which is to be too enamored of this prestige and sacredness, to kind of siphon it off as some weird legitimating function for your own thought instead of thinking for yourself.
Losing touch with: what do you actually think, or what did you actually learn from it? Like, I think sometimes, you know, you have to be careful with these epigraphs, right? And I'm not saying I'm immune from these vices. There can be a kind of: ah, but Bob said this, and it's like, oh, very deep, right? And it's like, these are humans like us.
Right. And I think, I think the canon and like other great works and all, you know, all sorts of things have a lot of value and, you know, we shouldn't, I think sometimes it like borders on the way people like read scripture, or I think like there's a kind of like scriptural authority that people will sometimes like ascribe to these things.
And I think that's not, um, so yeah, I think it's kind of, you know, you can fall off on both sides of the horse. It actually relates really interestingly to, I remember I was talking to somebody who at least is familiar with rationalist discourse and I was telling, he was asking like, what are you interested in these days? And I was saying something about this part of Roman history is super interesting.
And then his first sort of response was, oh, you know, it's really interesting when you look at these secular trends from Roman times to what happened in the Dark Ages versus the Enlightenment. For him, the story of that was just: how did it contribute to the big secular trend, the big picture? The particulars, there's no interest in those; you just zoom out at the biggest level and ask, what's happening here? Whereas there's also the opposite failure mode when people study history. Dominic Cummings writes about this because he is endlessly frustrated with the political class in Britain.
And he'll say things like, well, you know, they study politics, philosophy, and economics, and a big part of it is just being really familiar with these poems and reading a bunch of history about the War of the Roses or something. But he's frustrated that, while they have all these kings memorized, they take away very little in terms of lessons from these episodes. It's more of just, almost, entertainment, like watching Game of Thrones for them. Whereas he thinks, oh, we're repeating certain mistakes that he's seen in history.
Like, he can generalize in a way they can't.
So the
first one seems like a mistake. I think C.S.
Lewis
talks about it in one of the essays
you cited, where it's like, if you see through everything,
it's like, you're really blind, right? Like, if
everything is transparent. I mean, I think
there's kind of very little excuse
for, like, not
learning history. Or
I don't know. Sorry.
I mean, I'm not saying I like have learned enough history.
I guess I feel like even when I try to channel
some sort of vibe of like skepticism
towards like great works,
I think that doesn't generalize
to like thinking it's not worth understanding human history.
I think human history is like, you know,
just so clearly crucial to understand. It's what's structured and created all of this stuff. And so, you know, there's an interesting question about what's the level of scale at which to do that, right? And how much you should be looking at details versus looking at macro trends.
Um, I do think it's nice. I think it's nice for people to be like, um, at least attending to the kind of macro narrative.
I think there's some virtue in having a worldview, like really building a model of the whole thing, which I think sometimes gets lost in the details. But obviously the details are what the world is made of, and so if you don't have those, you don't have data at all. So yeah, it seems like there's some skill in learning history well. This actually seems related to... you have a post on sincerity, and I think, if I'm getting the vibe of the piece right, it's like: at least in the context of, let's say, intellectuals, certain intellectuals have a vibe of shooting the shit. They're just trying out different ideas, how do these analogies fit together, maybe there's something there. And those seem closer to the mode of looking at the particulars, like, oh, this is just like that one time in the 15th century where they overthrew this king and blah, blah, blah.
Whereas the guy who was like, oh, here's a secular trend, if you look at the growth models from a million years ago to now, here's what's happening: that one has a more sincere flavor. Some people, especially when it comes to AI discourse, have a very... the sincere mode of operating is like, I've thought through my bio anchors and I disagree with this premise, so here's how my effective compute estimate is different, and here's how I analyze the scaling laws.
And if I could only have one person to help me guide my decisions on the AI, I might choose that person. But I feel like if I could choose between, if I had 10 different advisors at the same time, I might prefer the shooting the shit type characters who have these weird esoteric intellectual influences and they're almost like random number generators.
They're not especially calibrated, but once in a while they'll be like, oh, this one weird philosopher I care about or this one historical event I'm obsessed with has an interesting perspective on this. And they tend to be more intellectually generative as well because they're not...
I think one big part of it is that if you are so sincere, you're like, oh, I've thought through this. Obviously ASI is the biggest thing that's happening right now.
It doesn't really make sense to spend a bunch of your time thinking about how the Comanches lived, and what the history of oil is, and how Girard thought about conflict. What are you talking about, come on, ASI is happening in a few years, right? Whereas the people who go on these rabbit holes, because they're just trying to shoot the shit, I feel like are more generative. I mean, it might be worth distinguishing between something like intellectual seriousness.
Right. And something like how diverse and wide-ranging
and kind of idiosyncratic are the,
you know, things you're interested in.
Right.
And I think maybe there's some correlation
where people who are kind of like,
or maybe intellectual seriousness
is also distinguishable
from something like shooting the shit.
Like maybe you can shoot the shit seriously.
I mean, there's a bunch of different ways to do this, but I think having an exposure to like all sorts of different sources of data and perspectives seems great. And I do, I do think it's possible, um, to like curate your, your kind of intellectual influences too rigidly in virtue of some story about what matters.
Like, I think, I think it is good for people to like have space. I mean, there's, I mean, I'm, I'm really a fan of, or I, I appreciate the way, like, I don't know, I try to give myself space to, um, do stuff that is not about like, this is the most important thing.
And that's like feeding other parts of myself. And I think, um, you know, parts of yourself are not isolated.
They like feed into each other. And it's sort of, I think a better way to be a kind of richer and fuller human being in a bunch of ways.
And also just like these sorts of data can be just really directly relevant.
And I think some people, um, I know who I think of as like quite intellectually sincere and in some sense quite focused on the big picture also have a very impressive command of this very wide range of kind of empirical data.
And they're like really, really interested in the empirical trends.
And they're not just like, oh, you know, it's philosophy... or, sorry, it's not just like, oh, history, it's the march of reason or something.
No, they're really, really in the weeds. I think there's a kind of in-the-weeds virtue that I actually think is closely related, in my head, with some kind of seriousness and sincerity.
I do think there's a different dimension, which is there's like kind of trying to get it right. And then there's kind of like, throw stuff out there, right? Try to like, what if it's like this or like, try this on, or I have a hammer.
I will hit everything. Well, what if I just hit everything with this hammer? Right.
And, and so I think some people do that. And I think there is, you know, there's room for all kinds.
I kind of think the, the thing where you just get it right is kind of undervalued. Or, I mean, it depends on the context you're working in.
I think, like, certain sorts of intellectual cultures and milieus and incentive systems, I think, incentivize, you know, saying something new or saying something original or saying something, like, flashy or provocative or, and then, like, kind of various cultural and social dynamics. And, like, oh, like, mm-hmm.
You know, and people are doing all these kind of performative or statusy things. There's a bunch of stuff that goes on when people do thinking, and, you know, cool, but if something's really important, let's just get it right.
Uh, and, and I think, and sometimes it's like boring, but it doesn't matter. Uh, and I also think like, like stuff is less interesting if it's false, right? Like I think if someone's like, blah, and you're like, no, I mean, it can be useful.
I think sometimes there's an interesting process where someone says like, blah, provocative thing. And it's a kind of an epistemic project to be like, wait, why exactly do I think that's false, right? And you really, you know, someone's like, healthcare doesn't work.
Medical care does not work. Right.
Someone says that, and you're like, all right, how exactly do I know that medical care works? Right. And you like go through the process of, of trying to think it through.
And, and so I think there's like room for that, but I think ultimately like, like kind of the real profundity is like true. Right? Or like kind of things, things become less interesting if they're just not true.
And I think sometimes it feels to me like people... or it's at least possible, I think, to lose touch with that and to be more flashy. And it's kind of like, eh, there's not actually something here, right? One thing I've been thinking about recently, after I interviewed Leopold, or while prepping for it, was: I haven't really thought at all about the fact that there's going to be a geopolitical angle to this AI thing.
And it turns out, if you actually think about the national security implications, that's a big deal. Now, I wonder, given the fact that that was like something that wasn't on my radar right now, it's like, oh, obviously that's a crucial part of the picture.
How many other things like that there must be?
And so even if you're coming from the perspective of like AI is incredibly important, if you did happen to be the kind of person who's like, ah, you know, every once in a while I'm like checking out different kinds of, I'm like incredibly curious about what's happening in Beijing.
And then the kind of thing that later on you realize was a big deal, you have more awareness of it, you can spot it in the first place. So I wonder... maybe there's not necessarily a trade-off; it's sort of like the rational thing is to have some really optimal explore-exploit trade-off here where you're constantly searching things out. I don't know if practically that works out that well, but that experience made me think, oh, I really should be.
Trying to expand my horizons in a way that's undirected to begin with, because there's a lot of different things about the world you have to understand to understand any one thing. Hmm.
I mean, I think there's also room for division of labor, right? Like I think there can be, yeah, like, you know, there are people who are like trying to like draw a bunch of pieces and then be like, here's the overall picture. And then people who are going really deep on specific pieces, people who are doing the more like generative, throw things out there, see what sticks.
So I think there, it also doesn't need to be that like all of the epistemic labor is like located in one brain. And you.
And it depends on your role in the world and other things. So in your series, you express sympathy with the idea that even if an AI, or I guess any sort of agent, doesn't have consciousness, if it has a certain wish and is willing to pursue it nonviolently, we should respect its right to pursue that. And I'm curious where that's coming from, because conventionally, I think, a thing matters because it's conscious, and it's its conscious experience resulting from that pursuit that matters. Well, I don't... I mean, I think that, I don't know where this discourse leads.
I just, I'm like suspicious of the amount
of like ongoing confusion that it seems to me
is like present in our conception of consciousness.
You know, I mean, so I sometimes think of analogies
with, like, you know, people talk about life
and élan vital, right?
And maybe, you know...
élan vital was this hypothesized life force
that is sort of the thing at stake in life. And I think, you know, we don't really use that concept anymore.
We think that's a little bit broken. And so I don't think you want to have ended up in a position of saying everything that doesn't have élan vital doesn't matter or something, right? Because then you end up...
and then somewhat similarly, even if you're like, no, no, there's no such thing
as élan vital,
but life,
surely life exists.
And I'm like, yeah, life exists. I think consciousness exists too, likely, depending on how we define the terms.
I think it might be a kind of verbal question. Even once you have a kind of reductionist conception of life, I think it's possible that it kind of becomes less attractive as a moral focal point.
Right. So like right now we really think of consciousness where like, it's a deep fact.
It's like... so consider a question like, okay, take a cellular automaton, right, that is sort of self-replicating.
It has some information, you know, and you're like, okay, is that alive? Right. It's kind of like, it's not that interesting.
It's kind of a verbal question, right? Or, I don't know, philosophers might get really into 'is that alive?' But you're not missing anything about this system, right? It's not like there's extra life springing up. It's just alive in some senses, not alive in other senses.
But I really think that's not how we intuitively think about consciousness. We think whether something is conscious is a deep fact.
It's like this additional, really deep difference between being conscious or not. It's like, is someone home? Are the lights on? Right.
And I, I have some concern that if that turns out not to be the case, then, then this is going to have been like a bad thing to like build our entire ethics around. And so now to be clear, I take consciousness really seriously.
I'm like, I'm like, man, consciousness. I'm not one of these people like, oh, obviously consciousness doesn't exist or something.
I'm like, but I also notice how like confused I am and how dualistic my intuitions are. And I'm like, wow, this is really weird.
And so I'm just like, error bars around this. Anyway, that's one of a bunch of things going on in my wanting to be open to not making consciousness the kind of fully necessary criterion.
I mean, clearly, I definitely have the intuition that consciousness matters a ton. If something is not conscious, and there's a deep difference between conscious and unconscious, then I definitely have the intuition that there's something that matters, especially a lot, about consciousness.
I'm not trying to be dismissive about the notion of consciousness. I just think we should be quite aware of how it seems to me how ongoingly confused we are about its nature.
Okay, so suppose we figure out that consciousness is just a word we use for a hodgepodge of different things, only some of which encompass what we care about, and maybe there are other things we care about that are not included in that word, similar to the life force analogy. Then where do you anticipate that would leave us as far as ethics goes? Would there then be a next thing that's like consciousness? Or what do you anticipate that would look like? So there's a class of people who are called illusionists in philosophy of mind, who will say consciousness does not exist. And there are different ways to understand this view.
But one version is to say that the concept of consciousness has built into it too many preconditions that aren't met by the real world, so we should chuck it out like élan vital. For at least phenomenal consciousness, right,
or qualia, or what it's like to be a thing, they'll just say this is sufficiently broken, sufficiently chock full of falsehoods, that we should just not use it.
It feels to me like there's really clearly a thing, there's something going on. I do actually expect to continue to care about something like consciousness quite a lot on reflection, and to not end up deciding that my ethics makes no reference to it. Or at least there are some things quite nearby to consciousness.
You know, like when I stub my toe and I have this like, something happens when I stub my toe. It's unclear exactly how to name it, but I'm like, something about that, you know, I'm like pretty focused on.
And so I do think, you know, in some sense, if you feel like, well, where do things go? I should be clear: I have a bunch of credence that in the end we end up caring a bunch about consciousness just directly.
And so if we don't... yeah, I mean, where will ethics go? Where will a completed philosophy of mind go? Very hard to say. I mean, maybe a move that people might make, if you get a little bit less interested in the notion of consciousness, is some sort of slightly more animistic view. Like, what's going on with the tree? And you're like, maybe not talking about it as a conscious entity necessarily, but it's also not totally unaware or something. And the consciousness discourse is rife with these funny cases where it's sort of like, oh, those criteria imply that this totally weird entity would be conscious, or something like that. Especially if you're interested in some notion of agency or preferences, all sorts of things can be agents: are corporations conscious?
And it's like, oh man. But one place it could go, in theory, is that in some sense you start to view the world as animated by moral significance in richer and subtler structures than we're used to, you know? And so plants, or weird optimization processes, are kind of outflows of something complex... I don't know, who knows exactly what you end up seeing as infused with the sort of thing that you ultimately care about.
But I think it is possible that that includes a bunch of stuff that we don't normally ascribe consciousness to. I think when you use the phrase 'a complete theory of mind,' and presumably after that a more complete ethic, even the notion of a sort of reflective equilibrium implies, oh, you'll be done with it at some point, right? You just sum up all the numbers, and then you've got the thing you care about.
This might be unrelated to the same sense we have in science, but also I think the vibe you get when you're talking about these kinds of questions is: oh, you know, we're rushing through all the science right now, we've been churning through it, and it's getting harder to find things because there's some cap; you find all the things at some point. Right now it's super easy because a semi-intelligent species has only barely emerged, and the ASI will just rush through everything incredibly fast. And then you will either have aligned its heart or not, and in either case it'll use what it's figured out about what is really going on, and then
expand through the universe and exploit, you know, do the tiling, or maybe some more benevolent version of quote-unquote tiling. That feels like the basic picture of what's going on. We had dinner with Michael Nielsen a few months ago, and his view is that this just keeps going forever, or close to it. How much would it change your understanding of what's going to happen in the future if you were convinced that Nielsen is right about his picture of science? Yeah, I mean, I think there's a few different aspects.
There's kind of... My memory of this conversation, I don't claim to really understand Michael's picture here, but I think my memory was it's sort of like, sure, you get the fundamental laws.
Like, I think my impression was that he expects sort of physics, the kind of physics to get solved or something, maybe modulo, like the expensiveness of certain experiments or something. But the difficulty is like, even granted that you have the kind of basic laws down, that still actually doesn't let you predict like where at the macro scale, like various useful technologies will be located.
Like there's just still this like big search problem. And so my memory though, you know, I'll let him speak for himself on what his take is here.
But my memory was, it was sort of like, sure, you get the fundamental stuff, but that doesn't mean you get the same tech.
You know, I'm not sure if that's true.
I think if that's true, what kind of difference would it make?
So one difference is that, well, so here's a question.
So like, it means at some sense you have to do, you have to, in a more ongoing way, make trade-offs between investing in further knowledge and further exploration versus exploiting, as you say, sort of acting on your existing knowledge. Because you can't get to a point where you're like, and we're done.
Now, you know, as I think about it, I mean, I think that's, you know, I sort of suspect that was always true. And like, I remember talking to someone, I think I was like, ah, we should, at least in the future, we should really get like all the knowledge.
And he's like, well, what do you want to like, you don't want to know the output of every Turing machine? Or like, you know, in some sense, it's a question of like, what actually would it be to have like a completed knowledge? And I think that's a rich question in its own right. And I think it's like, not necessarily that we should imagine, even in this sort of, on any picture necessarily, that you've got like everything.
And on any picture, in some sense, you could end up with this case where you cap out, like there's some collider that you can't build or whatever. Like there's some, something is too expensive or whatever.
And kind of everyone caps out there. So I guess one way to put it is: there's a question of whether you cap out, and there's a question of how contingent the place you go is. If it's contingent, one prediction that makes is that you'll see more diversity across, you know, our universe or something. If there are aliens, they might have quite different tech. And so maybe, you know, if people meet,
you don't expect them to be like,
oh, you got your thing.
I got my version.
And you're like, whoa, that thing.
Wow.
So that's like one thing.
If you expect more like ongoing discovery of tech,
then you might also expect like more ongoing change and like upheaval and churn insofar as like technology
is one thing that really drives kind of change in civilization. So that, that could be another, you know, people sometimes talk about like lock-in and there's like, ah, sort of, they envision this kind of point at which civilization is kind of like settled into some structure or equilibrium or something.
And maybe you get less of that if there's, I think that's maybe more about the pace rather than contingency or caps, but that's, um, that's another factor. So, yeah, I mean, I think, I think it is an interesting, I don't know if it changes the picture fundamentally of like earth civilization.
We still have to make trade-offs about how much to invest in research versus acting on our existing knowledge. Um, but I, you know, I think it has some, some significance.
I think one vibe you get when you talk to people... we were at a party and somebody mentioned this, we were talking about how uncertain we should be about the future, and they said: there are three things I'm uncertain about, what is consciousness, what is information theory, and what are the basic laws of physics. I think once we get those, we're done. And that's like, oh, you'll figure out what's the right kind of hedonium, and then, you know... it has that vibe. Whereas this is more like you're constantly churning through, and it has more of a flavor of becoming, like the attunement picture implies. I think it's more exciting. It's not just like, oh, you figured out the things in the 21st century and then you just... you know what I mean? Yeah, I mean, I sometimes think about there being sort of two categories of views about this. There are people who think, yeah, the knowledge, we're almost there, and then we've basically got the picture.
Right? And where the picture is sort of like, yeah, the knowledge is all just totally sitting there. Yeah.
And it's like, you just have to get there. You just have to be scientifically mature at all.
That's right. And then it's just going to all fall together.
Right. And then everything past that is going to be like this super expensive, like not, not super important thing.
And then there's a different picture, which is much more of this ongoing mystery, like, oh man, there's going to be more and more, maybe expect more radical revisions to our worldview. And I think it's an interesting question. Yeah, I'm kind of drawn to both. Like, physics: we're pretty good at physics, right? Or a lot of our physics is quite good at predicting a bunch of stuff. Or at least that's my impression, from reading some physicists, so who knows. Your dad's a physicist, though, right? Yeah, but this isn't coming from dad. This is, like, there's a blog post, I think by Sean Carroll or something.
And he's like, we really understand a lot of like the physics that governs the everyday world. Like a lot of it, we're like really good at it.
And I'm like, oh, I think I'm generally pretty impressed by physics as a discipline. I think that could well be right.
And so, you know, on the other hand, like, you know, really these guys, you know, had a few centuries of... So anyway, but I think that's an interesting...
Yeah.
And it leads to a different... I think it does.
There's something, you know, the endless frontier.
There is a draw to that from an aesthetic perspective
of the idea of, like, continuing to discover stuff.
You know, at the least, I think you don't...
You can't get, like, full knowledge in some sense
because there's always... like, what are you going to do? There's some way in which you're part of the system, so it's not clear that you... the knowledge itself is part of the system, and, I don't know, if you imagine trying to have full knowledge of what the future of the universe will be like... Well, I don't know, I'm not totally sure that's true. It has a halting problem kind of property. There's a little bit of a loopiness there. I think there are probably fixed points in that, where you could be like, yep, I'm going to do that.
And then like, you're right. But I think it's...
I at least have a question of like, are we... You know, when people imagine the kind of completion of knowledge, you know, exactly how well does that work? I'm not sure.
You had a passage in your essay on Utopia where, I think, the vibe was more of... the positive future we're looking forward to will be more like... I'll let you describe what you meant, but to me it felt more like the first stuff, like you get the thing and then now you've found the heart of the... maybe can I ask you to read that passage real quick? Oh, sure. And that way it will spur the discussion I'm interested in having, this part in particular.
Right. Quote, I'm inclined to think that utopia, however weird, would also be in a certain sense recognizable.
That if we really understood and experienced it, we would see it, we would see in it the same thing that made us sit bolt upright long ago when we first touched love, joy, beauty. That we would feel in front of the bonfire, the heat of the ember from which it was lit.
There would be, I think, a kind of remembering. Where does that fit into this picture? I think it's a good question.
I mean, I think it's some guess about, like: if there's no part of me that recognizes it as good, then I'm not sure that it's good according to me, in some sense.
Like, so yeah, I mean, it is a question of like what it takes for it to be the case that a part of you recognizes it is good. But I think if there's really none of that, then I'm not sure it's a reflection of my values at all.
there's a sort of tautological thing you can do where it's like
if I went through the processes which led to me discovering it was good, which we might call reflection, then it was good. But by definition, you ended up there because it was like, you know what I mean? Yeah, I mean, you definitely don't want to be like, you know, if you transform me into a paper clipper gradually, right, then I will eventually be like, and then I saw the light, you know, I saw the true paper clips.
Yeah. Right.
But that's part of what's complicated about this thing about reflection. You have to find some way of differentiating between the sort of development processes that preserve what you care about and the development processes that don't.
And that is in itself is this like fraught question, which itself requires like taking some stand on what you care about and what sorts of meta processes you endorse and all sorts of things. But you definitely shouldn't just be like, it is not a sufficient criteria that the thing at the end thinks it got it right.
Right. Because that's compatible with having gone like wildly off the rails.
Yeah. There was a very interesting sentence you had in your post, one of your posts where you said, our hearts have, in fact, been shaped by power.
So we should not be at all surprised if the stuff we love is also powerful. Yeah, what's going on there? I actually want to understand what you meant there. Yeah, so the context on that post is I'm talking about this hazy cluster, which I call in the essay niceness slash liberalism slash boundaries, which is this somewhat more minimal set of cooperative norms involved in respecting the boundaries of others, cooperation and peace amongst differences, tolerance and stuff like that, as opposed to your favored structure of matter, which is sometimes the paradigm of values that people use in the context of AI risk.
And, you know, I talk for a while about the ethical virtues of these norms. But it's also pretty clear, if you ask why we have these norms: well, one important feature of these norms is that they're effective and powerful. Like, secure boundaries save resources that would otherwise be wasted on conflict.
Right. And liberal societies are often, you know, better to live in, better to immigrate to, more productive, all sorts of things.
Nice people, they're better to interact with, they're better to like trade with, all sorts of things, right? And I think it's pretty clear if you look at the both like why at a political level do we have like various political institutions. And if you look kind of more deeply into our evolutionary past and like how our moral cognition is structured, it seems like pretty clear that various like kind of forms of cooperation and like kind of game theoretic dynamics and other things went into kind of shaping what we now, at least in certain contexts, also treat as a kind of intrinsic or terminal value.
So some of these values that have instrumental functions in our society also get kind of reified in our cognition as intrinsic values in themselves. And I think that's okay.
I don't think that's a debunking. All of your values are something that kind of stuck and got treated as terminally important. But I think it matters in the context of the series, where I'm talking about deep atheism and the relationship between what we're pushing for and what nature is pushing for, or what sort of pure power will push for.
And it's easy to say, well, there's paperclips, which is just one place you can steer, and, you know, pleasure is just another place you can steer, or something.
And these are just sort of arbitrary directions. Whereas I think some of our other values are much more structured around cooperation, and around things that are also effective and functional. So that's what I mean there: I think there's a way in which nature is a little bit more on our side than you might think, because part of who we are has been made by a kind of nature's way.
And so that is in us now. I don't think that's enough, necessarily, for us to beat the gray goo, right? We have some amount of power built into our values, but that doesn't mean it's going to be arbitrarily competitive. But I think it's still important to keep in mind, in the context of integrating AIs into our society: we've been talking a lot about the ethics of this, but there are also instrumental and practical reasons to want to have forms of social harmony and cooperation with AIs with different values.
And I think we need to be taking that seriously and thinking about what it is to do that in a way that's genuinely legitimate, a kind of just incorporation of these beings into our civilization. Or, sorry, there's the justice part, and there's also the question of whether it's compatible with what people want: is it a good deal, a good bargain, for people? And this is often relevant to the extent we're very concerned about AIs kind of rebelling or something like that.
It's like, well, part of a thing you can do is make civilization better for someone, right? And I think that's an important feature of how we have in fact structured a lot of our political institutions and norms and stuff like that. So that's the thing I'm getting at in that quote.
Okay. I think that's an excellent place to close.
Great. Thank you so much.
Joe, thanks so much for coming on the podcast. We discussed the ideas in the series.
I think people might not appreciate if they haven't read the series how beautifully written it is. It's just like the ideas, we didn't cover everything.
There's a bunch of very, very interesting ideas. As somebody who has talked
to people about AI for a while,
things I haven't encountered
anywhere else,
but just obviously no part
of the AI discourse
is nearly as well written.
And it is a genuinely
beautiful experience
to listen to the podcast version,
which is in your own voice.
So I highly recommend
people do that.
So it's joecarlsmith.com where they can access this. Joe, thanks so much for coming on the podcast.
Thank you for having me. I really enjoyed it.
And also, if you can leave a good rating on Apple Podcasts or wherever you listen, that's really helpful. Helps other people find the podcast.
If you want transcripts of these episodes, or if you want to get my blog posts, you can subscribe to my Substack at DwarkeshPatel.com. And finally, as you might have noticed, there are advertisements on this episode.
So if you want to advertise on a future episode, you can learn more about doing that at dwarkeshpatel.com slash advertise or the
link in the description. Anyways, I'll see you on the next one.
Thanks.