
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind
Had so much fun chatting with my good friends Trenton Bricken and Sholto Douglas on the podcast.
No way to summarize it, except:
This is the best context dump out there on how LLMs are trained, what capabilities they're likely to soon have, and what exactly is going on inside them.
You would be shocked how much of what I know about this field, I've learned just from talking with them.
To the extent that you've enjoyed my other AI interviews, now you know why.
So excited to put this out. Enjoy! I certainly did :)
Watch on YouTube. Listen on Apple Podcasts, Spotify, or any other podcast platform.
There's a transcript with links to all the papers the boys were throwing down - may help you follow along.
Follow Trenton and Sholto on Twitter.
Timestamps
(00:00:00) - Long contexts
(00:16:12) - Intelligence is just associations
(00:32:35) - Intelligence explosion & great researchers
(01:06:52) - Superposition & secret communication
(01:22:34) - Agents & true reasoning
(01:34:40) - How Sholto & Trenton got into AI research
(02:07:16) - Are feature spaces the wrong way to think about intelligence?
(02:21:12) - Will interp actually work on superhuman models
(02:45:05) - Sholto’s technical challenge for the audience
(03:03:57) - Rapid fire
Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe
Listen and Follow Along
Full Transcript
Okay, today I have the pleasure of talking with two of my good friends, Sholto and Trenton. Sholto...
It's just mixed up. Actually we did that.
I wasn't going to say anything. Let's do this in reverse.
How long have I started with my good friends? Yeah, I remember at one point caught the context like is just wow. Shit.
Anyways, Sholto: Noam Brown, the guy who wrote the Diplomacy paper, said this about Sholto. He said he's only been in the field for 1.5 years, but people in AI know that he was one of the most important people behind Gemini's success.
And Trenton, who's at Anthropic, works on mechanistic interpretability, and it was widely reported that he has solved alignment. So this will be a capabilities-only podcast.
Alignment is already solved, so no need to discuss further. Okay, so let's start by talking about context lengths.
Yep. It seemed to be under-hyped, given how important it seems to me to be that you can just put a million tokens into context.
There's apparently some other news that, you know, got pushed to the front for some reason. But, yeah, tell me about how you see the future of long context lengths and what that implies for these models.
Yeah, so I think it's really under-hyped, because until I started working on it, I didn't really appreciate how much of a step up in intelligence it was for the model to have the onboarding problem basically instantly solved. And you can see that a little bit in the perplexity graphs in the paper, where just throwing millions of tokens worth of context about a code base allows it to become dramatically better at predicting the next token, in a way that you'd normally associate with huge increments in model scale.
But you don't need that. All you need is a new context.
So, underhyped and buried by some other news. In context, are they as sample efficient and smart as humans? I think that's really worth exploring.
For example, one of the evals that we did in the paper has it learning a language in context better than a human expert could learn that new language over the course of a couple months. This is only a pretty small demonstration, but I'd be really interested to see things like Atari games or something like that, where you throw in a couple hundred or thousand frames labeled actions in the same way that you show your friend how to play a game and see if it's able to reason through.
It might, at the moment, with the infrastructure and stuff, it's still a little bit slow doing that. But I would actually, I would guess that might just work out of the box in a way that would be pretty mind-blowing.
And crucially, I think this language was esoteric enough that it wasn't in the training data. Right, exactly.
Yeah, if you look at the model before it has that, it just doesn't know the language at all and it can't get any translations. And this is an actual human language.
Yeah, exactly. An actual human language.
So if this is true, it seems to me that these models are already, in an important sense, superhuman. Not in the sense that they're smarter than us, but I can't keep a million tokens in my context when I'm trying to solve a problem, remembering and integrating all the information in a code base.
Am I wrong in thinking this is a huge unlock? I actually generally think that's true. Previously I've been frustrated when models aren't as smart.
You ask them a question and you want them to be smarter than you or to know things that you don't. And this allows them to know things that you don't, because they can ingest a huge amount of information in a way you just can't.
So, yeah, it's extremely important. How do we explain in-context learning? Yeah.
So there's a line of work I quite like where it looks at in-context learning as basically very similar to gradient descent: the attention operation can be viewed as gradient descent on the in-context data. That paper had some cool plots where it basically showed that taking N steps of gradient descent looks like N layers of in-context learning, and the two look very similar. So I think that's one way of viewing it and trying to understand what's going on.
Yeah.
And you can ignore what I'm about to say because, given the introduction, alignment is solved and AI safety isn't a problem.
But I think the context stuff does get problematic, but also interesting, here. I think there'll be more work coming out in the not too distant future around what happens if you give a hundred-shot prompt for jailbreaks or adversarial attacks.
It's also interesting in the sense that if your model is doing gradient descent and learning on the fly, even if it's been trained to be harmless, you're dealing with a totally new model in a way. You're like fine-tuning, but in a way where you can't control what's going on.
Can you explain what you mean by gradient descent happening in the forward pass and attention? Yeah. No, no, no. There was something in the paper about trying to teach the model to do linear regression.
Right. But like just through the number of samples they gave in the context.
Yeah. And you can see if you plot on the x-axis like number of shots that it has or examples.
And then like the loss it gets on just like ordinary least squares regression. Yeah.
That will go down with time. And it goes down exactly matched with number of gradient descent steps.
Yeah, exactly. Okay.
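To make the comparison concrete, here is a minimal sketch of the gradient-descent half of that plot, assuming a toy in-context linear regression setup; a trained model's in-context loss versus number of shots would be laid alongside a curve like this one (illustrative code, not the paper's):

```python
# Illustrative sketch: loss on a held-out query after N steps of gradient
# descent on the in-context examples. This is the curve that in-context
# learning (loss vs. number of shots) gets compared against.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # dimensionality of the toy regression task
w_true = rng.normal(size=d)            # ground-truth weights for this "task"

k = 32                                 # number of in-context examples ("shots")
X = rng.normal(size=(k, d))
y = X @ w_true + 0.1 * rng.normal(size=k)
x_q = rng.normal(size=d)               # held-out query point
y_q = x_q @ w_true

w = np.zeros(d)
lr = 0.05
for step in range(1, 21):
    grad = X.T @ (X @ w - y) / k       # gradient of mean squared error on the context
    w -= lr * grad
    if step % 5 == 0:
        query_loss = (x_q @ w - y_q) ** 2
        print(f"{step:2d} GD steps on the context -> query loss {query_loss:.4f}")
```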
I only read the intro and discussion section of that paper, but in the discussion, the way they framed it is that in order to get better at long context tasks, the model has to get better at learning to learn from the examples or context that's already within the window. The implication is that meta-learning has to happen: because the model has to learn how to get better at long context tasks, in some important sense the task of intelligence requires long context examples and long context training. Understanding how to better induce meta-learning in your pre-training process is a very important part of flexible or adaptive intelligence, but you can proxy for that just by getting better at doing long context tasks.
One of the bottlenecks for AI progress that many people identify is the inability of these models to perform tasks on long horizons, which means engaging with a task for many hours or even many weeks or months, the way that if I have, I don't know, an assistant or an employee, I can just tell them to do a thing and they'll work on it for a while.
And AI agents haven't taken off for this reason, from what I understand. So how linked are long context windows, and the ability to perform well on them, and the ability to do these kinds of long horizon tasks that require you to engage with an assignment for many hours? Or are these unrelated concepts? I mean, I would actually take issue with that being the reason that agents haven't taken off, where I think that's more about nines of reliability and the model actually successfully doing things.
And if you just can't chain tasks successfully with high enough probability, then you won't get something that looks like an agent. And that's why something like an agent might follow more of a step function.
Like GPT-4 class models, Gemini Ultra class models, they're not enough. But maybe the next increment on model scale means that you get that extra nine, even though the loss isn't going down that dramatically.
That small amount of extra ability gives you that extra reliability. And obviously you need some amount of context to fit long horizon tasks, but I don't think that's been the limiting factor up to now.
Yeah. The NeurIPS best paper this year, where Rylan Schaeffer was the lead author, points to this, that emergence is a mirage: people will have a task, and you get the right or wrong answer depending on whether you've sampled the last five tokens correctly.
And so naturally that's, you're multiplying the probability of sampling all of those. And if you don't have enough nines for reliability, then you're not going to get emergence.
And all of a sudden you do. And it's like, oh my gosh, this ability is emergent when actually it was kind of almost there to begin with.
And there are ways that you can find a smooth metric for that. Yeah, HumanEval or whatever.
In the GPT-4 paper, the coding problems, they measure... Log pass rate, right? Exactly.
For the audience, the context on this is: when you're measuring how much progress there has been on a specific task like solving coding problems, you give it credit when it gets it right even only one in a thousand times. You don't just give it a one-in-a-thousand score, because the point is that it got it right some of the time.
And so the curve you see is it gets it right one in a thousand, then one in a hundred, then one in ten, and so forth. So actually, I want to follow up on this.
So if your claim is that AI agents haven't taken off because of reliability rather than long horizon task performance, isn't the lack of reliability when a task is chained on top of another task on top of another task exactly the difficulty with long horizon tasks? You have to do 10 things in a row or 100 things in a row, and if the reliability of any one of them diminishes, say the probability goes down from 99.99% to 99.9%, then the whole thing gets multiplied together and becomes much less likely to happen. That is exactly the problem. But the key issue you're pointing out there is that your base task solve rate is 90%. And if it was 99%, then chaining them doesn't become a problem.
Yeah, exactly.
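To make that arithmetic concrete (my illustrative numbers, not figures from the conversation), chained success is just the per-step reliability raised to the number of chained steps:

```python
# Probability that an agent completes a chain of n dependent steps, assuming
# each step succeeds independently with the same reliability. Illustrative.
for per_step in (0.90, 0.99, 0.999):
    for n_steps in (10, 100):
        p_chain = per_step ** n_steps
        print(f"per-step {per_step:.3f}, {n_steps:3d} steps -> "
              f"chained success {p_chain:.4f}")
```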
And I think this is also something that just hasn't been properly studied enough. If you look at all of the evals that are commonly used, like the academic evals, they're a single problem. The MATH benchmark, it's like one typical math problem, or MMLU, it's like one university-level problem from across different topics. You're beginning to see evals looking at this properly via more complex tasks like SWE-bench, where they take a whole bunch of GitHub issues, and that is a reasonably long horizon task, but it's still a multi-sub-hour task as opposed to a multi-hour or multi-day task.
And so I think one of the things that will be really important to do over the next however long is to understand better what success rate over long horizon tasks looks like. And I think that's even important to understand what the economic impact of these models might be, and to actually properly judge increasing capabilities, by cutting down the tasks that we do, and the inputs and outputs involved, into minutes or hours or days, and seeing how good it is at successively chaining and completing tasks at those different resolutions of time.
Because then that tells you how automatable a job family or task family is, in a way that MMLU scores don't. I mean, it was less than a year ago that we introduced 100K context windows.
And I think everyone was pretty surprised by that. So yeah, everyone just kind of had this soundbite of quadratic attention costs, that we can't have long context windows, and here we are. So yeah, the benchmarks are being actively made. Wait, wait. So doesn't the fact that there are these companies, Google and, I don't know, Magic, maybe others, who have million-token attention, and you shouldn't say anything, but doesn't that imply that it's not quadratic anymore, or are they just eating the cost? Well, who knows what Google is doing for its long context scheme.
I'm not saying anything there. One of the things that has frustrated me about the general research field's approach to attention is that there's an important way in which the quadratic cost of attention is actually dominated, in typical dense transformers, by the MLP block. So you have this n-squared term that's associated with attention, but you also have an n-squared term that's associated with d_model, the residual stream dimension of the model. And I think Sasha Rush has a great tweet where he basically plots the curve of the cost of attention relative to the cost of really large models, and attention actually trails off.
And you actually need to be doing pretty long contexts before that term becomes really important. And the second thing is that people often talk about how attention at inference time is such a huge cost, right? And if you think about when you're actually generating tokens, the operation is not N-squared.
It is one Q, like one set of Q vectors, looks up a whole bunch of KV vectors, and that's linear with respect to the amount of context that the model has. And so I think this drives a lot of the recurrence and state space research where people have this meme of linear attention and all this stuff.
And as Trenton said, there's a graveyard of ideas around attention. Not that I don't think it's worth exploring, but I think it's important to consider where the actual strengths and weaknesses of it are.
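A rough sketch of that accounting, using the standard per-layer FLOPs approximations for a dense transformer (constants are the usual ballpark ones, not any particular model's), to show when the n-squared attention term actually overtakes the d_model-squared projection and MLP terms:

```python
# Rough per-layer FLOPs for a dense transformer. The attention n^2 term only
# dominates once the context is long relative to d_model; at inference time,
# generating one token is linear in context length, not quadratic.
def per_layer_flops(n_tokens: int, d_model: int, mlp_mult: int = 4):
    qkvo_proj = 8 * n_tokens * d_model**2        # Q, K, V, O projections
    attn_scores = 4 * n_tokens**2 * d_model      # QK^T plus attention @ V
    mlp = 4 * mlp_mult * n_tokens * d_model**2   # the two MLP matmuls
    return qkvo_proj, attn_scores, mlp

d_model = 8192                                   # hypothetical residual width
for n in (2_048, 8_192, 32_768, 131_072, 1_000_000):
    proj, attn, mlp = per_layer_flops(n, d_model)
    ratio = attn / (proj + mlp)
    print(f"context {n:>9,}: attention n^2 term = {ratio:.2f}x the d_model^2 terms")
```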
Okay, so what do you make of this take? As we move forward through the takeoff, more and more of the learning happens in the forward pass. So originally, like all the learning happens in the backward, you know, during like this, like bottom up sort of hill climbing evolutionary process.
If you think in the limit, during the intelligence explosion, the AI is maybe handwriting the weights or doing GOFAI or something. And we're in the middle step, where a lot of learning happens in context now with these models, and a lot of it still happens within the backward pass.
Does this seem like a meaningful gradient along which progress is happening? Because the broader point is, if you're learning in the forward pass, it's much more sample efficient because you can basically think as you're learning. Like when humans read a textbook, you're not just skimming it and trying to absorb, you know, inductively, which words follow which words.
You like read it and you think about it and then you read some more, you think about it. I don't know.
Does this seem like a sensible way to think about the progress? Yeah. It may just be one of the ways in which, you know, birds and planes both fly, but they fly differently, and the virtue of technology allows planes to accomplish things that birds can't. It might be that context length is similar, in that it allows the model to have a working memory that we can't, but functionally it's not the key thing towards actual reasoning. The key step between GPT-2 and GPT-3 was that all of a sudden there was this meta-learning behavior that was observed in training, in the pre-training of the model.
And that's, as you said, it's something to do with you give it some amount of context, it's able to adapt to that context, and that was a behavior that wasn't really observed before that at all. Maybe that's a mixture of property of context and scale and this kind of stuff.
But it wouldn't have occurred in a model with a tiny context, I would say. This is actually an interesting point.
So when we talk about scaling up these models, how much of it comes from just making the models themselves bigger? And how much comes from the fact that during any single call you are using more compute? So if you think of diffusion, you can just iteratively keep adding more compute. And if adaptive compute is solved, you can keep doing that. And in this case, if there's a quadratic penalty for attention, but you're doing long context anyways, then you're still dumping in more compute, not during training, not by having bigger models, but just like, yeah.
Yeah, it's interesting because you do get more forward passes by having more tokens. My one gripe, I guess I have two gripes with this, though, maybe three. So one, in the AlphaFold paper, one of the transformer modules, they have a few and the architecture is very intricate, but they do, I think, five forward passes through it and gradually refine their solution as a result. You can also think of the residual stream.
I mean, Sholto alluded to the read-write operations as a poor man's adaptive compute, where it's like, I'm just going to give you all these layers, and if you want to use them, great. If you don't, then that's also fine.
And then people will be like, oh, well, the brain is recurrent and you can do however many loops through it as you want. And I think to a certain extent, that's right.
Right. Like if I ask you a hard question, you'll spend more time thinking about it.
And that would correspond to more forward passes. But I think there's a finite number of forward passes that you can do.
It's kind of with language as well. People are like, oh, well, human language can have like infinite recursion in it, like infinite nested statements of like the boy jumped over the bear that was doing this, that had done this, that had done that.
But empirically, you'll only see five to seven levels of recursion, which kind of relates to whatever that magic number is of how many things you can hold in working memory at any given time. And so, yeah, it's not infinitely recursive, but does that matter in the regime of human intelligence? And can you not just add more layers?
Break down for me, you're referring to this in some of your previous answers, of listen, you have these long contexts and you can hold more things in memory, but ultimately it comes down to your ability to mix concepts together to do some kind of reasoning, and these models aren't necessarily human level at that even in context. Break down for me how you see storing just raw information versus reasoning and what's in between. Where's the reasoning happening? Where's just storing raw information happening? What's different between them? Yeah, I don't have a super crisp answer for you here.
I mean, obviously, with the input and output of the model, you're mapping back to actual tokens, right? And then in between that, you're doing higher level processing. Before we get deeper into this, we should explain to the audience: you referred earlier to Anthropic's way of thinking about transformers as these read-write operations that layers do.
One of you should just kind of explain at a high level what you mean by that. So the residual stream, imagine you're in a boat going down a river.
And the boat is kind of the current query where you're trying to predict the next token. So it's the cat sat on the blank.
Right. And then you have these little streams that are coming off the river where you can get extra passengers or collect extra information if you want.
And those correspond to the attention heads and MLPs that are part of the model. Right.
I almost think of it like the working memory of the model. Yeah.
Like the RAM of computer, where you're choosing what information to read in so you can do something with it, and then maybe you read something else in later on. And you can operate on subspaces of that high-dimensional vector.
A ton of things are, I mean, at this point I think it's almost a given, encoded in superposition. So it's like, yeah, the residual stream is just one high-dimensional vector, but actually there's a ton of different vectors that are packed into it.
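A minimal sketch of that read-write picture, with stand-in blocks instead of real attention and MLP weights (shapes and names are illustrative, not any particular model's code):

```python
# The residual stream as a shared working memory: each attention and MLP
# block reads from it and additively writes back into it, layer by layer.
import numpy as np

d_model, n_tokens = 512, 16
residual = np.random.randn(n_tokens, d_model)      # one vector per token position

def attention_block(x):
    # stand-in: a real attention head reads subspaces of the residual stream
    # and moves information between token positions
    return 0.01 * np.random.randn(*x.shape)

def mlp_block(x):
    # stand-in: a real MLP reads each position and writes new features back
    return 0.01 * np.random.randn(*x.shape)

for layer in range(4):
    residual = residual + attention_block(residual)  # additive "write"
    residual = residual + mlp_block(residual)        # additive "write"

# the last position's final residual vector is what gets unembedded into
# next-token logits
print(residual[-1].shape)
```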
Yeah. I might like just like dumb it down, like as a way that would have made sense to me a few months ago of, okay, so you have, you know, whatever words are in the input you put into the model, all those words get converted into these tokens and those tokens get converted into these vectors.
And basically it's just like this small amount of information that's moving through the model. And the way you explained it to me, Sholto, that this paper talks about, is that early on in the model, maybe it's just doing some very basic things about what these tokens mean. Like if it says 10 plus five, it's just moving information about to have that.
Good representation. Exactly, just represent.
And in the middle, maybe like the deeper thinking is happening about like how to think, yeah, how to solve this. At the end, you're converting it back into the output token because the end product is you're trying to predict the probability of the next token from the last of those residual streams.
And so, yeah, it's interesting to think about just the small compressed amount of information moving through the model and getting modified in different ways. Trenton, it's interesting, you're one of the few people who have a background in neuroscience.
You can think about the analogies here to the brain. And in fact, one of our friends, Hugh, had a paper in grad school about thinking about attention in the brain.
And he said this is the only, or first, neural explanation of why attention works. Whereas we have evidence for why CNNs, convolutional neural networks, work, based on the visual cortex or something.
Yeah, I'm curious, do you think in the brain there's something like a residual stream, this compressed amount of information that's moving through and getting modified as you're thinking about something? Even if that's not what's literally happening, do you think that's a good metaphor for what's happening in the brain? Yeah, yeah. In the cerebellum, you basically do have a residual stream, where the whole, what we'll call the attention module for now, and I can go into whatever amount of detail you want on that, you have inputs that route through it, but they'll also just go directly to the endpoint that that module will contribute to. So there's a direct path and an indirect path, and so the model can pick up whatever information it wants and then add that back in. What does the cerebellum do? So the cerebellum nominally just does fine motor control, but I analogize this to the person who's lost their keys and is just looking under the streetlight, where it's very easy to observe this behavior.
One leading cognitive neuroscientist said to me that a dirty little secret of any fMRI study where you're looking at brain activity for a given task is that the cerebellum is almost always active and lighting up for it. If you have a damaged cerebellum, you also are much more likely to have autism.
So it's associated with like social skills. In one of these particular studies where I think they use PET instead of fMRI, but when you're doing next token prediction, the cerebellum lights up a lot.
Also, 70% of the neurons in your brain are in the cerebellum. They're small, but they're there, and they're taking up real metabolic cost. This was one of Gwern's points, that what changed with humans was not just that we have more neurons, he shared this article, but specifically that there are more neurons in the cerebral cortex relative to the cerebellum, and you should say more about this, but they're more metabolically expensive and they're more involved in signaling and sending information back and forth.
Yeah. Is that attention? What's going on? Yeah.
Yeah. So I guess the main thing I want to communicate here.
So back in the 1980s, Pentti Kanerva came up with an associative memory algorithm for: I have a bunch of memories, I want to store them.
There's some amount of noise or corruption that's going on. And I want to query or retrieve the best match.
And so he writes this equation for how to do it. And a few years later, realizes that if you implemented this as an electrical engineering circuit, it actually looks identical to the core cerebellar circuit.
And that circuit and the cerebellum more broadly is not just in us. It's in basically every organism.
There's active debate on whether or not cephalopods have it. They kind of have a different evolutionary trajectory.
But even fruit flies, with the Drosophila mushroom body, that is the same cerebellar architecture. And so there's that convergence, and then my paper, which shows that this operation is, to a very close approximation, the same as the attention operation, including implementing the softmax and having this sort of nominal quadratic cost that we've been talking about.
And so the three-way convergence here, and the takeoff and success of transformers, seems pretty striking to me. Yeah, I want to zoom out and ask, I think what motivated this discussion in the beginning was we were talking about, wait, what is the reasoning? What is the memory? What do you think about the analogy you found between attention and this? Do you think of this as more just looking up the relevant memories or the relevant facts? And if that's the case, where is reasoning happening in the brain, and how do we think about how that builds up into reasoning? Yeah, so maybe my hot take here, I don't know how hot it is, is that most intelligence is pattern matching, and you can do a lot of really good pattern matching if you have a hierarchy of associative memories.
So you start with your very basic associations between just like objects in the real world. But you can then chain those and have more abstract associations, such as like a wedding ring symbolizes like so many other associations that are downstream.
And you can even generalize the attention operation and this associative memory to the MLP layer as well. There it's in a long-term setting where you don't have tokens in your current context. But I think this is an argument that association is all you need.
And associative memory in general as well. So you can do two things with it. You can denoise or retrieve a current memory, so if I see your face but it's raining and cloudy, I can denoise and kind of gradually update my query towards my memory of your face. But I can also access a memory.
And then the value that I get out actually points to some other totally different part of the space. So a very simple instance of this would be if you learn the alphabet. I query for A and it returns B, I query for B and it returns C, and you can traverse the whole thing.
Yeah. Yeah.
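A tiny sketch of that retrieval rule, assuming a Kanerva-style key-value store with a softmax read-out; the same operation covers both denoising a corrupted query and the hetero-associative alphabet case where the stored value points to the next item (illustrative code, not the paper's):

```python
# Softmax-weighted associative memory: structurally the same read-out as an
# attention operation. Keys are stored patterns; values can equal the keys
# (denoising) or point elsewhere (hetero-association, e.g. A -> B).
import numpy as np

rng = np.random.default_rng(0)
d, n_memories = 64, 26
keys = rng.normal(size=(n_memories, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
values = np.roll(keys, shift=-1, axis=0)       # "alphabet": key i maps to key i+1

def retrieve(query, beta=8.0):
    scores = keys @ query                      # similarity to every stored key
    weights = np.exp(beta * (scores - scores.max()))
    weights /= weights.sum()                   # softmax over memories
    return weights @ values                    # weighted read-out

noisy_a = keys[0] + 0.2 * rng.normal(size=d)   # corrupted query for "A"
out = retrieve(noisy_a)
print("retrieved value is closest to key:", int(np.argmax(keys @ out)))  # expect 1, i.e. "B"
```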
One of the things I talked to Demis about was that he had a paper in 2008 about how memory and imagination are very linked, because of this very thing that you mentioned, that memory is reconstructive. So you are in some sense imagining every time you're recalling a memory, because you're only storing a condensed version of it and you have to reconstruct it. And this is famously why human memory is terrible, and why people in the witness box or whatever will just make shit up.
Okay, so let me ask a stupid question. You read Sherlock Holmes, right? And the guy is incredibly sample efficient. He'll see a few observations and he'll basically figure out who committed the crime, because there's a series of deductive steps that leads from somebody's tattoo and what's on the wall to the implications of that. How does that fit into this picture? Because crucially, what makes him smart is that there's not just an association, but a sort of deductive connection between different pieces of information.
Would you just explain it as that's just like higher level association? Like, yeah. I think so, yeah.
So I think learning these higher level associations, to be able to then map patterns to each other, is kind of like meta-learning. I think in this case, he would also just have a really long context length or a really long working memory, right? Where he can have all of these bits and continuously query them as he's coming up with whatever theory.
So the theory is moving through the residual stream, and then his attention heads are querying his context. But then how he's projecting his queries and keys in the space, and how his MLPs are then retrieving longer-term facts or modifying that information, is allowing him in later layers to do even more sophisticated queries and slowly be able to reason through and come to a meaningful conclusion. That feels right to me in terms of looking back in the past: you're selectively reading in certain pieces of information, comparing them, and maybe that informs your next step of what piece of information you now need to pull in.
And then you build this representation, which I like progressively looks closer and closer and closer to like the suspect in your case. Yeah.
Yeah. That doesn't feel at all outlandish.
On the lens of suspects, well, something I think that people who aren't doing this research can overlook is that after the first layer of the model, every query, key, and value that you're using for attention comes from the combination of all the previous tokens. So in my first layer, I'll query my previous tokens and just extract information from them.
But all of a sudden, let's say that I attended to tokens one, two, and four in equal amounts, then the vector in my residual stream, assuming that they just, they wrote out the same thing to the value vectors, but ignore that for a second, is a third of each of those. And so when I'm querying in the future, my query is actually a third of each of those things.
And so... But they might be written to different subspaces.
That's right. Hypothetically, but they wouldn't have to.
And so you can recombine and immediately, even by layer two and certainly by the deeper layers, just have like these very rich vectors that are packing in a ton of information. And the causal graph is like literally over every single layer that happened in the past.
And that's what you're operating on. Yeah.
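A small sketch of that mixing, simplified so the value and output projections are treated as identity (the "ignore that for a second" caveat above); names and shapes are illustrative:

```python
# After layer 1, a position's residual vector is a mixture of the tokens it
# attended to, so its later-layer queries are projections of that mixture.
import numpy as np

rng = np.random.default_rng(0)
d = 16
tokens = rng.normal(size=(5, d))               # embeddings for tokens 0..4

# suppose position 4 attends equally to tokens 1, 2, and 4 at layer 1
attn_weights = np.array([0.0, 1/3, 1/3, 0.0, 1/3])
mixed_residual = attn_weights @ tokens         # a third of each attended token

# at layer 2, position 4's query is a projection of that mixed vector,
# not of token 4's original embedding alone
W_Q = rng.normal(size=(d, d)) / np.sqrt(d)
query_layer2 = mixed_residual @ W_Q
print(query_layer2.shape)                      # (16,)
```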
It does bring to mind that a very funny eval to do would be a Sherlock Holmes eval. Let's see, put the entire book into context, and then you have a sentence which is like "the suspect is X", and then you have a logit probability distribution over the different characters. Yeah, yeah. And then as you put more in, that would be super cool. Yeah, I wonder if you'd get anything at all. That'd be cool. Sherlock Holmes is probably already in the training data.
Right. You've got to get a mystery novel that was written in the...
You can get an LLM to write it. Or we could purposely exclude it.
We can. Well, you need to scrape any discussion of it from Reddit or any other thing, right? Right.
It's hard. But that's one of the challenges that goes into things like long context evals is to get a good one.
You need to know that it's not in your training data, or you put in the effort to exclude it. So I actually want to, there are two different threads I want to follow up on. Let's go to the long context one and then we'll come back to this. So in the Gemini 1.5 paper, the eval that was used was, can it take Paul Graham essays, can it remember stuff? Yeah, the needle in a haystack. Right.
Which, yeah, I mean, there's like, we don't necessarily just care about its ability to recall one specific fact from the context. I'll step back and ask the question, like the loss function for these models is unsupervised.
You don't have to come up with these bespoke things that you keep out of the training data. Is there a way you can do a benchmark that's also unsupervised, where, I don't know, another LLM is rating it in some way or something like that? And maybe the answer is, well, if you could do this, reinforcement learning would work, because then you'd have this unsupervised signal.
Yeah, I mean, I think people have explored that kind of stuff. For example, Anthropic has the constitutional AI paper, where they take another language model and they point it at a response and say, how helpful or harmless was that response? And then they get it to update and try to improve along the Pareto frontier of helpfulness and harmlessness.
So you can point language models at each other and create evals in this way. It's obviously an imperfect art form at the moment, because you get reward function hacking, basically. And even humans are imperfect here: if you try to match up to what humans will say, humans typically prefer longer answers, which aren't necessarily better answers, and you get that same behavior with models.
On the other thread, going back to the Sherlock Holmes thing, if it's all associations all the way down, and this is a sort of naive dinner party question, the kind you'd ask if I just told you "I'm working on AI," but okay, does that mean we should be less worried about superintelligence? Because there's not this sense in which it's Sherlock Holmes plus plus. It'll still need to just find these associations, like humans find associations.
And, you know what I mean? It's not just that it sees a frame of the world and it's figured out all the laws of physics. So for me, and this is a very legitimate response, right, it's like, well, artificial general intelligences aren't, if you say humans are generally intelligent, then they're no more capable or competent. I'm just worried that you have that level of general intelligence in silicon, where you can then immediately clone hundreds of thousands of agents, and they don't need to sleep, and they can have super long context windows, and then they can start recursively improving, and then things get really scary. So I think to answer your original question, yes, you're right, they would still need to learn associations.
But the recursive self-improvement would still have to be them. If intelligence is fundamentally about these associations, the improvement is just them getting better at association; there's not another thing that's happening. And so then it seems like you might disagree with the intuition that, well, they can't be that much more powerful if they're just doing associations.
Well, I think then you can get into really interesting cases of meta learning. Like when you play a new video game or like study a new textbook, you're bringing a whole bunch of skills to the table to form those associations much more quickly.
And because everything in some way ties back to the physical world, I think there are general features that you can pick up and then apply in novel circumstances. Should we talk about the intelligence explosion then? I don't know if it's a good one.
I mentioned multiple agents and I'm like, oh, here we go. The reason I'm interested in discussing this with you guys in particular is that the models we have of the intelligence explosion so far come from economists, which is fine, but I think we can do better. Because in the model of the intelligence explosion, what happens is you replace the AI researchers, and then there's a bunch of automated AI researchers who can speed up progress, make more AI researchers, and make further progress.
And so I feel like if that's the metric, or that's the mechanism, we should just ask the AI researchers about whether they think this is plausible. So let me just ask you: if I have a thousand agent Sholtos or agent Trentons, do you just get an intelligence explosion? What does that look like to you? I think one of the important bounding constraints here is compute.
I do think you could dramatically speed up AI research. It seems very clear to me that in the next couple of years, we'll have things that can do many of the software engineering tasks that I do on a day-to-day basis and therefore dramatically speed up my work and therefore speed up the rate of progress.
At the moment, I think most of the labs are somewhat compute bound, in that there are more experiments you could run and more pieces of information you could gain, in the same way that scientific research in biology is also somewhat experimentally throughput bound: you need to be able to run and culture the cells in order to get the information. I think that will be at least a short-term constraint. Obviously, you know, Sam's trying to raise seven trillion dollars to get chips, and so it does seem like there's going to be a lot more compute in the future as everyone heavily ramps up. NVIDIA's stock price sort of represents the relative compute increase. But any thoughts?
I think we need a few more nines of reliability in order for it to really be useful and trustworthy, and context lengths that are super long and very cheap to have. If I'm working in our code base, it's really only small modules that I can get Claude to write for me right now, but it's very plausible that within the next few years, or even sooner, it can automate most of my tasks. The only other thing I will note here is that the research that at least our sub-team in interpretability is working on is so early stage that you really have to be able to make sure everything is done correctly, in a bug-free way, and contextualize the results with everything else in the model.
And if something isn't going right, be able to enumerate all of the possible things and then slowly work on those. Like an example that we've publicly talked about in previous papers is dealing with layer norm, right? And it's like, if I'm trying to get an early result or look at like the logit effects of the model, right?
So it's like, if I activate this feature that we've identified to a really large degree, how does that change the output of the model? Am I using layer norm or not? How is that changing the feature that's being learned? And that will take even more context or reasoning abilities for the model. So you used a couple of concepts together, and it's not self-evident to me that they're the same, but it seems like you were using them interchangeably.
So I just want to, one was, well, to work on the Claude code base and make more modules based on that, they need more context or something, where it seems like it might already be able to fit in the context. Do you mean actual, like the context window context, or more? Yeah, the context window context.
So yeah, it seems like now it might just be able to fit. The thing that's preventing it from making good modules is not the lack of being able to put the code base in there.
I think that will be there soon. Yeah.
But like it's not going to be as good as you at like coming up with papers because it can like fit the code base in there. No, but it'll speed up a lot of the engineering.
In a way that causes an intelligence explosion? No, that accelerates research. But I think these things compound.
So the faster I can do my engineering, the more experiments I can run. And then the more experiments I can run, the faster we can...
I mean, my work isn't actually accelerating capabilities at all. Right, right, but it's interpreting the models. But we have a lot more work to do on that, surprising as that is to Twitter. Yeah, I mean, for context, when you released your paper, there was a lot of talk on Twitter of "alignment is solved, guys, close the curtains." Yeah, yeah. No, it keeps me up at night how quickly the models are becoming more capable and just how poor our understanding still is of what's going on.
Yeah, I guess I'm still, okay, so let's think through the specifics here. By the time this is happening, we have bigger models that are two to four orders of magnitude bigger, right? Or at least in effective compute, two to four orders of magnitude bigger.
And so this idea that, well, you can run experiments faster or something, you're having to retrain that model in this version of the intelligence explosion. The recursive self-improvement is different from what might have been imagined 20 years ago, where you just rewrite the code. You actually have to train a new model, and that's really expensive, not only now but especially in the future as you keep making these models orders of magnitude bigger. Doesn't that dampen the possibility of a sort of recursive self-improvement type intelligence explosion? It's definitely going to act as a braking mechanism.
Like, I agree that the world of what we're making today looks very different to what people imagined it would look like 20 years ago. Like, it's not going to be able to write its own code to be really smart because actually it needs to train itself.
Like, the code itself is typically quite simple, typically pretty small and self-contained. I think John Carmack had this nice phrase where it's like the first time in history where you can actually plausibly imagine writing AI with 10,000 lines of code.
And that actually does seem plausible when you pare most training code bases down to the limit. But it doesn't take away from the fact that this is something we should really strive to measure and estimate how progress might occur.
We should be trying very, very hard right now to measure exactly how much of a software engineer's job is automatable and what the trend line looks like, and be trying our hardest to project out those trend lines. But with all due respect to software engineers, you are not writing a React front end, right? Right, right.
So it's like, I don't know how this, like what is concretely happening? And maybe you can walk me through, walk me through like a day in the life of, like you're working on an experiment or project that's going to make the model quote-unquote better. Right.
Like what is happening from observation to experiment to theory to like writing the code? What is happening? And so I think important to contextualize here is that, like, I've primarily worked on inference so far. So a lot of what I've been doing is just taking or helping guide the pre-training process, socially design a good model for inference, and then making the model and, like, the surrounding system faster.
I've also done some pre-training work around that, but it hasn't been my 100% focus. But I can still describe what I do when I do that work.
I know, but sorry, let me interrupt and say that in Carl Shulman's episode, when he was talking about it on the podcast, he did say that things like improving inference, or even literally helping make better chips or GPUs, are part of the intelligence explosion. Yeah.
Because obviously if the inference code runs faster, everything happens better or faster or whatever, right? Anyway, sorry, go ahead. Yeah. Okay, so what does a day concretely look like? I think the most important part to illustrate is this cycle of coming up with an idea, proving it out at different points in scale, and interpreting and understanding what goes wrong. And I think most people would be surprised to learn just how much goes into interpreting and understanding what goes wrong.
Because the ideas, people have long lists of ideas that they want to try. Not every idea that you think should work will work, and trying to understand why that is is quite difficult.
And like working out what exactly you need to do to interrogate it. So, so much of it is like introspection about what's going on.
It's not pumping out thousands and thousands and thousands of lines of code. It's not even the difficulty of coming up with ideas.
I think many people have a long list of ideas that they want to try. But paring that down and shot-calling, under very imperfect information, which are the right ideas to explore further is really hard. Tell me more about what you mean by imperfect information. Are these early experiments? What is the information that you're getting? So Demis mentioned this in his podcast, and obviously there's the GPT-4 paper, where you have scaling law increments. And you can see in the GPT-4 paper, they have a bunch of dots, right? Where they say we can estimate the performance of our final model using all of these dots.
And there's a nice curve that flows through them. And Demis mentioned that we do this process of scaling up.
Concretely, why is that imperfect information? It's that you never actually know if the trend will hold. For certain architectures, the trend has held really well.
And for certain changes, it's held really well. But that isn't always the case.
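A sketch of the kind of extrapolation being described, fitting a simple power law to small runs and predicting a bigger one; the numbers and the fitting choice are mine for illustration, not the actual procedure used for GPT-4 or Gemini:

```python
# Fit a power law L(C) = a * C^(-b) to the losses of small-scale runs via
# linear regression in log-log space, then extrapolate to a larger budget.
import numpy as np

compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])   # FLOPs of the small runs
loss    = np.array([3.10, 2.95, 2.82, 2.71, 2.62])   # made-up final losses

b, log_a = np.polyfit(np.log(compute), np.log(loss), deg=1)   # b comes out negative
predict = lambda c: np.exp(log_a) * c ** b

big_run = 1e22
print(f"predicted loss at {big_run:.0e} FLOPs: {predict(big_run):.2f}")
# The "imperfect information" part: nothing guarantees this fitted trend
# holds for the architecture change you actually care about at full scale.
```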
And things which can help at smaller scales can actually hurt at larger scales. So you're making guesses based on what the trend lines look like and based on your intuitive feeling of, okay, this is actually something that's going to matter, particularly for those ones which help at the small scale.
It's interesting to consider that for every chart you see in a release paper or technical report that shows that smooth curve, there's a graveyard of first few runs that just go flat. Yeah, there's all these other lines that go in different directions and tail off. Yeah, it's crazy, both as a grad student and then also here, the number of experiments that you have to run before getting a meaningful result.
Tell me, okay, so presumably it's not just that you run it until it stops and then let's go to the next thing. There's some process by which to interpret the early data. And also, I don't know, I could put a Google Doc in front of you and I'm pretty sure you could just keep typing for a while on different ideas you have.
And there's some bottleneck between that and just like making the models better immediately. Yeah, walk me through like what is the inference you're making from the first early steps that makes you have better experiments and better ideas? I think one thing that I didn't fully convey before was that I think a lot of, like, good research comes from working backwards from the actual problems that you want to solve.
And there's a couple of, like, grand problems in, like, making the models better today that you would identify as issues and then, like, work back from, okay, how could I how could I change it to achieve this? There's also a bunch of when you scale, you run into things and you want to fix behaviors or issues at scale, and that informs a lot of the research for the next increment and this kind of stuff. So concretely, the barrier is a little bit software engineering, like often having code base that's large and uh sort of capable enough that it can support many people doing research at the same time makes it complex if you're doing everything by yourself your iteration pace is going to be much faster i've heard that like alec radford for example like famously did much of the pioneering work at open AI he like mostly works out of like a jupiter notebook and then like has someone else who like writes and productionizes code for him.
I don't know if that's true or not. But that kind of stuff, like actually operating with other people raises the complexity a lot because not for natural reasons, like familiar to like every software engineer.
And then running and launching those experiments is easy, but there's an inherent slowdown induced by that. So you often want to be parallelizing multiple different streams, because for one, you can't necessarily be totally focused on one thing.
You might not have like fast enough feedback cycles. And then intuiting what went wrong is actually really hard.
Working out what happened, this is in many respects the problem that the team Trenton is on is trying to better understand: what is going on inside these models. We have inferences and understanding and headcanon for why certain things work, but it's not an exact science.
And so you have to constantly be making guesses about why something might've happened, what experiment might reveal, whether that is or isn't true. And that's probably the most complex part.
The performance work, comparatively, is easier, but harder in other respects. It's just a lot of low-level and difficult engineering work.
Yeah, I agree with a lot of that. But even on the interpretability team, I mean, especially with Chris Ola leading it, there are just so many ideas that we want to test.
And it's really just having the engineering skill, but I'll put engineering in quotes because a lot of it is research, to very quickly iterate on an experiment, look at the results, interpret it, try the next thing, communicate them, and then just ruthlessly prioritizing what the highest priority things to do are. And this is really important.
The ruthless prioritization is something which I think separates a lot of quality research from research that doesn't necessarily succeed as much. We're in this funny field where so much of our initial theoretical understanding has broken down, basically.
And so you need to have this simplicity bias and ruthless prioritization over what's actually going wrong. And I think that's one of the things that separates the most effective people: they don't necessarily get too attached to solving a problem using a given solution that they're already familiar with, but rather they attack the problem directly. You see this a lot where maybe people come in with a specific academic background and they try to solve problems with that toolbox.
And the best people are people who expand the toolbox dramatically. They're running around and they're taking ideas from reinforcement learning but also from optimization theory and also they have a great understanding of systems and so they know what the sort of constraints that bound the problem are and they're good engineers so they can iterate and try ideas fast.
By far, the best researchers I've seen, they all have the ability to try experiments really, really, really, really, really fast. And that's cycle time.
At smaller scales, cycle time separates people. I mean, machine learning research is just so empirical.
Yeah. And this is honestly one reason why I think our solutions might end up looking more brain-like than otherwise. Even though we wouldn't want to admit it, the whole community is kind of doing greedy evolutionary optimization over the landscape of possible AI architectures and everything else. It's no better than evolution, and that's not even necessarily a slight against evolution.
That's such an interesting idea. I'm still confused on what will be the bottleneck for these, what would have to be true of an agent such that it's like sped up your research.
So in the Alec Radford example you gave, where he apparently already has the equivalent of a copilot for his Jupyter notebook experiments, is it just that if he had enough of those he would be a dramatically faster researcher, and so you just need Alec Radford? So it's like you're not automating the humans, you're just making the most effective researchers, who have great taste, more effective, and running the experiments for them and so forth. Or, like, you're still working at the point at which the intelligence explosion is happening, you know what I mean? Is that what you're saying? Right. And if that were directly true, why can't we scale our current research teams better, for example, is I think an interesting question to ask. If this work is so valuable, why can't we take hundreds or thousands of people, who are definitely out there, and scale our organizations better?
I think at the moment we are less bound by the sheer engineering work of making these things than we are by compute to run and get signal, and by taste in terms of what the actual right thing to do is, and making those difficult inferences on imperfect information. That's for the Gemini team, because I think for interpretability, we actually really want to keep hiring talented engineers, and I think it's a big bottleneck for us, obviously more people is better. But I do think it's interesting to consider. I think one of the biggest challenges that I've thought a lot about is how do we scale better? Google is an enormous organization.
It has 200,000-ish people, right? Like maybe 80,000 or something like that. And one has to imagine if there were ways of scaling out Gemini's research program to all those fantastically talented software engineers.
This seems like a key advantage that you would want to be able to take advantage of, you'd want to be able to use, but how do you effectively do that? It's a very complex organizational problem. So compute and taste, that's interesting to think about because at least the compute part is not bottlenecked on more intelligence.
It's just bottlenecked on Sam's 7 trillion or whatever, right? Yeah, yeah. So if I gave you 10x the H100s to run your experiments, how much more effective a researcher are you? I think the Gemini program would probably be maybe five times faster with 10 times more compute or something like that. So that's a pretty good elasticity, like 0.5. Yeah. Wait, that's insane. Yeah, I think more compute would just directly convert into progress. So you have some fixed size of compute, and some of it goes to inference and also to clients of GCP.
Yep. Some of it goes to training and I guess as a fraction of it, some of it goes to running the experiments for the full model.
Yeah, that's right. Shouldn't then the fraction goes to experiments be higher given that you would just be like, if the bottleneck is research and research is bottlenecked by compute.
And so one of the strategic decisions that every pre-training team has to make is exactly what amount of compute you allocate to your different training runs. To your research program versus scaling the last best thing that you landed on.
And I think they're all trying to arrive at a pretty optimal point here. One of the reasons why you need to still keep training big models is that you get information there that you don't get otherwise.
So scale has all these emergent properties which you want to understand better. And if you're always doing research and never, remember what I said before about how you're not sure what's going to fall off the curve, right? Yeah. If you keep doing research in this regime and keep getting more and more compute efficient, you may have actually gone off the path to eventually scaling. So you need to constantly be investing in doing big runs, too, at the frontier of what you sort of expect to work.
Okay, so then tell me what it looks like to be in the world where AI has significantly sped up AI research. Because from this, it doesn't really sound like the AIs are going off and writing the code from scratch and that's leading to faster output.
It sounds like they're really augmenting the top researchers in some way. Tell me concretely: are they doing the experiments? Are they coming up with the ideas? Are they just evaluating the outputs of the experiments? What's happening? So I think there are two worlds you need to consider here.
One is where AI has meaningfully sped up our ability to make algorithmic progress. Right.
And one is where the output of the AI itself is the thing that's like the crucial ingredient towards like model capability progress. And like specifically what I mean there is synthetic data, right? And in the first world, where it's meaningfully speeding up algorithmic progress, I think a necessary component of that is more compute.
And you probably reach this elasticity point where AIs, maybe at some point, are easier to speed up and get onto context than yourself.
That's right, than other people.
And so AIs meaningfully speed up your work because they're like a fantastic copilot, basically, that helps you code multiple times faster. And that seems like actually quite reasonable.
Super long context, super smart model. It's onboarded immediately, and you can send it off to complete subtasks and subgoals for you.
And that actually feels very plausible. But again, we don't know, because there are no great evals for that kind of thing. The best one, as I said before, is SWE-bench. Although somebody was mentioning to me that the problem there is that when a human is doing a pull request, they'll type something out, run it, see if it works, and if it doesn't, they'll rewrite it. None of that was part of what the LLM was given when it was run on the benchmark: it just output something, and if it runs and checks all the boxes, then it passed, right? So it might have been an unfair test in that way. But you can imagine that if you were able to use that process, it would be an effective training source, because the key thing that's missing from a lot of training data is the reasoning traces. If I wanted to try to automate a specific field or job family, or understand how at risk of automation it is, then having reasoning traces feels to me like a really important part of that.
There's so many threads. Yeah, there's so many different threads in that I want to follow up on.
Let's begin with the data versus compute thing: is the output of these AIs the thing that's causing the intelligence explosion, or something else? Yeah. People talk about how these models are really a reflection of their data.
Yeah. I forget his name, but there's a great blog post by this OpenAI engineer. And he was talking about how, at the end of the day, as these models get better and better, they're just going to be really effective maps of the dataset. Yeah. So at the end of the day, you've got to stop thinking about architectures. The most effective architecture is just the one that does an amazing job of mapping the data. Right. So that implies that future AI progress comes from the AI making really awesome data, right? Data that you're mapping to.
I think that's clearly a very important part. Yeah.
Yeah, that's really interesting. Does that look like chain of thought to you, or what do you imagine? As these models get better and smarter, what does the synthetic data look like? When I think of really good data, to me that means something which involved a lot of reasoning to create.
In modeling that, it's similar to Ilya's perspective on achieving superintelligence via effectively perfectly modeling human textual output. But even in the near term, in order to model something like the arXiv papers or Wikipedia, you have to have an incredible amount of reasoning behind you in order to understand what the next token might be. And so for me, what I imagine as good data is data where reasoning had to happen to produce it. And then the trick, of course, is how do you verify that that reasoning was correct?
And this is why you saw DeepMind do that geometry work, the self-play, or tree search, for geometry, basically. Because geometry is an easily formalizable, easily verifiable field, you can check whether the reasoning was correct. And you can generate heaps of verified geometry proofs, train on that, and know that that's good data.
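To make that loop concrete, here is a minimal sketch of the verify-then-train idea being described. The function names, the model object, and the verifier are hypothetical placeholders, not AlphaGeometry's actual pipeline.

```python
# Hypothetical sketch: sample candidate proofs, keep only the ones a formal
# checker accepts, and treat the survivors as verified training data.

def build_verified_dataset(model, problems, verifier, samples_per_problem=8):
    dataset = []
    for problem in problems:
        for _ in range(samples_per_problem):
            proof = model.generate(problem)        # candidate reasoning trace
            if verifier.check(problem, proof):     # automatic formal verification
                dataset.append((problem, proof))   # only verified traces survive
    return dataset

# The verified pairs can then be used as ordinary supervised training data,
# e.g. train(model, build_verified_dataset(model, problems, verifier)).
```

The key property is the verifier: because a proof checker gives an unambiguous correct/incorrect signal, the generated reasoning traces can be trusted as training data in a way that free-form model output cannot.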
It's actually funny because I had a conversation with Grant Sanderson, like, last year, where we were debating this. And I was like, fuck, dude.
By the time they get gold at the Math Olympiad, of course they're going to automate all the jobs. Yikes.
On the synthetic data thing, one of the things I speculated about in my scaling post, which was heavily informed by discussions with you two, and you especially, was that you can think of human evolution through this lens: we get language, and so our copies are generating the synthetic data which we're trained on. And it's this really effective genetic-cultural co-evolutionary loop. And there's a verifier there too, right? There's the real world. You might generate a theory that the gods cause the storms, and then someone else finds cases where that isn't true. And so you know that it didn't match your verification function, and now instead you have some weather simulation which required a lot of reasoning to produce and which accurately matches reality, and you can train on that as a better model of the world. We are training on that, and on stories, and on scientific theories. I want to go back to something you mentioned a little while ago: given how empirical ML is, it really is an evolutionary process that's resulting in better performance, and not necessarily an individual coming up with a breakthrough in a top-down way. That has interesting implications. The first is that people are concerned about capabilities increasing because more people are going into the field. I've been somewhat skeptical of that way of thinking, but from this perspective of just more input, it really does feel more like, oh, actually, the fact that more people are going to ICML means there's faster progress towards GPT-5. Yeah, you just have more genetic recombination.
Right. And like shots on target.
Yeah. And I mean, aren't both fields kind of like that? There's the scientific framing of discovery versus invention. Whenever there's been a massive scientific breakthrough in the past, typically there were multiple people co-discovering it at roughly the same time. And that feels to me at least a little bit like the mixing and trying of ideas. You can't try an idea that's so far out of scope that you have no way of verifying it with the tools you have available.
Yeah, I think physics and math might be slightly different in this regard. Yeah.
But especially for biology or any sort of wetware, and to the extent we want to analogize neural networks here, it's comical how serendipitous a lot of the discoveries are. Yeah.
Like penicillin, for example. Another implication of this is that the idea that AGI is just going to come tomorrow, that somebody's going to discover a new algorithm and we have AGI, seems less plausible.
It will just be a matter of more and more ML researchers finding these marginal things that all add up together to make models better. That feels like the correct story to me.
Especially while we're still hardware constrained, right? Do you buy this narrow-window framing of the intelligence explosion? GPT-3 to GPT-4 is two OOMs, two orders of magnitude more compute, or at least more effective compute, in the sense that if you didn't have any algorithmic progress, it would have to be two orders of magnitude bigger in raw form to be as good. Do you buy the framing that, given you have to be two orders of magnitude bigger at every generation, if you don't get AGI by GPT-7 that can help catapult an intelligence explosion, you're kind of just fucked as far as much smarter intelligences go, and you're stuck with GPT-7-level models for a long time? Because at that point, you're consuming significant fractions of the economy to make that model, and we just don't have the wherewithal to make GPT-8. This is the Carl Shulman sort of argument: we're going to race through the orders of magnitude in the near term, but then longer term it gets harder. I think he's probably talked about it, but yeah, I do buy that framing. I mean, I generally buy that increases of an order of magnitude in compute give, in absolute terms, almost diminishing returns on capability, right? We've seen models go, over a couple of orders of magnitude, from being unable to do anything to being able to do huge amounts. And it feels to me like each incremental order of magnitude gives more nines of reliability at things, and so it unlocks things like agents.
But at least at the moment, I haven't seen anything transformative... It doesn't feel like reasoning improves linearly, so to speak, but rather somewhat sublinearly.
That's actually a very bearish sign. We were chatting with one of our friends and he made the point that if you look at what new applications are unlocked by GPT-4 relative to GPT-3.5, it's not clear it's that much. GPT-3.5 can do Perplexity or whatever. So if there is this diminishing increase in capabilities, and that increase costs exponentially more to get, that's actually a bearish sign on what 4.5 will be able to do or what 5 will unlock in terms of economic impact. That being said, for me, the jump between 3.5 and 4 is pretty huge.
And so even another 3.5-to-4 jump is ridiculous, right? Imagine 5 being a 3.5-to-4 jump straight off the bat in terms of ability to do SATs and this kind of stuff. Yeah, the LSAT performance was particularly striking. Exactly. You go from not super smart, to very smart, to utter genius in the next generation, instantly. And it doesn't, at least to me, feel like we're going to jump to utter genius in the next generation. But it does feel like we'll get very smart plus lots of reliability.
And then we'll see, TBD, what that continues to look like. Will GOFAI be part of the intelligence explosion? Where you say synthetic data, but in fact it will be the model writing its own source code in some important way. There was an interesting paper showing you can use diffusion to come up with model weights. I don't know how legit that was, but something like that. So GOFAI is good old-fashioned AI, right? Can you define that? Because when I hear it, I think of if-else statements for symbolic logic. Sure.
I actually want to make sure we fully unpack the whole model improvement increments thing. Yeah. Because I don't want people to come away with the perspective that this is actually super bearish and models aren't going to get much better. Okay. What I want to emphasize more is that the jumps we've seen so far are huge, and even if those continue on a smaller scale, we're still in for extremely smart, very reliable agents over the next couple of orders of magnitude. And we didn't fully close the thread on the narrow window thing. But when you think of, let's say, GPT-4 costing, I don't know, let's call it $100 million or whatever, you have the 1B run, the 10B run, the 100B run, all of which seem very plausible by private company standards. You mean in terms of dollars? In terms of dollar amount, yeah. And then you can also imagine even a 1T run being part of a national consortium or a national-level thing, but that's much harder for an individual company.
But Sama is out there trying to raise $7 trillion, right? He's already preparing for a whole order of magnitude more than that... Right. He shifted the Overton window. He shifted the order of magnitude here beyond the national level. So I want to point out that we have a lot more jumps ahead, and even if those jumps are relatively smaller, that's still a pretty stark improvement in capability. Not only that, but if you believe claims that GPT-4 is around a one trillion parameter count, I mean, the human brain is between 30 and 300 trillion synapses. That's obviously not a one-to-one mapping, and we can debate the numbers, but it seems pretty plausible that we're below brain scale still.
So crucially, the point is that the algorithmic overhang is really high, and maybe this is something we should touch on explicitly. Even if you can't keep dumping more compute beyond models that cost a trillion dollars or something, the fact that the brain is so much more data efficient implies that we have the compute: if we had the brain's algorithm to train with, if we could train as sample efficiently as humans do from birth, we could make the AGI. Yeah, but the sample efficiency stuff, I never know exactly how to think about it, because obviously a lot of things are hardwired in certain ways, right? Like the co-evolution of language and the brain's structure. So it's hard to say. Also, there are some results that if you make your model bigger, it becomes more sample efficient. Yeah, the original scaling laws paper, right. So maybe that also just solves it: you don't have to be more data efficient, because if your model is bigger, you're also just more efficient. Well, how do we think about that? What is the explanation for why that would be the case? A bigger model sees the exact same data, and at the end of seeing that data it's learned more from it. I mean, my very naive take here would just be that one thing the superposition hypothesis that interpretability has pushed is that your model is dramatically underparameterized, and that's typically not the narrative under which deep learning has been pursued, right? But if you're trying to train a model on the entire internet and have it predict it with incredible fidelity, you are in the underparameterized regime. And you're having to compress a ton of things and take on a lot of noisy interference in doing so.
And so having a bigger model, you can just have cleaner representations that you can work with. Yeah.
For the audience, you should unpack that: first, what superposition is, and second, why that's the implication of superposition. Sure.
Yeah. So the fundamental result, and this was before I joined Anthropic, but the paper's titled Toy Models of Superposition, finds that even for small models, if you are in a regime where your data is high dimensional and sparse, and by sparse I mean any given data point doesn't appear very often, your model will learn a compression strategy, which we call superposition, so that it can pack more features of the world into itself than it has parameters.
And I think both of these constraints apply to the real world, and modeling internet data is a good enough proxy for it. The sparsity here is like: there's only one Dwarkesh. There's only one shirt you're wearing. There's this Liquid Death can here. These are all objects or features, and how you define a feature is tricky. So you're in a really high dimensional space, because there are so many of them, and they appear very infrequently.
And in that regime, your model will learn compression. To riff a little bit more on this, I think it's becoming increasingly clear, or I will say, I believe, that the reason networks are so hard to interpret is in large part because of this superposition. So if you take a model and you look at a given neuron in it, a given unit of computation, and you ask how this neuron contributes to the output of the model when it fires, and you look at the data that it fires for, it's very confusing. It'll fire for something like 10% of every possible input, or for Chinese, but also fish and trees and the full stop in URLs, right?
But the paper that we put out last year, Towards Monosemanticity, shows that if you project the activations into a higher dimensional space and provide a sparsity penalty, which you can think of as undoing the compression, returning the data to the high dimensional and sparse regime you assumed it was originally in, you get out very clean features. And things all of a sudden start to make a lot more sense.
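For readers who want the mechanics, here is a minimal sketch of that dictionary-learning setup, assuming a standard sparse autoencoder with an L1 penalty. The dimensions are illustrative assumptions, and this is not Anthropic's actual training code.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Project d_model activations into a much wider feature space,
    so that the sparsity penalty can 'undo' superposition."""
    def __init__(self, d_model=512, d_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # wide, non-negative features
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(activations, reconstruction, features, l1_coeff=1e-3):
    reconstruction_error = (reconstruction - activations).pow(2).mean()
    sparsity_penalty = features.abs().mean()              # push most features toward zero
    return reconstruction_error + l1_coeff * sparsity_penalty
```

The idea is that each of the wide, sparsely firing feature directions tends to correspond to something human-interpretable, whereas the raw neurons do not.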
Okay. There's so many interesting threads there.
The first thing I want to ask is about the thing you mentioned, that these models are trained in a regime where they're overparameterized. Isn't that when you get generalization? Grokking happens in that regime, right? So, I was saying the models are underparameterized. Oh, yeah. Typically people talk about deep learning as if the model is overparameterized, but actually the claim here is that they're dramatically underparameterized given the complexity of the task they're trying to perform.
Another question. So, the distilled models: first of all, what is happening there? Because the earlier claim we were talking about is that smaller models are worse at learning than bigger models. But with GPT-4 Turbo, you could make the claim that it's actually worse at reasoning-style stuff than GPT-4 but probably knows the same facts, like the distillation got rid of some of the reasoning. Do we have any evidence that GPT-4 Turbo is a distilled version of GPT-4? It might just be a different architecture. Oh, okay.
Yeah, it could just be a faster, more efficient architecture. Okay, interesting. So that's cheaper. Yeah. How do you interpret what's happening in distillation? I think Gwern had one of these questions on his website: why can't you train the distilled model directly? Why does it have to go through the bigger model? Is the picture that you have to project from this bigger space down to a smaller space? I mean, I think both models will still be using superposition. But the claim here is that you get a very different model if you distill versus if you train from scratch.
Yeah. And is it just more efficient, or is it fundamentally different in terms of performance?
I don't remember, but do you know? I think the traditional story for why distillation is more efficient is that normally during training, you're trying to predict this one-hot vector that says: this is the token that you should have predicted. And if your reasoning process means that you're really far off predicting that, then you get gradient updates that are in the right direction, but it might be really hard for you to learn to have predicted that token in the context that you're in. And so what distillation does is it doesn't just give you the one-hot vector, it gives you the full readout from the larger model of all of the probabilities. Yeah, and so you get more signal about what you should have predicted. In some respects it's like showing a bit of your working too; it's not just 'this was the answer.' I see. Yeah, totally. That makes a lot of sense. It's kind of like watching a kung fu master versus being in The Matrix and just downloading it. Exactly, exactly. Yep.
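A minimal sketch of the contrast being drawn here, using the standard soft-label formulation of distillation. The temperature and shapes are illustrative assumptions, not any particular lab's recipe.

```python
import torch.nn.functional as F

def one_hot_loss(student_logits, target_token_ids):
    """Ordinary training signal: only which token was 'right'."""
    return F.cross_entropy(student_logits, target_token_ids)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Distillation signal: match the teacher's full distribution
    over the vocabulary, not just the single correct token."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```

The difference in signal is exactly what is described above: the teacher's probabilities over every token carry far more information per training step than a single one-hot target.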
Yep. Just to make sure the audience got that: when you're training a distilled model, you see all of the teacher's probabilities over the tokens it was predicting, and then over the ones you were predicting, and you update against all those probabilities rather than just seeing the final word and updating on that. Okay, so this actually raises a question I was intending to ask you.
Right now, I think you were the one who mentioned you can think of chain of thought as adaptive compute. To step back and explain: by adaptive compute, the idea is that one of the things you would want models to be able to do is, if a question is harder, spend more cycles thinking about it. So how do you do that? Well, there's only a finite, predetermined amount of compute that one forward pass implies. So if there's a complicated reasoning-type question or a math problem, you want to be able to spend a long time thinking about it. Then you do chain of thought, where the model just thinks through the answer. And you can think of all those forward passes where it's thinking through the answer as being able to dump more compute into solving the problem. Now, going back to the signal thing: when it's doing chain of thought, it's only able to transmit that one token of information, whereas, as you were talking about, the residual stream is already a compressed representation of everything that's happening in the model. And then you're turning the residual stream into one token, which is like log of 50,000, log of vocab size, bits, which is so tiny.
So I don't think it's quite only transmitting that one token, right? If you think about it, during a transformer forward pass you create these KV values that future steps then attend to. And so all of those keys and values are bits of information that you could use in the future.
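To put rough numbers on the two channels: the sampled token itself carries only about log2(50,000) ≈ 15.6 bits, while the cached keys and values from every past position are full vectors. Here is a compressed, single-head sketch of that cache; the shapes and weight names are illustrative assumptions, not any particular model's implementation.

```python
import math
import torch

print(math.log2(50_000))   # ≈ 15.6 bits visible per sampled token

def decode_step(x_new, W_q, W_k, W_v, kv_cache):
    """One incremental decoding step with a KV cache.
    x_new: [1, d_model] hidden state for the newest position."""
    q, k, v = x_new @ W_q, x_new @ W_k, x_new @ W_v
    kv_cache["k"] = torch.cat([kv_cache["k"], k], dim=0)   # keys from all past steps
    kv_cache["v"] = torch.cat([kv_cache["v"], v], dim=0)   # values from all past steps
    # The new position reads from every cached key/value vector,
    # not from the discrete tokens that were previously sampled.
    scores = (q @ kv_cache["k"].T) / kv_cache["k"].shape[-1] ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return weights @ kv_cache["v"], kv_cache
```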
Is the claim that when you fine-tune on chain of thought, the key and value weights change so that this sort of steganography can happen in the KV cache? I don't think I could make that strong a claim, but that sounds plausible. It's a good headcanon for why it works. And I don't know if there are any papers explicitly demonstrating that, but it's at least one way you can imagine it happening. During pre-training, the model is trying to predict these future tokens, and one thing you can imagine it doing is learning to smush information about potential futures into the keys and values that it might want to use in order to predict future information.
It kind of smooths that information across time during pre-training. So I don't know if people are particularly training on chain of thought. I think the original chain of thought paper had it as almost an emergent property of the model: you could prompt it to do this kind of stuff, and it still worked pretty well.
But yeah, that's a good headcanon for why that works. To be overly pedantic here, the tokens that you actually see in the chain of thought do not necessarily need to correspond at all to the vector representation that the model gets to see when it decides to attend back to those tokens.
Exactly. In fact, during training, what a training step is, is you actually replace the token the model output with the real next token. And yet it's still learning, because it has all this information internally. When you're getting a model to produce at inference time, you take the token that it did output, feed it in at the bottom, embed it, and it becomes the beginning of the new residual stream. Right, right. And then you use the past KVs to read into and adapt that residual stream.
At training time, you do this thing called teacher forcing, basically, where you say: actually, the token you were meant to output is this one. That's how you do it in parallel, right? Because you have all the tokens, you put them all in in parallel, and you do one giant forward pass. And so the only information it's getting about the past is the keys and values; it never sees the token that it output. It's kind of like it's trying to do the next-token prediction, and if it messes up, you just give it the correct answer. Right, right. Okay, that makes sense. Otherwise it could become totally derailed. Yeah, it would go off the train tracks.
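A minimal sketch of the teacher forcing just described: every position is processed in one parallel forward pass, each position conditions only on the true previous tokens, and the loss compares its prediction to the real next token. Illustrative code, not any specific lab's training loop.

```python
import torch.nn.functional as F

def teacher_forced_loss(model, token_ids):
    """token_ids: [batch, seq_len] of ground-truth tokens.
    The model never conditions on its own (possibly wrong) samples."""
    inputs = token_ids[:, :-1]        # true tokens, shifted by one
    targets = token_ids[:, 1:]        # the real next token at each position
    logits = model(inputs)            # one big parallel forward pass
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```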
How much of this sort of secret communication from the model to its future forward passes, how much steganography and secret communication, do you expect there to be? We don't know.
Honest answer, we don't know. But I wouldn't even necessarily classify it as secret information, right? A lot of the work that Trenton's team is trying to do is to make these things fully visible, maybe not to the user, but we should be able to understand and interpret what these values are doing and the information they're transmitting. I think that's a really important goal for the future. Yeah. I mean, there are some wild papers, though, where people have had the model do chain of thought and it is not at all representative of what the model actually decides its answer is.
And you can go in and edit. No, no, no.
In this case, you can even go in and edit the chain of thought so that the reasoning is totally garbled, and it will still output the true answer. But also, it gets a better answer at the end of the chain of thought than it would without doing it at all. So something useful is happening, but the useful thing is not human understandable? I think in some cases you can also just ablate the chain of thought and it would have given the same answer anyway. Interesting.
Interesting. I'm not saying this is always what goes on, but there's plenty of weirdness to be investigated.
It's very interesting to go and look at and try to understand, and it's something you can do with open source models. I wish there was more of this kind of interpretability and understanding work done on open models. Yeah. I mean, even in Anthropic's recent sleeper agents paper, which at a high level, for people unfamiliar, is basically: I train in a trigger word, and when I say it, if I say it's the year 2024, the model will write malicious code instead of what it otherwise would. They do this attack with a number of different models; some of them use chain of thought, some of them don't, and those models respond differently when you try to remove the trigger. You can even see them do this comical reasoning that's also pretty creepy. In one case it even tries to calculate an expected value: well, the expected value of me getting caught is this, but if I multiply it by my ability to keep saying 'I hate you, I hate you, I hate you,' then this is how much reward I should get.
And then it will decide whether or not to actually tell the interrogator that it's malicious. Oh. But there's also another paper, from a friend, Miles Turpin, where you give the model a bunch of examples where the correct answer is always A for multiple-choice questions. And then you ask the model, what is the correct answer to this new question? It will infer from the fact that all the examples are A that the correct answer is A, but its chain of thought is totally misleading.
It will make up random stuff that sounds as plausible as possible, but it's not at all representative of the true answer. But isn't this how humans think as well? The famous split-brain experiments, where for a person suffering from seizures, one way to treat it is to cut the thing that connects the two halves of the brain.
And then the speech half is on the left side, so it's not connected to the part that decides to do a movement. And so if the other side decides to do something, the speech part will just make something up, and the person will think that's legitimately the reason they did it. Totally, yeah. It's just that some people will hail chain of thought reasoning as a great way to solve AI safety. Oh, I see. And it's like, actually, we don't know whether we can trust it.
What will this landscape of models communicating with themselves in ways we don't understand look like, and how does that change with AI agents? Because then it's not just the model itself with its previous caches, but other instances of the model. And then... It depends a lot on what channels you give them to communicate with each other, right? If you only give them text as a way of communicating, then they probably have to interpret... How much more effective do you think the models would be if they could share the residual streams versus just text? Hard to know.
But plausibly so. I mean way that you can imagine this is if you wanted to describe how a picture should look, only describing that with text would be hard.
Maybe some other representation would plausibly be easier. You can look at how DALL-E works at the moment, right? It produces those prompts. And when you play with it, you often can't quite get it to do exactly what the model wants, or what you want. Only DALL-E has that problem. It's too easy. A lot of image-related models have that problem.
You can imagine being able to transmit some kind of denser representation of what you want would be helpful there. And that's two very simple agents.
I think a nice halfway house here would be features that you learn from dictionary learning. Yeah, that would be really cool.
You get more internal access, but a lot of it is much more human interpretable. Yeah.
So for the audience: you would project the residual stream into this larger space where we know what each dimension actually corresponds to, and then feed that to the next agent, or whatever.
Okay, so your claim is that we'll get AI agents when these things are more reliable and so forth. When that happens, do you expect it will be multiple copies of models talking to each other, or will adaptive compute just be solved and the thing simply runs with more compute when it needs to do the kind of thing a whole firm needs to do? I ask this because there are two things that make me wonder whether agents are the right way to think about what will happen in the future. One is that with longer context, these models are able to ingest and consider information that no human can. Today we need one engineer who's thinking about the front-end code and one engineer who's thinking about the back-end code, whereas this thing can just ingest the whole thing. This sort of Hayekian problem of specialization goes away. Second, these models are just very general: you're not using different types of GPT-4 to do different kinds of things. You're using the exact same model, right? So I wonder whether that implies that in the future an AI firm is just one model, instead of a bunch of AI agents hooked together. That's a great question.
I think especially in the near term, it will look much more like agents hooked together. And I say that purely because, as humans, we're going to want to have these isolated, reliable components that we can trust, and we're going to need to be able to improve and instruct those components in ways that we can understand, rather than just turning it all into one giant black-box company.
Like one, it isn't going to work initially. Later on, of course, you can imagine it working, but initially it won't work.
And two, we probably don't want to do it that way. You can also have each of the smaller, well, each of the agents can be a smaller model that's cheaper to run.
And you can fine-tune it so that it's actually good at the task.
There's a future, and Dwarkesh has brought up adaptive compute a couple of times, where the distinction between small and large models disappears to some degree. And with long context, there's also a degree to which fine-tuning might disappear, to be honest. These are two things that are very important in today's landscape of models: we have whole different tiers of model sizes, and we have models fine-tuned for different things. You can imagine a future where you actually just have a dynamic bundle of compute and effectively infinite context, and the context specializes your model to different things. One thing you can imagine is that you have an AI firm or something, and the whole thing is end-to-end trained on the signal of: did I make profits? Or, if that's too ambiguous, if it's an architecture firm making blueprints: did my client like the blueprints? And in the middle, you can imagine agents who are salespeople, agents who do the designing, agents who do the editing, whatever.
Would that kind of signal work on an end-to-end system like that? Because one of the things that happens in human firms is that management considers what's happening at the larger level and gives fine-grained signals to the pieces when there's a bad quarter or whatever. Yeah, in the limit, yes. That's the dream, right? All you need to do is provide this extremely sparse signal, and then over enough iterations you create the information that allows you to learn from that signal. But I don't expect that to be the thing that works first. I think this is going to require an incredible amount of care and diligence on behalf of the humans surrounding these machines, making sure they do exactly the right thing, exactly what you want, and giving them the right signals to improve in the ways that you want. Yeah, you can't train on the RL reward unless the model gets some reward.
Yeah, exactly. You're in this sparse RL world where if the client never likes what you produce, then you don't get any reward at all and it's kind of bad.
But in the future, these models will be good enough to get the reward some of the time, right? This is the nines of reliability that Sholto was talking about. Yeah.
There's an interesting digression, by the way, on what we were talking about earlier: that dense representations will be favored, right, because that's a more efficient way to communicate. A book that Trenton recommended, The Symbolic Species, has this really interesting argument that language is not just a thing that exists; it was also something that evolved along with our minds, and specifically evolved to be both easy for children to learn and something that helps children develop. Hold on, back up for me. Because a lot of the things that children learn are received through language, the languages that will be fittest are the ones that help raise the next generation, right?
And that like makes them smarter, better, whatever.
Like gives them the concepts to express more complex ideas. Yeah, that and I guess more pedantically just like not die.
Right, sure. Lest you encode the important shit to not die.
And so when we just think of language, we think, oh, it's this contingent and maybe suboptimal way to represent ideas. Actually, maybe one of the reasons that LLMs have succeeded is that language has evolved for tens of thousands of years to be this sort of cast in which young minds can develop, right? That is the purpose it evolved for.
Certainly when you talk to like multimodal or like computer vision researchers versus when you talk to language model researchers, people who work in other modalities have to put enormous amounts of thought into exactly what the right representation space for the images is. And like what the right signal to learn from there.
Is it directly modeling the pixels, or is it some loss that's conditioned on something else? There's a paper from ages ago where they found that if you trained on the internal representations of an ImageNet model, it helped you predict better. But later on that's obviously limiting, and so there was PixelCNN, where they try to discretely model the individual pixels and so on. But understanding the right representation there is really hard. In language, people are just like, well, I guess you just predict the next token, right? It's kind of easy.
Decision made. I mean, there's the tokenization discussion and debate, one of Gwern's favorites.
Yeah. Yeah, that's really interesting.
How much is the case for multimodality being a way to get past the data wall based on the idea that the things you would have learned from more language tokens anyway, you can just get from YouTube? Has that actually been the case? How much positive transfer do you see between different modalities, where the images are actually helping you be better at writing code or something, the model learning a latent capability just from trying to understand the image? Demis, in his interview with you, mentioned positive transfer. I can't get in trouble. I can't say heaps about that, other than to say this is something that people believe: yes, we have all of this data about the world.
It would be great if we could learn an intuitive sense of physics from it that helps us reason. That seems totally plausible.
And yeah, I'm the wrong person to ask. But there are interesting interpretability pieces where, if we fine-tune on math problems, the model just gets better at entity recognition. There's a paper from David Bau's lab recently where they investigate what actually changes in a model when you fine-tune it, with respect to the attention heads and these sorts of things.
Fascinating.
And they have this synthetic problem: box A has this object in it, box B has this other object in it, what was in this box? And it makes sense, right? You're better at attending to the positions of different things, which you need for coding and for manipulating math equations. I love this kind of research. What's the name of the paper, do you know? If you look up fine-tuning models from David Bau's group, it came out about a week ago.
Okay.
I'm not endorsing the paper.
That's like a longer conversation, but like this, it does talk about and cite other work on this entity recognition ability. One of the things you mentioned to me a long time ago is the evidence that when you train LLMs on code, they get better at reasoning and language.
Which, unless it's the case that the comments in the code are just really high quality tokens or something, implies that learning to think through how to code makes you a better reasoner. And that's crazy, right? I think that's one of the strongest pieces of evidence for scaling just making the thing smart.
Like that kind of like positive transfer. I think like this is true in two senses.
One is just that modeling code obviously implies modeling a difficult reasoning process used to create it, but two, that code is a nice explicit structure of composed reasoning, I guess. If this, then that.
Code has a lot of structure in that way. Yeah. Structure that you could imagine transferring to other types of reasoning problems. And crucially, the thing that makes this significant is that it's not just stochastically predicting the next token of words or whatever because it's learned that Sally corresponds to the murderer at the end of a Sherlock Holmes story.
No, if there is some shared thing between code and language, it must be at a deeper level than the model has learned. Yeah, I think we have a lot of evidence that actual reasoning is occurring in these models and that they're not just stochastic parrots.
It just feels very hard for me to believe that, having worked and played with these models. But normies who listen will be like, you know.
Yeah, my two immediate cached responses to this are, one, the work on Othello and now other games, where I give you a sequence of moves in the game, and it turns out that if you apply some pretty straightforward interpretability techniques, you can recover the board that the model has learned. And it's never seen the game board before or anything, right? That's generalization.
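The flavor of that probing result, sketched: fit a simple probe on the model's hidden activations and check whether the board state can be read out. The probe below is linear and the shapes and state encoding are illustrative assumptions, not the paper's actual code.

```python
import torch.nn as nn

class BoardProbe(nn.Module):
    """Predict the state of each of 64 squares (e.g. empty / mine / yours)
    from a single hidden activation vector."""
    def __init__(self, d_model=512, n_squares=64, n_states=3):
        super().__init__()
        self.probe = nn.Linear(d_model, n_squares * n_states)
        self.n_squares, self.n_states = n_squares, n_states

    def forward(self, activations):                   # activations: [batch, d_model]
        logits = self.probe(activations)
        return logits.view(-1, self.n_squares, self.n_states)

# If a probe this simple can recover the board from activations alone, the model
# has built an internal representation of a game state it was never shown directly.
```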
The other is Anthropic's influence functions paper that came out last year, where they look at model outputs like 'please don't turn me off, I want to be helpful,' and then they scan what data led to that. One of the data points that was very influential was someone dying of dehydration in the desert and having a will to keep surviving.
And to me, that just seems like very clear generalization of motive, rather than regurgitating 'don't turn me off.' I think 2001: A Space Odyssey was also one of the influential things. So that one's more related, but it's clearly pulling in things from lots of different distributions. And I also like the evidence you see even with very small transformers, where you can explicitly encode circuits to do addition. Or induction heads, this kind of thing. You can literally encode basic reasoning processes into the models manually.
And it seems clear that there's evidence that they also learn this automatically because you can then rediscover those from trained models. To me, this is really strong.
The models are underparameterized. We're asking them to do a very hard task.
And they want to learn. The gradients want to flow.
And so they're learning more general skills. Okay, so I want to take a step back from the research and ask about your careers specifically. Because, as the tweet I introduced you with implied, you've been in this field a year and a half, and I think you've only been in it for about a year, right? Yeah. But in that time, and I know the 'solved alignment' takes are overstated, and you won't say this yourself because you'd be embarrassed, but, you know, it's a pretty incredible thing, the thing that people in mechanistic interpretability really think is the biggest step forward. And you've been working on it for a year. It's notable.
So I'm curious how you explain what's happened. Why, in a year or a year and a half, have you guys made important contributions to your field? It goes without saying: luck, obviously.
And I feel like I've been very lucky in like the timing of different progressions has been just like really good in terms of advancing to the next level of growth. I feel like for the interpretability team specifically, I joined when we were five people.
We've now grown quite a lot. But there were so many ideas floating around and we just needed to really execute on them and have quick feedback loops and do careful experimentation that led to signs of life and have now allowed us to really scale.
And I feel like that's been my biggest value add to the team, which is not all engineering, but quite a lot of it has been. Interesting. So you're saying you came at a point where there had been a lot of science done and there was a lot of good research floating around, but they needed someone to just take that and maniacally execute on it? Yeah, yeah. And this is why it's not all engineering, because it's running different experiments, having a hunch for why something might not be working, then opening up the model, opening up the weights, asking what it is learning, okay, well, let me try this instead, and that sort of thing. But a lot of it has just been being able to do very careful, thorough, but quick investigation of different ideas or theories. And why was that lacking in the existing team? I don't know. I feel like, I mean, I work quite a lot.
And then I feel like I just am like quite agentic. Like if your question's about like career overall, and I've been very privileged to have like a really nice safety net to be able to take lots of risks.
But I'm just like quite headstrong. Like in undergrad, Duke had this thing where you could just make your own major.
And it was like, I don't like this prerequisite or this prerequisite. And I want to take all four or five of these subjects at the same time.
So I'm just going to make my own major. Or like in the first year of grad school, I like canceled rotation so I could work on this thing that became the paper we were talking about earlier.
And I didn't have an advisor; I got admitted to do machine learning for protein design and was just off in computational neuroscience land with no business being there at all, but it worked out. There's a headstrongness there. But it seemed like another theme that jumped out, and you were talking about this earlier, is the ability to step back from your sunk costs and go in a different direction, which in a weird sense is the opposite of that, but also a crucial step here. I know 21-year-olds or 19-year-olds who are like, this is not a thing I've specialized in, or, I didn't major in this. And I'm like, dude, you're 19, you can definitely do this. And you switching in the middle of grad school or something, that's... Yeah, sorry, I didn't mean to cut you off, but I think it's strong ideas loosely held. And being able to just pinball in different directions.
And the headstrongness I think relates a little bit to the fast feedback loops or agency in so much as I just don't get blocked very often. If I'm trying to write some code and something isn't working, even if it's in another part of the code base, I'll often just go in and fix that thing.
Or at least hack it together to be able to get results. And I've seen other people where they're just like, help, I can't.
And it's like, no, that's not a good enough excuse. Go all the way down.
I've definitely heard people in management type positions talk about the lack of such people. Where they'll check in on somebody a month after they give them a task or a week after they give them a task.
And like, how's it going? And they say, well, you know, we need to do this thing, which requires lawyers because it requires talking about this regulation. It's like, how's that going? And it's like, well, we need lawyers.
I'm like, why didn't you get lawyers? Or something like that. So that's definitely like, yeah.
I think that's arguably the most important quality in like almost anything.
It's just pursuing it to like the end of the earth and like whatever you need to do to make it happen, you'll make it happen.
If you do everything, you'll win.
If you do everything, you'll win.
Exactly.
But yeah, yeah, yeah.
I think from my side, definitely that quality has been important, like agency and work.
There are thousands, or probably even tens of thousands, of engineers at Google who are basically all of equivalent software engineering ability, let's say. If you gave us a very well-defined task, we'd probably all do it about as well.
A bunch of them would do it a lot better than me, you know, in all likelihood. But what I've been, like one of the reasons that I've been impactful so far is I've been very good at picking extremely high leverage problems.
So problems that haven't been, like, particularly well solved so far.
Perhaps as a result of frustrating structural factors, like the ones you pointed out in that scenario before, where it's, oh, we can't do X because this team won't do Y. And then going, okay, well, I'm just going to vertically solve the entire thing, right? And that turns out to be remarkably effective. Also, if I think there is something correct that needs to happen, I'm very comfortable making that argument and continuing to make that argument at escalating levels of criticality until that thing gets solved. And I'm also quite pragmatic about what I do to solve things. You get a lot of people who come in, as I said before, with a particular background or familiarity; they know how to do something, and they won't move beyond that. One of the beautiful things about Google, right, is that you can run around and get world experts in literally everything. You can sit down and talk to people who are optimization experts, TPU chip design experts, experts in different forms of pre-training algorithms or RL or whatever, and you can learn from all of them, and you can take those methods and apply them. And I think this was maybe the start of why I was initially impactful: this vertical agency, effectively. And then a follow-up piece from that is that I think it's often surprising how few people are fully realizing all the things they want to do. They're blocked or limited in some way. And this is very common in big organizations everywhere.
People like have all these blockers on what they're able to achieve. And I think being a, like one, helping inspire people to work on particular directions and working with them on doing things massively scales your leverage.
Like you get to work with all these wonderful people who teach you heaps of things and generally helping them push past organizational blockers means together you get an enormous amount done. None of the impact that I've had has been me individually going off and solving a whole lot of stuff.
It's been me maybe starting off a direction and then convincing other people that this is the right direction, and bringing them along in this big tidal wave of effectiveness that goes and solves that problem. We should talk about how you guys got hired, because I think that's a really interesting story. You were a McKinsey consultant, right? There's an interesting thing there where, first of all, I think people generally just don't understand how decisions are made about admissions or about evaluating who to hire.
Just talk about how you were noticed and how you got hired. So, the TL;DR: I studied robotics in undergrad. I always thought that AI would be one of the highest-leverage ways to impact the future in a positive way.
The reason I am doing this is because I think it is one of our best shots at making a wonderful future, basically. And I thought that working at McKinsey, I would get a really interesting insight into what people actually did for work.
And I actually wrote this as the first line in my cover letter to McKinsey, was like, I want to work here so that I can learn what people do so that I can understand how they work. And in many respects, I did get that.
I asked a whole lot of other things. Many of the people there are wonderful friends.
I actually learned, I think, a lot of this agentic behavior in part from my time there where you go into organizations and you see how impactful just not taking no for an answer gets you. Like you would be surprised at the kind of stuff where like because no one quite cares enough in some organizations, things just don't happen because no one's willing to take direct responsibility.
This is incredibly important: directly responsible individuals are ridiculously important. And people just don't care as much about timelines.
And so much of the value that an organization like McKinsey provides is hiring people who you were otherwise unable to hire, for a short window of time, where they can just push through problems. I think people underappreciate this. And so at least some of my attitude of, well, hold up, I'm going to become the directly responsible individual for this because no one's taking appropriate responsibility, I'm going to care a hell of a lot about this, and I'm going to go to the ends of the earth to make sure it gets done, comes from that time. But more to your actual question of how I got hired: the entire time, I didn't get into the grad programs that I wanted to get into over here, which were specifically focused on robotics and RL research and that kind of stuff. And in the meantime, on nights and weekends, basically every night from 10 p.m. to 2 a.m., I would do my own research. And every weekend, for at least six to eight hours each day, I would do my own research and coding projects and this kind of stuff.
And that switched, in part, from quite robotics-specific work to, after reading Gwern's scaling hypothesis post, getting completely scaling-pilled. I was like, okay, clearly the way that you solve robotics is by scaling large multimodal models. And then, in an effort to scale large multimodal models, I got a grant from the TPU access program, the TensorFlow Research Cloud.
I was trying to work out how to scale that effectively. And James Bradbury, who at the time was at Google and is now at Anthropic, saw some of my questions online where I was trying to work out how to do this properly.
He was like, I thought I knew all the people in the world who were asking these questions. Who on earth are you? And he looked at that and he looked at some of the robotic stuff that I've been putting up on my blog and that kind of thing.
And he reached out and said, hey, do you want to have a chat, do you want to explore working with us here? And I was hired, as I understood later, as an experiment in trying to take someone with extremely high enthusiasm and agency and pair them with some of the best engineers that he knew. And so another one of the reasons I could say I've been impactful is that I had this dedicated mentorship from utterly wonderful people, people like Reiner Pope, who has since left to go do his own chip company, and James himself, and many others. But those were the formative two to three months at the beginning.
And they taught me a whole lot of the principles and heuristics that I apply, how to solve problems in the way that they do, particularly in that overlap between systems and algorithms. One more thing that makes you quite effective in ML research is really concretely understanding the systems side of things. And this is something I learned from them, basically: a deep understanding of how systems influence algorithms and how algorithms influence systems.
Because the systems constrain the design space, the solution space, which you have available to yourself in the algorithm side. And very few people are comfortable fully bridging that gap.
But a place like Google, you can just go and ask all the algorithms experts
and all the systems experts everything they know,
and they will happily teach you.
And if you go and sit down with them,
they will teach you everything they know, and it's wonderful.
And this has meant that I've been able to be very, very effective
for both sides, for the pre-training crew,
because I understand systems very well.
I can intuit and understand whether this will work well or this won't, and flow that on through to the inference considerations of models and that kind of thing. And for the chip design teams, I'm one of the people they turn to to understand what chips they should be designing in three years, because I'm one of the people who's best able to understand and explain the kind of algorithms that we might want to be running in three years.
And obviously you can't make very good guesses about that, but I think I accumulate the information well from all of my compatriots on the pre-training crew and the general systems side, and convey that information well to them.
Because also even inference applies a constraint to pre-training. And so there's these trees of constraints where if you understand all the pieces of the puzzle, then you get a much better sense for what the solution space might look like.
There's a couple of things that stick out to me there. One is not just the agency of the person who was hired, but the parts of the system that were able to think, wait, that's really interesting.
Who is this guy? Not from a grad program or anything. You know, like currently a McKinsey consultant, just like an undergrad.
But that's interesting. Let's like give this a shot.
Right. So James and whoever else that's like, that's very notable.
And the second is, I actually didn't know this part of the story, that it was part of an experiment run internally about, can we do this? Can we bootstrap somebody? And in fact, what's really interesting about that is the third thing you mentioned: having somebody who understands all layers of the stack and isn't so stuck on any one approach or any one layer of abstraction is so important.
And specifically, what you mentioned about being bootstrapped immediately by these people might have meant that, since you're getting up to speed on everything at the same time, rather than spending grad school going deep on one specific way of doing RL, you can actually take the global view and aren't totally bought in. So not only is it something that's possible, it potentially has greater returns than just hiring somebody out of a grad school, because this person can just, I don't know, it's like getting a GPT-8 and fine-tuning them on one year of... you know what I mean? So yeah, that's a really good story.
You come at everything with fresh eyes and you don't come in lock to any particular field.
Now, one caveat to that is that during my self-experimentation and stuff, I was reading everything I could. I was obsessively reading papers every night.
Actually, funnily enough, I read much less widely now that my day is occupied by working on things. And in some respects I had this very broad perspective before, in a way that not that many people do; even in a PhD program, you'll focus on a particular area.
If you just like read all the NLP work and all the computer vision work and like all the robotics work, you like see all these patterns that start to emerge across subfields in a way that I guess like foreshadowed some of the work that I would later do. That's super interesting.
One of the reasons that you've been able to be agentic within Google is that you're pair programming half the days, or most of the days, with Sergey Brin, right? And so that's really interesting, that there's this person who's willing to just push ahead on this LLM stuff and get rid of the local blockers.
I think it's important to caveat that. It's not like it's every day or anything that I'm pairing with him, but when there are particular projects that he's interested in, then we'll work together on those. There have also been times when he's been focused on projects with other people. But in general, yes, there's a surprising alpha to being one of the people who actually goes down to the office every day.
That really shouldn't be, but is, surprisingly impactful. And as a result, I've benefited a lot from basically being close friends with people in leadership who care, and being able to argue convincingly about why we should do X as opposed to Y.
And having that vector... Google is a big organization. Having those vectors helps a little bit.
But also it's very important. It's the kind of thing you don't want to ever abuse, right? Like you want to make the argument through all the right channels.
And only sometimes do you need to. And that's people like Sergey Brin, Jeff Dean, and so forth.
I mean, it's notable. I don't know.
I feel like Google is undervalued given that, I don't know, it's like Steve Jobs working on the equivalent of the next product for Apple. Right.
I mean, like I've benefited immensely from like, okay, so for example, during the Christmas break, I was just going into the office a couple days during that time. It sounded like quite a lot of days.
Okay, quite a lot of days. Christmas day.
Christmas day. And I don't know if you guys have read that article about Jeff and Sanjay doing the pair programming, but they were there pair programming on stuff.
And I got to hear about all these cool stories of early Google, where they're talking about crawling under the floorboards and rewiring data centers, and telling me how many bytes they were shaving off a given compiler instruction, and all these crazy little performance optimizations they were doing. They were having the time of their lives.
I got to sit there and really experience this sense of history in a way that you don't expect to get. You expect to be very far away from all that, I think, maybe in a large organization.
Yeah. That's super cool.
Trenton, does this map onto any of your experience? I think Sholto's story is more exciting. Mine was just very serendipitous, in that I got into computational neuroscience.
I didn't have much business being there. My first paper was mapping the cerebellum to the attention operation in transformers.
My next ones were looking at like sparsity. How old were you when you wrote that? It was my first year of grad school.
Okay. So 22.
Oh, yeah. But yeah, my next work was on sparsity in networks, like inspired by sparsity in the brain, which was when I met Tristan Hume.
And Anthropic was doing the SOLU, the softmax linear output unit work, which was very related in quite a few ways of like, let's make the activation of neurons across a layer really sparse. And if we do that, then we can get some interpretability of what the neuron's doing.
I think we've updated on that approach towards what we're doing now. So that started the conversation.
I shared drafts of that paper with Tristan. He was excited about it.
And that was basically what led me to become Tristan's resident and then convert to full-time. But during that period, I also moved as a visiting researcher to Berkeley and started working with Bruno Olshausen, both on what's called vector symbolic architectures, one of the core operations of which is literally superposition.
And on sparse coding, also known as dictionary learning, which is literally what we've been doing since. And Bruno Olshausen basically invented sparse coding back in 1997.
And so it was like my research agenda and the interpretability team seemed to just be running in parallel with just research taste. And so it made a lot of sense for me to work with the team.
And it's been a dream since. One thing I've noticed when people tell stories about their careers or their successes, they ascribe it way more to contingency.
But when they hear about other people's stories, they're like, of course it wasn't contingent. You know what I mean? It's like, if that didn't happen, something else would have happened.
I've just noticed that pattern when I talk to people, and it's interesting that you both think that it was especially contingent. Whereas, I don't know, maybe you're right, but it's a sort of interesting pattern. Yeah, but I mean, I literally met Tristan at a conference and didn't have a scheduled meeting or anything, just joined a little group of people chatting, and he happened to be standing there and I happened to mention what I was working on.
And that led to more conversations. And I think I probably would have applied to Anthropic at some point anyways, but I would have waited at least another year.
Yeah, it's still crazy to me that I can actually contribute to interpretability in a meaningful way. I think there's an important aspect of shots on goal there, so to speak, right? Even just choosing to go to conferences is putting yourself in a position where luck is more likely to happen.
And conversely, in my own situation, doing all of this work independently and trying to produce and do interesting things was my own way of trying to manufacture luck, so to speak, and to try and do something meaningful enough that it got noticed.
Given that you framed this in the context of them trying to run this experiment of, can something like this work... So specifically James and, I think, our manager Brennan were trying to run this experiment.
It worked. Did they do it again? Yeah.
So my closest collaborator, Enrique, he crossed from search to our team. He's also been ridiculously impactful.
He's definitely a stronger engineer than I am. And he didn't go to university.
What was notable about, for example, James Bradbury is that usually this kind of stuff is farmed out to recruiters or something like that. Whereas James is somebody whose time is worth hundreds of millions of dollars.
You know what I mean? So that thing is very bottlenecked on that kind of person taking the time, almost in an aristocratic-tutoring sense, to find someone and then get them up to speed. And it seems like, if it works that well, it should be done at scale.
It should be the responsibility of key people to find and onboard people. I think that is true to a large extent.
I'm sure you probably benefited a lot from the key researchers mentoring you deeply during the work. And actively looking on open source repositories or on forums or whatever for potential people like this.
Yeah. I mean, James is like, Twitter injected into his brain.
Yeah. Into his brain.
That's right.
But yes. And I think this is something which in practice is done.
Like people do look out for people that they find interesting and like try and find high signal. In fact, actually, I was talking about this with Jeff the other day.
And Jeff said that, yeah, one of the most important hires I ever made was off a cold email. And I was like, well, who was that? And he said, Chris Olah.
Ah, yeah. Because Chris similarly had, well, no formal background in ML, right? And Google Brain was just getting started on this kind of thing, but Jeff saw that signal. And the residency program which Brain had was, I think, also astonishingly effective at finding good people that didn't have strong ML backgrounds. And yeah, one of the other things that I want to emphasize, for a potential slice of the audience this would be relevant to, is there's this sense that the world is legible and efficient.
Companies have these, go to jobs.google.com or jobs.whatevercompany.com and you apply, and there are the steps, and they will evaluate you efficiently on those steps. Whereas, not just from the stories you two are telling, often that's not the way it happens.
In fact, it's good for the world that that's often not how it happens. It is important to look at: were they able to write an interesting technical blog post about their research, or make interesting contributions? Yeah, I want you to riff on that for the people who are assuming that the other end of the job board is just super legible and mechanical.
This is not how it works. And in fact, people are looking for the sort of different kind of person who's authentic and putting stuff out there.
And I think specifically what people are looking for there is two things. One is agency and putting yourself out there.
And the second is the ability to do world-class something. Yeah.
And two examples that I always like to point to here are Andy Jones from Anthropic did an amazing paper on scaling laws as applied to board games. It didn't require much resources.
It demonstrated incredible engineering skill. It demonstrated incredible understanding of the most topical problem of the time.
And he didn't come from a typical academic background or whatever. As I understand it, basically, as soon as he came out with that paper, both Anthropic and OpenAI, we were like, we would desperately like to hire you.
There's also someone who now works on Anthropic's performance team, Simon Boehm, who has written, in my mind, the reference for optimizing a CUDA matmul on a GPU. And that demonstrated example of taking some prompt, effectively, and producing the world-class reference example for it, in something that wasn't particularly well done so far, is, I think, an incredible demonstration of ability and agency that, in my mind, would be an immediate "we would love to interview you, slash, hire you."
Yeah, the only thing I can add here is, I mean, I still had to go through the whole hiring process and all the standard interviews and this sort of thing. Yeah, everyone does.
Wait, doesn't that seem stupid? I mean, it's important for de-biasing. Yeah, yeah, yeah.
And there's, like, de-biasing toward what you want, right? You want somebody who's got great taste, and it's like, who cares? Your interview process should be able to disambiguate that as well. Yeah, I think there are cases where someone seems really great and then it's like, oh, they actually just can't code.
This sort of thing, right? How much you weight these things definitely matters, though. And I think we take references really seriously.
The interviews, you can only get so much signal from. And so it's all these other things that can come into play for whether or not a hire makes sense.
But you should design your interviews such that they test the right things. One man's bias is another man's taste, you know? I guess the only thing I would add to this, or maybe to the headstrong context, is there's this line: the system is not your friend.
Right.
And that's not necessarily to say it's actively against you, or that it's your sworn enemy. It's just not looking out for you.
Right.
And so I think that's where a lot of the proactiveness comes in. There are no adults in the room. Yeah. And you have to come to some decision about what you want your life to look like and execute on it, and hopefully you can update later if you're too headstrong in the wrong way. But I think you almost have to just charge at certain things to get much of anything done, and not be swept up in the tide of whatever the expectations are. There's one final thing I want to add, which is, we talked a lot about agency and this kind of stuff.
But I think actually, surprisingly enough, one of the most important things is just caring an unbelievable amount. When you care an unbelievable amount, you check all the details, and you have this understanding of what could have gone wrong. It just matters more than you think, because people end up not caring, or not caring enough. There's this LeBron quote where he talks about how, before he got into the league, he was worried that everyone would be incredibly good. Then he gets there and realizes that actually, once people hit financial stability, they relax a bit, and he's like, oh, this is going to be easy.
I don't think that's quite true in AI research, because most people actually care quite deeply. But there's caring about your problem, and there's also caring about the entire stack and everything that goes up and down it, explicitly going and fixing things that aren't your responsibility to fix.
Because overall, it makes like the stack better. I mean, another part that I forgot to mention is you were mentioning going in on weekends and on Christmas break and you get to like the only people in the office are Jeff Dean and Sergey Brin or something.
And you just get to pair program with them. it's just it's interesting to me the people i don't want to pick on your company in particular
but like people at any big company, they've gotten there because they've gone through a very selective process. They had to compete in high school, they had to compete in college.
But it almost seems like they get there and then they take it easy. When in fact, this is the time to put the pedal to the metal, go in and pair program with Sergey Brin on the weekends or whatever, you know what I mean? I mean, there's pros and cons there, right? I think many people make the decision that the thing they want to prioritize is a wonderful life with their family. And they do wonderful work; let's say they don't work every hour of the day, but the hours that they do work are incredibly impactful. I think this is true for many people at Google: maybe they don't work as many hours as your typical startup mythology says, right? But the work that they do is incredibly valuable.
It's very high leverage because they know the systems and they're experts in their field. And we also need people like that.
Like, our world rests on these huge, difficult-to-manage and difficult-to-fix systems. And we need people who are willing to work on and help and fix and maintain those in, frankly, a thankless way that isn't as high-publicity as all of this AI work that we're doing.
And I am ridiculously grateful that those people do that. And I'm also happy that there are people for whom they find technical fulfillment in their job and doing that well.
And also, like, maybe they draw a lot more fulfillment out of spending, like, a lot of hours with their family.
And I'm lucky that I'm at a stage in my life where, like, yeah, I can go in and work every hour of the week.
But, like, that's, like, I'm not making as many sacrifices to do that.
Yeah.
I mean, just one example that comes to mind of this sort of thing, where the other side says no and you can still get the yes on the other end: basically every single high-profile guest I've gotten so far, I think maybe with one or two exceptions, I've sat down for a week and just come up with a list of sample questions.
You know, tried to come up with really smart questions to send to them. And the entire process, I've always thought, if I just cold email them, it's like a 2% chance they say yes; if I include this list, there's a 10% chance. Because otherwise, you know, you go through their inbox and every 34 seconds there's an interview request for whatever podcast. And every single time I've done this, they've said yes.
It's just like, you dig. You ask great questions.
But if you do that, you'll win. You just, you literally have to dig in the same hole for like 10 minutes.
Or in that case, make a list of sample questions for them, to get past their "not an idiot" list. You know what I mean? And just...
Demonstrate how much you care. Yeah, yeah.
Yeah. And the work you're willing to put in.
Yeah. Something that a friend said to me a while back, which I think has stuck, is: it's amazing how quickly you can become world-class at something, just because most people aren't trying that hard, and are only working, I don't know, the actual 20 hours that they're actually spending on the thing, or something. And so, yeah, if you just go ham, then you can get really far pretty fast.
And I think I'm lucky I had that experience with the fencing as well. I had the experience of becoming world-class at something, and knowing that if you just worked really, really hard and were...
Yeah.
For context, by the way, Sholto was one seat away. He was the next person in line to go to the Olympics for fencing.
I was at best like 42nd in the world for fencing.
For men's foil fencing. Mutational load is a thing, man.
And there was one cycle where, yeah, I was the next highest-ranked person in Asia. And if one of the teams had been disqualified for doping, as was occurring in part during that cycle, and as happened for the Australian women's rowing team, I think, which went because another team was disqualified, then I would have been the next in line.
It's interesting when you just find out about people's prior lives and it's like, oh, this guy was almost an Olympian, this other guy was whatever, you know what I mean? Okay, let's talk about interpretability. Yeah.
I actually want to stay on the brain stuff as a way to get into it for a second. We were previously discussing: is the brain organized in the way where you have a residual stream that is gradually refined with higher-level associations over time, or something like that? There's a fixed dimension size in a model. If you had to, and I don't even know how to ask this question in a sensible way, but what is the d_model of the brain? What is its embedding size? Or, because of feature splitting, is that even a sensible question? No, I think it's a sensible question.
Well, it is a question that makes sense.
You could have just not said that.
No, no, I'm just a question.
You can talk just like actively.
I'm trying to... I don't know how you would begin to say, okay, well, this part of the brain is like a vector of this dimensionality. I mean, for the visual stream, because it goes V1 to V2 to whatever, you could just count the number of neurons that are there and say that is the dimensionality. But it seems more likely that there are sub-modules and things are divided up. So, yeah, I don't have a good answer, and I'm not the world's best neuroscientist, right? I did it for a few years.
I studied the cerebellum quite a bit. So I'm sure there are people who could give you a better answer on this.
Do you think that the way to think about it, whether in the brain or in these models, is that fundamentally what's happening is features are added, removed, changed, and the feature is the fundamental unit of what is happening in the model? What would have to be true for... give me, and this goes back to the earlier thing we were talking about, whether it's just associations all the way down, give me a counterfactual, a world where this is not true. What is happening instead? What is the alternative hypothesis here? Yeah, it's hard for me to think about, because at this point I just think so much in terms of this feature space.
I mean, at one point there was the kind of behaviorist approach towards cognition, where it's like, you're just input-output, but you're not really doing any processing. Or, everything is embodied and you're just a dynamical system that's operating along some predictable equations, but there's no state in the system, I guess.
But whenever I've read these sorts of critiques, it's like, well, you're just choosing to not call this thing a state, but you could call any internal component of the model a state. Even with the feature discussion, defining what a feature is, is really hard.
And so the question feels almost too slippery. What is a feature? A direction in activation space.
A latent variable that is operating behind the scenes and has causal influence over the system you're observing. It's a feature if you call it a feature.
It's tautological. These are all explanations that I feel some association with, in a very rough, intuitive sense. In a sufficiently sparse, binary vector, features are like whether or not something's turned on or off, right? In a very simplistic sense. Yeah, which might be a useful metaphor to understand it by. When we talk about features activating, it is in many respects the same way that neuroscientists would talk about a neuron activating, if that neuron corresponds to something in particular, right? Yeah. And I think that's useful as, what do we want a feature to be, right? What is a synthetic problem under which a feature exists? But even with the Towards Monosemanticity work, we talk about what's called feature splitting, which is basically: you will find as many features as you give the model the capacity to learn.
And by model here, I mean the up projection that we fit after we trained the original model.
And so if you don't give it much capacity, it'll learn a feature for bird.
But if you give it more capacity, then it will learn ravens and eagles and sparrows, specific types of birds. Still on the definitions thing: I guess naively I think of things like bird, versus what kind of token this is, like a period at the end of a hyperlink, as you were talking about earlier, versus, at the highest level, things like love or deception or holding a very complicated proof in your head or something.
Is this all features? Because then the definition seems so broad as to almost not be that useful. Rather, there seem to be some important differences between these things, even if they're all features.
I'm not sure what we even mean by... All of those things are discrete units that have connections to other things, which then imbue them with meaning.
That feels like a specific enough definition that it's useful and not too all-encompassing, but feel free to push back. What would you discover tomorrow that could make you think, oh, this is kind of fundamentally the wrong way to think about what's happening in a model? I mean, if the features we were finding weren't predictive, or if they were just representations of the data, where it's like, oh, all you're doing is just clustering your data.
And there are no higher-level associations being made. Or it's some phenomenological thing where you're saying that this feature fires for marriage, but if you activate it really strongly, it doesn't change the outputs of the model in a way that would correspond to it. I think these would both be good critiques.
I guess one more is, and we tried to do experiments on MNIST, which is a data set of digits, images, and we didn't look super hard into it. And so I'd be interested if people, other people wanted to take up like a deeper investigation.
But it's plausible that your like latent space of representations is dense, and it's a manifold instead of being these discrete points. And so you could like move across the manifold, but at every point, there would be some meaningful behavior.
And it's much harder then to label things as features that are discrete.
In a naive sort of outsider way, the thing that would seem to me to be a way in which this picture could be wrong is if it's not "this thing is turned on, this thing is turned off," but something much more global about the system. I'm going to use really clumsy, the-way-you'd-mention-it-at-a-party kind of language here, but is there a good analogy? Yeah, I guess if you think of something like the laws of physics, it's not like, well, the feature for wetness is turned on, but it's only turned on this much, and then the feature for... you know, I guess maybe it is true, because mass is a gradient and, I don't know, the polarity or whatever is a gradient as well.
But there's also a sense in which there are the laws, and the laws are more general, and you have to understand the general, bigger picture. You don't get that from just these specific sub-circuits. That's where the reasoning circuit itself comes into play, right? Where you're taking these features, ideally, and trying to compose them into something higher level. You might say, okay, when I'm using, you know, F = ma, and this is my head canon, then presumably at some point I have features which denote mass, and that's helping me retrieve the actual mass of the thing that I'm using, and then the acceleration and this kind of stuff.
But then also maybe there's a higher-level feature that does correspond to using that law of physics. Maybe, but the more important part is the composition of components, which helps me retrieve relevant pieces of information and then produce, maybe, something like a multiplication operator when necessary.
At least that's my head canon. What would be a compelling explanation to you, especially for very smart models, of "I understand why it made this output, and it was for a legit reason"?
If it's doing million-line pull requests or something, what are you seeing at the end of that request where you're like, yep, that's chill? Yeah, so ideally you apply dictionary learning to the model. You've found features.
Right now we're actively trying to get the same success for attention heads, in which case we'd have features for all the core components: you can do it for the residual stream, the MLPs, and attention throughout the whole model.
Hopefully at that point you can also identify broader circuits through the model, more general reasoning abilities, that will activate or not activate. But in your case, where we're trying to figure out if this pull request should be approved or not, I think you can flag or detect features that correspond to deceptive behavior, malicious behavior, these sorts of things, and see whether or not those have fired.
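To make that concrete, here is a minimal sketch in Python of what "see whether those features fired" could look like. Everything here is hypothetical: `sae_encode` stands in for whatever trained dictionary you have, and `deception_feature_ids` assumes you have already labeled some features as deception-related.

```python
import numpy as np

def flag_suspicious_features(activations, sae_encode, deception_feature_ids, threshold=0.0):
    """Project residual-stream activations into the sparse feature basis and
    report which previously labeled 'deception' features fired on any token.

    activations:           (n_tokens, d_model) array collected while the model wrote the PR
    sae_encode:            callable mapping (n_tokens, d_model) -> (n_tokens, n_features)
    deception_feature_ids: indices we have (hypothetically) labeled as deception-related
    """
    features = sae_encode(activations)                    # sparse feature activations
    fired = features[:, deception_feature_ids] > threshold
    return {fid: np.flatnonzero(fired[:, i]).tolist()     # token positions where it fired
            for i, fid in enumerate(deception_feature_ids)
            if fired[:, i].any()}
```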
That would be an immediate check. You can do more than that, but that would be an immediate one. But before I chase that down, what does the reasoning circuit look like? What would it look like when you found it? Yeah, so the induction head is probably one of the simplest cases of this.
But that's not really reasoning, right? Well, I mean, what do you call reasoning, right? Like, it's a good reason. So I guess, context for listeners: the induction head is basically, you see a line like "Mr. and Mrs. Dursley did something, Mr. blank," and you're trying to predict what blank is. And the head has learned to look for previous occurrences of the word "Mr.", look at the word that comes after it, and then copy and paste that as the prediction for what should come next. Which is a super reasonable thing to do, and there is computation being done there to accurately predict the next token. But yeah, it is context-dependent. But it's not like reasoning, you know what I mean? But, I guess going back to the associations-all-the-way-down thing, it's like if you chain together a bunch of these reasoning circuits, or heads that have different rules for how to relate information... But in this sort of zero-shot case, something is happening where you pick up a new game and you immediately start understanding how to play it, and that doesn't seem like an induction-heads kind of thing. Well, I think there would be another circuit for extracting pixels and turning them into latent representations of the different objects in the game, right? And a circuit that is learning physics.
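The rule the induction head implements is simple enough to write down directly. This is only a toy re-implementation of the pattern in Python, not the actual attention-head computation:

```python
def induction_predict(tokens):
    """Toy version of the induction-head rule: find the most recent earlier
    occurrence of the current token and copy the token that followed it."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan the context backwards
        if tokens[i] == current:
            return tokens[i + 1]               # "copy and paste" what came next last time
    return None                                # no earlier occurrence to copy from

print(induction_predict(["Mr", "Dursley", "was", "proud", ".", "Mr"]))  # -> "Dursley"
```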
And what would that be? Because the induction head is like a one-layer transformer.
Or two layers, yeah.
So you can kind of see that. But the thing where a human picks up a new game and understands it, how would you think about what that is? Presumably it's across multiple layers, but, yeah.
What would that physically look like? How big would it be, maybe? Or, I mean, that would just be an empirical question, right? How big does the model need to be to perform this task? But maybe it's useful if I just talk about some other circuits that we've seen. So we've seen the IOI circuit, which is indirect object identification.
And so this is like, if you see "Mary and Jim went to the store, Jim gave the object to blank," right? It would predict Mary, because Mary has appeared before as the indirect object, or it'll infer pronouns, right? And this circuit even has behavior where, if you ablate it, other heads in the model will pick up that behavior. We'll even find heads that want to do copying behavior, and then other heads will suppress it. So it's one head's job to just always copy the token that came before, for example, or the token that came five before or whatever, and then it's another head's job to be like, no, do not copy that thing. So there are lots of different circuits performing, in these cases, pretty basic operations.
But when they're chained together, you can get unique behaviors. And is the story of how you'd find it, with the reasoning thing, that... because you won't be able to understand it, or it won't be something you can see in a two-layer transformer.
So will it just be that the circuit for deception, or whatever, is the part of the network that fired when, at the end, we identified the thing as being deceptive, and it didn't fire when we did not find it to be deceptive?
Therefore, this must be the deception circuit. I think a lot of analysis like that.
Like Anthropic has done quite a bit of research before on sycophancy, which is like the model saying what it thinks you want to hear. And that requires us at the end to be able to label which one is bad and which one is good.
Yeah, so we have tons of instances. And actually, as you make models larger, they do more of this, where the model is clearly, it has features that model another person's mind.
And these activate, and some subset of these, we're hypothesizing here, would be associated with more deceptive behavior. Although it's doing that by... I don't know, ChatGPT, I think, is probably modeling me, because that's what RLHF induces. Yeah, theory of mind. Yeah. So, first of all, the thing you mentioned earlier about redundancy: it's like, well, have you caught the whole thing that could cause deception, or is it just one instance of it? Second of all, are your labels correct? You know, maybe you thought this wasn't deceptive and it's still deceptive, especially if it's producing output you can't understand. Third, is the thing that's going to be the bad outcome something that's even human-understandable? Deception is a concept we can understand; maybe there's... yeah. So, a lot to unpack here. I guess a few things. One, it's fantastic that these models are deterministic.
When you sample from them, it's stochastic, right? But like, I can just keep putting in more inputs and ablate every single part of the model. This is kind of the pitch for computational neuroscientists to come and work on interpretability.
It's like, you have this alien brain and you have access to everything in it and you can just ablate however much of it you want. And so I think if you do this carefully enough, you really can start to pin down what are the circuits involved, what are the backup circuits, these sorts of things.
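Here is a sketch of what one of those ablation experiments could look like, with a hypothetical `run_model` wrapper standing in for whatever hooking utility you use; nothing here is a real library API:

```python
import numpy as np

def ablation_effect(run_model, neuron_ids, prompt):
    """Zero-ablate a set of neurons at one layer and measure how much the output
    logits shift. `run_model(prompt, edit_fn=None)` is a hypothetical wrapper
    that returns logits and optionally applies `edit_fn` to one intermediate
    activation tensor before the forward pass continues.
    """
    baseline = run_model(prompt)

    def zero_out(acts):                  # acts: (n_tokens, d) at the hooked layer
        acts = acts.copy()
        acts[:, neuron_ids] = 0.0        # knock out the candidate circuit
        return acts

    ablated = run_model(prompt, edit_fn=zero_out)
    # A large shift is evidence these neurons mattered for the behavior;
    # a small one suggests redundancy or a backup circuit picking up the slack.
    return np.abs(baseline - ablated).mean()
```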
The kind of cop-out answer here, but it's important to keep in mind, is doing automated interpretability. So it's like as our models continue to get more capable, having them assign labels or like run some of these experiments at scale.
And then with respect to like if there's superhuman performance, how do you detect it? Which I think was kind of the last part of your question. Aside from the cop-out answer, if we buy this associations all the way down, you should be able to coarse grain the representations at a certain level such that they then make sense.
I think it was even in Demis' podcast, he's talking about like if a chess player makes a superhuman move, they should be able to distill it into reasons why they did it. And like, even if the model is not going to tell you what it is, you should be able to decompose that complex behavior into simpler circuits or features to really start to make sense of why it did the thing that it did.
There's a separate question of whether such a representation exists, which it seems like it must, or actually I'm not sure if that's the case. And secondly, whether, using this sparse autoencoder setup, you could find it.
And in this case, if you don't have labels for it that are adequate to represent it, you wouldn't find it, right?
Yes and no.
So we are actively trying to use dictionary learning now on the sleeper agents work, which we talked about earlier. And it's like, if I just give you a model, can you tell me if there's a trigger in it that's going to make it start doing interesting behavior? And it's an open question whether, when it learns that behavior, it's part of a more general circuit that we can pick up on without actually getting activations from it displaying that behavior, right? Because that would kind of be cheating then. Or if it's learning some hacky trick, like a separate circuit that you'll only pick up on if you actually have it do that behavior.
But even in that case, the geometry of features gets really interesting. Because like, fundamentally, each feature, like is in some part of your representation space.
And they all exist with respect to each other. And so in order to have this new behavior, you need to carve out some subset of the feature space for the new behavior, and then push everything else out of the way to make space for it.
So hypothetically, you can imagine you have your model before you've taught it this bad behavior. You know all the features or have some coarse-grained representation of them.
You then fine-tune it such that it becomes malicious. And then you can kind of identify this black hole region of feature space where everything else has been shifted away from it.
And there's this region and you haven't put in an input that causes it to fire. But then you can start searching for what is the input that would cause this part of the space to fire.
What happens if I activate something in this space? There are a whole bunch of other ways that you can try and attack that problem. This is sort of a tangent, but one interesting idea I heard was, if that space is shared between models, you can imagine trying to find it in an open-source model. Like Gemma, by the way, Google's newly released open-source model: they said in the paper it's trained using the same architecture, or something like that.
To be honest, I don't know, because I haven't read the Gemma paper. It's a similar method, whatever, as Gemini.
So, to the extent that's true, I don't know, how much of the red teaming you do on Gemma potentially helps you jailbreak Gemini? Yeah, this gets into the fun space of how universal features are across models.
And our Towards Monosemanticity paper looked at this a bit. And we find, I can't give you summary statistics, but the base64 feature, for example, which we see across a ton of models. There are actually three of them, but they fire for base64-encoded text, which is prevalent in basically every URL, and there are lots of URLs in the training data. They have really high cosine similarity across models, so they all learn this feature. And I mean within a rotation, right? But it's like, yeah, yeah.
Like the actual vectors themselves. Yeah, yeah.
And I wasn't part of this analysis, but yeah, it definitely finds the feature, and they're pretty similar to each other across two separate models with the same architecture but trained with different random seeds. It supports the quanta theory of neural scaling as a hypothesis, right? Which is that all models trained on a similar dataset will learn the same features in roughly the same order.
You learn your n-grams, you learn your induction heads, and you learn to put full stops after numbered lines, this kind of stuff.
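A rough sketch of how you might quantify that kind of cross-model feature similarity, assuming you have decoder matrices from dictionaries trained separately on each model, and that matching features greedily by best cosine is an acceptable way to handle the fact that feature indices won't line up:

```python
import numpy as np

def best_match_cosines(decoder_a, decoder_b):
    """For each feature direction in model A's dictionary, find the most similar
    direction in model B's dictionary by cosine similarity.

    decoder_a, decoder_b: (n_features, d_model) arrays of feature vectors from
    sparse autoencoders trained separately on each model.
    """
    a = decoder_a / np.linalg.norm(decoder_a, axis=1, keepdims=True)
    b = decoder_b / np.linalg.norm(decoder_b, axis=1, keepdims=True)
    sims = a @ b.T                 # (n_a, n_b) all-pairs cosine similarities
    return sims.max(axis=1)        # a shared base64-style feature would show up near 1.0
```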
Hey, but by the way, okay, so this is another tangent. To the extent that that's true, and I guess there's evidence that it's true, why doesn't curriculum learning work? Because if it is the case that you learn certain things first, shouldn't directly training on those things first lead to better results? Both Gemini papers mentioned some aspects of curriculum learning. Okay, interesting. I mean, I find the fact that fine-tuning works to be evidence for curriculum learning, right? Because the last things you're training on have a disproportionate impact. I wouldn't necessarily say that. There's one mode of thinking in which fine-tuning is specialization: you've got this latent bundle of capabilities and you're specializing it for a particular use case. Yeah, and I'm not sure how true that is. I think the David Bau lab paper kind of supports this, right? You have that ability and you're just getting better at entity recognition, fine-tuning that circuit instead of other ones. Yeah. Sorry, what was the thing we were talking about? But generally, I do think curriculum learning is really interesting. People should explore it more.
And it seems very plausible. I would really love to see more analysis along the lines of the quanta stuff, understanding better what you actually learn at each stage, decomposing that out, and exploring whether or not curricula change that.
By the way, I just realized I got into conversation mode and forgot there's an audience. Curriculum learning is when you organize a dataset the way you'd think about a human learning. They don't just see random wiki text and try to predict it, right? It's like, we'll start you off with The Lorax or something, and then you'll learn, I don't even remember what first grade was like, but you learn the things that first graders learn, and then second graders, and so forth. And so you'd imagine that... Sorry, we know you never got past first grade. Okay, anyways, let's get back to the big picture before we get into a bunch of interp details. There's two threads I want to explore.
First is, I guess it makes me a little worried that there's not even an alternative formulation of what could be happening in these models that could invalidate this approach. Which feels like... I mean, we do know that we don't understand intelligence, right? There are definitely unknown unknowns here. So the fact that there's not a null hypothesis, I don't know. But what if we're just wrong, and we don't even know the way in which we're wrong, which actually increases the uncertainty? Yeah, yeah. So it's not that there aren't other hypotheses. It's just, I have been working on superposition for a number of years, and I'm very involved in this effort, and so I'm less sympathetic to, or, as you just said, maybe they're wrong, to these other approaches, especially because our recent work has been so successful. Yeah. And there's quite high explanatory power. There's this beautiful thing in the scaling laws paper, the original scaling laws paper: there's a little bump at a particular point.
And that apparently corresponds to when the model learns induction heads. The loss curve sort of goes off track, the model learns induction heads, and it gets back on track.
Yeah, yeah. Which is an incredible piece of retroactive explanatory power.
Before I forget it, though, I do have one thread on feature universality that you might want to get in. So there are some really interesting behavioral evolutionary biology experiments on whether humans should learn a real representation of the world or not. You can imagine a world in which we saw all venomous animals as flashing neon pink, a world in which we'd survive better.
And so it would make sense for us to not have a realistic representation of the world. And there's some work where they'll simulate little basic agents and see if the representations they learn map to the tools they can use and the inputs they should have.
And it turns out, if you have these little agents perform more than a certain number of tasks, given these basic tools and objects in the world, then they will learn a ground-truth representation. Because there are so many possible use cases for these base objects that you actually want to learn what the object actually is, and not some cheap visual heuristic or other thing. And so, and we haven't talked at all about Friston's free energy principle or predictive coding or anything else, but to the extent that all living organisms are trying to actively predict what comes next and form a really accurate world model, it wouldn't surprise me, or I'm optimistic, that we are learning genuine features about the world that are good for modeling it, and our language models will do the same, especially because we're training them on human data and human text. Another dinner party question: shouldn't we be less worried about misalignment, and maybe that's not even the right word for what I'm referring to, but just alienness and shoggoth-ness from these models, given that there is feature universality and there are certain ways of thinking and ways of understanding the world that are instrumentally useful to different kinds of intelligences? Should we just be less worried about bizarro paperclip maximizers as a result? I think this is kind of why I bring this up as the optimistic take.
Predicting the internet is very different from what we're doing. The models are way better at predicting next tokens than we are.
They're trained on so much garbage. They're trained on so many URLs.
Like in the dictionary learning work, we find there are like three separate features for base 64 encodings. And like, even that is kind of an alien example that is probably worth me talking about for a minute.
Like, one of these base64 features fired for numbers: if it sees base64 numbers, it'll predict more of those. Another fired for letters. But then there was this third one that we didn't understand, and it fired for a very specific subset of base64 strings. And someone on the team, who clearly knows way too much about base64, realized that this was the subset that was ASCII-decodable, so you could decode it back into ASCII characters. And the fact that the model learned these three different features, and it took us a little while to figure out what was going on, is very Shoggoth-esque.
It has a denser representation of regions that are particularly relevant to predicting the next token. Yeah, because it's clearly doing something that humans wouldn't, right? You can even talk to any of the current models in base64 and it will reply in base64.
Right. And you can then like decode it and it works great.
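For readers wondering what "ASCII-decodable base64" even means, here is a tiny illustration in Python; the function name is just for this example:

```python
import base64
import string

def ascii_decodable(b64_text: str) -> bool:
    """True if the base64 string decodes to printable ASCII, the distinction the
    third base64 feature seemed to track."""
    try:
        raw = base64.b64decode(b64_text, validate=True)
    except Exception:
        return False
    return all(chr(byte) in string.printable for byte in raw)

print(ascii_decodable(base64.b64encode(b"hello world").decode()))           # True
print(ascii_decodable(base64.b64encode(bytes([200, 3, 255, 9])).decode()))  # False
```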
That particular example, I wonder if it implies that doing interpretability on smarter models will be harder. Because it required somebody with esoteric knowledge who just happened to see that base64 has, I don't know, whatever that distinction was.
Doesn't that imply that when you have the million-line pull request, there's no human that's going to be able to decode the two different reasons why... the two different features for this pull request?
Yeah, you know what I mean? Yeah. And that's when you type a comment, like small CLs, please.
Yeah, exactly. No, no, I mean, you could do that, right? This is like what I was going to say is like one technique here is anomaly detection.
And so one beauty of dictionary learning instead of like linear probes is that it's unsupervised. You are just trying to learn to span all of the representations that the model has and then interpret them later.
But if there's a weird feature that suddenly fires for the first time that you haven't seen fire before, that's a red flag. You could also coarse grain it so that it's just a single base 64 feature.
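A minimal sketch of that anomaly-detection idea, assuming you have already collected dictionary-feature activations on a trusted baseline corpus:

```python
import numpy as np

def novel_feature_alarm(baseline_features, new_features, threshold=0.0):
    """Flag dictionary features that fire on a new input but never fired on a
    trusted baseline corpus.

    baseline_features: (n_baseline_tokens, n_features) SAE activations on trusted data
    new_features:      (n_new_tokens, n_features) SAE activations on the input under review
    """
    seen_before = (baseline_features > threshold).any(axis=0)
    fires_now = (new_features > threshold).any(axis=0)
    return np.flatnonzero(fires_now & ~seen_before)   # indices of never-before-seen features
```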
I mean, even the fact that this came up, and we could see that it specifically favors these particular outputs and fires for these particular inputs, gets you a lot of the way there. I'm even familiar with cases from the auto-interp side where a human will look at a feature and try to annotate it as, it fires for Latin words.
And then when you ask the model to classify it, it says it fires for Latin words defining plants. So it can like already like beat the human in some cases for like labeling what's going on.
So at scale, this would require an adversarial thing between models, where you have, potentially, millions of features for GPT-6, and a bunch of models are just trying to figure out what each of these features means. How? Yeah.
You'd automate this process, right? This goes back to the determinism of the model. You could have a model that is actively editing input text and predicting if the feature is going to fire or not.
And figure out what makes it fire, what doesn't, and search the space. Yeah.
I want to talk more about feature splitting, because I think that's an interesting thing that has been under... yeah, especially for scalability, I think it's underappreciated, right? First of all, how do we even think about it? Is it really just that you can keep going down and down, like there's no end to the number of features? I mean, at some point I think you might just start fitting noise, or things that are part of the data but that the model isn't actually representing. By the way, do you want to explain what feature splitting is? Yeah, yeah.
So it's what I said before, where the model will learn however many features it has capacity for that still span the space of representations. So give an example, potentially.
Yeah, yeah. So if you don't give the model that much capacity for the features it's learning, concretely, if you don't project to as high a dimensional space, it'll learn one feature for birds.
But if you give the model more capacity, it will learn features for all the different types of birds. And so it's more specific than otherwise.
And oftentimes the bird vector points in one direction, and all the specific types of birds point in a similar region of the space, but are obviously more specific than the coarse label. Okay, so let's go back to GPT-7.
First of all, is this a sort of linear tax on any model? Even before that, is this a one-time thing you have to do? Or is this the kind of thing you have to do on every output, or just one time, it's not deceptive, we're good to roll? Yeah, let me literally answer that. So you do dictionary learning after you've trained your model: you feed it a ton of inputs, you get the activations from those, and then you do this projection into the higher-dimensional space. And the method is unsupervised, in that it's trying to learn these sparse features. You're not telling it in advance what they should be, but it is constrained by the inputs you're giving the model.
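As a rough picture of the mechanics being described, an unsupervised, sparse projection of collected activations into a wider feature space, here is a toy sparse autoencoder sketch in PyTorch. The sizes and the L1 coefficient are illustrative, not anyone's actual training setup:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy dictionary-learning setup: project d_model activations into an
    overcomplete feature basis, with an L1 penalty to keep features sparse."""
    def __init__(self, d_model=1000, expansion=2):
        super().__init__()
        n_features = d_model * expansion               # e.g. 1,000 neurons -> 2,000 features
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=False)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))      # sparse, non-negative feature activations
        recon = self.decoder(features)
        return recon, features

def train_step(sae, acts, optimizer, l1_coeff=1e-3):
    recon, features = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```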
I guess two caveats here. One, like we can try and choose what inputs we want.
So if we're looking for theory-of-mind features that might lead to deception, we can put in the sycophancy dataset. Hopefully at some point we can move to looking at the weights of the model alone, or at least using that information to do dictionary learning.
But I think in order to get there, that's like such a hard problem that you need to make traction on just learning what the features are first. But yeah, so what's the cost of this? Can you repeat the last sentence? Weights of the model alone.
So like right now we just have these neurons in the model. They don't make any sense.
We apply dictionary learning. We get these features out.
They start to make sense. But that depends on the activations of the neurons.
The weights of the model itself, like what neurons are connected to what other neurons, certainly has information in it. And the dream is that we can kind of bootstrap towards actually making sense of the weights of the model that are independent of the activations of the data.
I mean, I'm not saying we've made any progress here. It's a very hard problem, but it feels like we'll have a lot more traction, and be able to sanity-check what we're finding with the weights, if we're able to pull out features first. For the audience: weights are permanent, well, I don't know if permanent is the right word, but they are the model itself, whereas activations are the sort of artifacts of any single call. Yes. In a brain metaphor, you know, the weights are the actual connection scheme between neurons, and the activations are the neurons that are currently lighting up.
Yeah. Yeah.
Yeah. Yeah.
Okay. So there's going to be two steps to this for GPT-7 or whatever model we're concerned about.
One, actually, first, correct me if I'm wrong, but training the sparse autoencoder and doing the unsupervised projection into a wider space of features that have a higher fidelity to what is actually happening in the model; and then, secondly, labeling those features somehow. Because, let's say the cost of training the model is N, what will those two steps cost relative to N? We will see. It really depends on two main things: what is your expansion factor, like how much are you projecting into the higher-dimensional space? And how much data do you need to put into the model, how many activations do you need to give it? But this brings me back to feature splitting, to a certain extent. Because if you know you're looking for specific features, you can start with a really cheap, coarse representation.
So maybe my expansion factor is only two. So I have 1,000 neurons and I'm projecting to a 2,000-dimensional space.
I get 2000 features out, but they're really coarse. And so previously I had the example for birds.
Let's move that example to, say, I have a biology feature, but what I really care about is whether the model has representations for bioweapons and is trying to manufacture them. And so what I actually want is an anthrax feature.
What you can then do is rather than, and let's say the anthrax, you only see the anthrax feature if instead of going from a thousand dimensions to 2000 dimensions, I go to a million dimensions, right? And so you can kind of imagine this big tree of semantic concepts where like biology splits into like cells versus like whole body biology. And then further down, it splits into all these other things.
So rather than needing to immediately go from a thousand to a million, and then picking out that one feature of interest, you can find the direction that the biology feature is pointing in, which again is very coarse, and then selectively search around that space. So like only do dictionary learning if something in the direction of the biology feature fires first.
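Here is a sketch of that selective, coarse-to-fine search; the helper names are hypothetical and the cosine threshold is arbitrary:

```python
import numpy as np

def selective_fine_features(acts, coarse_direction, fine_encode, cos_threshold=0.3):
    """Only run the expensive fine-grained dictionary on tokens whose activations
    already point near a coarse feature direction (e.g. 'biology').

    acts:             (n_tokens, d_model) activations
    coarse_direction: (d_model,) unit vector for the coarse feature
    fine_encode:      callable mapping (k, d_model) -> (k, n_fine_features)
    """
    normed = acts / np.linalg.norm(acts, axis=1, keepdims=True)
    cos = normed @ coarse_direction          # how "biology-ish" each token is
    mask = cos > cos_threshold
    if not mask.any():
        return mask, None                    # nothing near biology; skip the fine pass
    return mask, fine_encode(acts[mask])     # fine features: anthrax vs. cells vs. ...
```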
And so the computer science metaphor here would be: instead of doing breadth-first search, you're able to do depth-first search, where you're only recursively expanding and exploring a particular part of this semantic tree of features? Although, given the way that these features are not organized in things that are intuitive for humans, right? Because we just don't have to deal with base64, so we don't dedicate that much, whatever, firmware to deconstructing which kind of base it is.
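Since this exchange describes the whole coarse-to-fine pipeline in words, here is a minimal, illustrative PyTorch sketch of the idea. It is a toy under stated assumptions, not Anthropic's actual code; the names (`SparseAutoencoder`, `biology_idx`, `selective_fine_features`) are mine.

```python
# Toy sketch of coarse-to-fine dictionary learning (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Project activations into a wider feature space, then reconstruct them."""
    def __init__(self, d_model: int, expansion_factor: int):
        super().__init__()
        d_features = d_model * expansion_factor
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts):
        features = F.relu(self.encoder(acts))   # sparse, higher-dimensional code
        recon = self.decoder(features)          # project back down
        return features, recon

def sae_loss(acts, features, recon, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes the feature code to be sparse.
    return F.mse_loss(recon, acts) + l1_coeff * features.abs().sum(dim=-1).mean()

# Stage 1: a cheap, coarse dictionary. Expansion factor 2: 1,000 neurons -> 2,000 features.
coarse_sae = SparseAutoencoder(d_model=1000, expansion_factor=2)

# Stage 2: only where the coarse "biology" feature fires do we pay for a much wider
# dictionary (say 1,000 -> 1,000,000 features) that could split out something like anthrax.
def selective_fine_features(acts, coarse_sae, fine_sae, biology_idx, threshold=0.0):
    coarse_features, _ = coarse_sae(acts)
    mask = coarse_features[:, biology_idx] > threshold   # expand only this branch of the tree
    if not mask.any():
        return None
    fine_features, _ = fine_sae(acts[mask])
    return fine_features
```

The two stages are exactly the depth-first search being described: you only pay for the million-feature dictionary on the slice of data where the coarse branch you care about is active.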
How would we know that? And this will go back to maybe the MoE discussion we'll have. I guess we might as well talk about it. In mixture of experts, the Mixtral paper talked about how they couldn't find that the experts were specialized in a way that we could understand. There's not like a chemistry expert or a physics expert or something. So why would you think that it'll be like a biology feature that you then deconstruct, rather than like blah, and then you deconstruct it and it's like anthrax and your shoes and whatever?
So I haven't read the Mixtral paper, but I think that the experts, I mean, this goes back to: if you just look at the neurons in a model, they're polysemantic. And so if all they did was just look at the neurons in a given expert, it's very plausible that it's also polysemantic because of superposition.
I want to just tug on the thread that Dwarkesh mentioned there. Have you seen, in the subtrees, when you expand them out, something in a subtree which you really wouldn't guess should be there based on the higher-level abstraction?
So this is a line of work that we haven't pursued as much as I want to yet. But I think we're planning to, I hope that maybe external groups do as well.
Like, what is the geometry of features?
What's the geometry?
Exactly. And how does that change over time? It would really suck if the anthrax feature happened to be below, you know, the coffee can feature in some tree or something like that.
Totally. And that feels like the kind of thing that you could quickly try and find like proof of, which would then like mean that you need to then solve that problem.
Yeah, yeah. And inject more structure into the geometry.
Totally. I mean, it would really surprise me, I guess, especially given how linear the models seem to be.
Completely agree. That there isn't some component of the anthrax feature, like vector, that is similar to and looks like the biology vector and that they're not in a similar part of the space.
But yes, I mean, ultimately, machine learning is empirical. We need to do this.
I think it's going to be pretty important for certain aspects of scaling dictionary learning. Yeah.
Interesting. On the MoE discussion, there's an interesting scaling vision transformers paper that Google put out a little while ago where they do ImageNet classification with an MoE.
And they find really clear class specialization there for experts. There's a clear dog expert.
Wait, so the Mixtral people just did not do a good job of identifying dogs?
I think it's hard. And it's entirely possible that... In some respects, there's almost no reason that all of the different arXiv features should go to one expert. You could have biology... Let's say, I don't know what buckets they had in their paper, but let's say they had arXiv papers as one of the things.
You could imagine biology papers going here, math papers going here, and all of a sudden your breakdown is ruined. But that vision transformer one, where the class separation is really clear and obvious, gives, I think, evidence towards the specialization hypothesis.
I think images are also in some ways just easier to interpret than text.
Yeah, exactly. And so Chris Olah's interpretability work on AlexNet and these other models... In the original AlexNet paper, they actually split the model onto two GPUs just because they couldn't fit it; GPUs were so limited back then, relatively speaking, right? Still great at the time.
That was one of the big innovations of the paper. But they find branch specialization.
And there's a Distill Pub article on this where colors go to one GPU, and Gabor filters and line detectors go to the other.
Really? Yeah. Yeah.
And then all of the other interpretability work that was done, like the floppy ear detector, right? That just was a neuron in the model that you could make sense of. You didn't need to disentangle superposition, right? So: different dataset, different modality.
I think a wonderful research project to do, if someone out there is listening to this, would be to take some of the techniques that Trenton's team has worked on and try to disentangle the neurons in the Mixtral model, which is open source. I think that's a fantastic thing to do, because it feels intuitively like there should be.
They didn't demonstrate any evidence that there is. There's also, like in general, a lot of evidence that there should be specialization.
Go and see if you can find it.
And that's work that Anthropic has published most of its stuff on, as I understand it: dense models, basically. That is a wonderful research project to try.
And given Dwarkesh's success with the Vesuvius Challenge, we should be pitching more projects, because they will be solved if we talk about them on the podcast.
What I was thinking about after the Vesuvius Challenge was, wait, I knew about it. Nat had told me about it before it dropped, because we recorded the episode before it dropped. Why did I not even try? You know what I mean? I don't know. Luke is obviously very smart, and, yeah, he's an...
But like, he showed that like a 21-year-old on like some 1070 or whatever he was working on could do this. I don't know.
Like, I feel like I should have. So, you know, before this episode drops, I'm going to try to make an interpretability... no, no, no, I'm not going to try to make a research breakthrough. I was honestly thinking back on it. It's like, wait, I should have.
You've got to get your hands dirty.
Dwarkesh's request for research.
Oh, I want to hark back to this, the neuron thing. You said, I think a bunch of your papers have said, there's more features than there are neurons.
And this is just like, wait a second. A neuron is: weights go in and a number comes out. You know what I mean? That's so little information. Do you mean there's like street names and species and whatever, there's more of those kinds of things than there are "a number comes out" in a model?
That's right, yeah.
But "a number comes out" is so little information. How is that encoding for...?
Superposition. You're just encoding a ton of features in these high-dimensional vectors.
In a brain, is there like an axon firing, or however you think about it? I don't know how you think about how much superposition there is in the human brain.
Yeah, so Bruno Olshausen, who I think of as the leading expert on this, thinks that all the brain regions you don't hear about are doing a ton of computation in superposition.
So everyone talks about V1 as having Gabor filters and detecting lines of various sorts. And no one talks about V2.
And I think it's because we just haven't been able to make sense of it. What is V2? It's like the next part of the visual processing stream.
And yeah, so I think it's very likely. And fundamentally, superposition seems to emerge when you have high dimensional data that is sparse.
And to the extent that you think the real world is that, which I would argue it is, we should expect the brain to also be underparameterized in trying to build a model of the world and also use superposition. You can get a good intuition for this, correct me if this example is wrong, in a 2D plane.
Let's say you have two axes, which represent a two-dimensional feature space here, like two neurons, basically. And you can imagine them each turning on to various degrees, right? That's like your x-coordinate and your y-coordinate. But if you now map this onto a plane, you can actually represent a lot of different things in different parts of the plane.
Oh, okay. So, crucially, superposition is not an artifact of a neuron. It is an artifact of the space that is created.
It's a combinatorial code. Yeah, exactly.
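To make the combinatorial-code intuition concrete, here is a small numpy toy of my own construction (not from the episode): more feature directions than neurons, where a sparse set of active features can still be read back out because random high-dimensional directions are nearly orthogonal.

```python
# Toy illustration of superposition as a combinatorial code (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_features = 200, 1000            # five times more features than neurons
directions = rng.normal(size=(n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)   # unit-length feature directions

# Turn on a sparse handful of features by summing their directions into one activation vector.
active = rng.choice(n_features, size=3, replace=False)
activation = directions[active].sum(axis=0)

# Reading out: the truly active features score far above the interference noise
# from the other ~997 directions, so the top scores almost always recover them.
scores = directions @ activation
print("active:", sorted(active.tolist()))
print("top scoring:", sorted(np.argsort(scores)[-3:].tolist()))
```

The point is that the "meaning" lives in directions of the whole vector space, not in individual neurons, which is why two neurons (or two hundred) can carry far more than two features as long as only a few are active at once.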
Okay, cool.
Yeah, thanks.
We kind of talked about this, but I think it's just kind of wild that, to the best of our knowledge, the way intelligence works in these models, and then presumably also in brains, is just that there's a stream of information going through that has quote-unquote features that are infinitely, or at least to a large extent, splittable.
And you can expand out a tree of like what this feature is. And what's really happening is a stream, like that feature is getting turned into this other feature or this other feature is added.
I don't know. That's not something I would have just thought that's what intelligence is.
You know what I mean?
It's a surprising thing.
It's not what I would have expected necessarily.
What did you think it was?
I don't know, man.
GOFAI.
GOFAI.
He's a GOFAI guy.
Well, actually, that's a great segue, because all of this feels like GOFAI.
You're using distributed representations, but you have features and you're applying these operations to the features. I mean, the whole field of vector symbolic architectures, which is this computational neuroscience thing, all you do is you put vectors in superposition, which is literally a summation of two high-dimensional vectors, and you create some interference.
But if it's high-dimensional enough, then you can still represent them. And you have variable binding, where you connect one vector to another.
And like, if you're dealing with binary vectors, it's just the XOR operation. So you have A, B, you bind them together.
And then if you query with A or B again, you get out the other one. And this is basically the like key value pairs from attention.
And with these two operations, you have a Turing-complete system where, if you have enough nested hierarchy, you can represent any data structure you want, et cetera, et cetera.
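Here is a tiny, hedged sketch of the two vector-symbolic-architecture operations being described, using binary hypervectors: bundling (superposition), binding via XOR, and querying a bound pair to recover the other member. The helper names are mine and this is a textbook-style toy, not any particular paper's code.

```python
# Toy vector-symbolic-architecture demo (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
D = 10_000                                   # high dimensionality keeps interference low

def rand_vec():                              # random binary hypervector
    return rng.integers(0, 2, size=D, dtype=np.uint8)

def bind(a, b):                              # XOR binding: bind(a, b) ^ a == b exactly
    return a ^ b

def bundle(*vs):                             # superposition via (biased) majority vote
    return (np.sum(vs, axis=0) > len(vs) / 2).astype(np.uint8)

def similarity(a, b):                        # 1.0 = identical, ~0.5 = unrelated
    return float(np.mean(a == b))

key_A, val_A = rand_vec(), rand_vec()
key_B, val_B = rand_vec(), rand_vec()
memory = bundle(bind(key_A, val_A), bind(key_B, val_B))   # key-value pairs held in superposition

retrieved = bind(memory, key_A)              # querying with key_A approximately recovers val_A
print(similarity(retrieved, val_A))          # ~0.75: clearly above chance
print(similarity(retrieved, val_B))          # ~0.5: indistinguishable from noise
```

The retrieved vector is noisy but much closer to the bound value than to anything else, which is the sense in which key-value attention and these binding operations rhyme with each other.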
Okay. Let's go back to the superintelligence. So walk me through GPT-7. You've got the sort of depth-first search on its features. Okay, GPT-7 has been trained. What happens next? Your research has succeeded. GPT-7 has been trained. What are we doing now?
We try and get it to do as much interpretability work and other safety work as possible.
No, but concretely, what has happened such that you're like, cool, let's deploy GPT-7?
Oh, geez. I mean, we have our responsible scaling policy, which has been really exciting to see other labs adopt.
But specifically from the perspective of your research. Like, Trenton, given your research, has GPT-7 gotten the thumbs up from you? Or actually, we should say Claude, whatever. What is the basis on which you're telling the team, hey, let's go ahead?
I mean, if it's as capable as GPT-7 implies here, I think we need to make a lot more interpretability progress to be able to comfortably give the green light to deploy it. I would be like, definitely not. I'd be crying. Maybe my tears would interfere with the GPUs.
Guys. Gemini 1.5, TPUs.
But given the way your research is progressing, what does it kind of look like to you? If this succeeded, what would it mean for us to okay GPT-7 based on your methodology?
I mean, ideally, we can find some compelling deception circuit which lights up when the model knows that it's not telling the full truth to you.
Why can't you just train a linear probe like Collin Burns did?
So the CCS work is not looking good in terms of replicating or actually finding truth directions.
And like in hindsight, it's like, well, why should it have worked so well? But linear probes, like you need to know what you're looking for. And it's like a high dimensional space and it's really easy to pick up on a direction that's just not.
Wait, but don't you also, here you need to label the features. So you still need to know.
Well, you just label them post hoc, but it's unsupervised. "Give me the features that explain your behavior" is the fundamental question, right? The actual setup is: we take the activations, we project them to this higher-dimensional space, and then we project them back down again.
So it's: reconstruct, or do the thing that you were originally doing, but do it in a way that's sparse. By the way, for the audience, a linear probe is where you just classify the activations.
From what I vaguely remember about the paper, if it's telling a lie, then you just train a classifier on: in the end, was it a lie, or was it just wrong, or something? I don't know, it was like true or false.
Yeah, it's a classifier on the activations. So yeah, for GPT-7, ideally we have some deception circuit that we've identified that appears to be really robust.
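For contrast with the unsupervised dictionary-learning setup, here is a minimal sketch of the linear-probe baseline just described. The activations and labels are random stand-ins and the names are mine; this is not the CCS method itself, just the supervised "classify the activations" idea.

```python
# Toy linear probe on activations (illustrative only; stand-in data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n = 512, 2000
activations = rng.normal(size=(n, d_model))   # stand-in for residual-stream activations
labels = rng.integers(0, 2, size=n)           # stand-in true/false labels you chose in advance

probe = LogisticRegression(max_iter=1000).fit(activations, labels)
truth_direction = probe.coef_[0]              # the single supervised direction the probe finds
```

The contrast being drawn in the conversation: the probe gives you one direction for one question you already knew to ask, whereas the sparse autoencoder gives you many directions you only label after the fact.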
And it's like... So you've done the projecting out to the million whatever features or something.
Is a circuit... Because maybe we're using feature and circuit interchangeably when they're not.
So is there like a deception... So I think there are features across layers that create a circuit.
Yeah. And hopefully the circuit gives you a lot more specificity and sensitivity than an individual feature.
And hopefully we can find a circuit that is really specific to the model being deceptive: the model deciding to be deceptive in cases that are malicious, right? I'm not interested in a case where it's just doing theory of mind to help you write a better email to your professor.
And I'm not even interested in cases where the model is necessarily just modeling the fact that deception has occurred.
But doesn't all this require you to have labels for all those examples? And if you have those labels, then whatever faults the linear probe has, like maybe you labeled the wrong thing or whatever, wouldn't the same thing apply to the labels you've come up with for the unsupervised features?
So in an ideal world, we could just train on the whole data distribution and then find the directions that matter. To the extent that we need to reluctantly narrow down the subset of data that we're looking over, just for the purposes of scalability, we would use data that looks like the data you'd use to fit a linear probe.
But again, with a linear probe you're also just finding one direction, whereas we're finding a bunch of directions here.
And the hope is you've found a bunch of things that light up when it's being deceptive, and then you can figure out why some of those things are lighting up in this part of the distribution and that other part, and so forth.
Totally, yeah.
Do you anticipate you'll really understand... I don't know, the current models you've studied are pretty basic, right? Do you think you'll be able to understand why GPT-7 fires in certain domains but not in other domains?
I'm optimistic. I guess one thing is, this is a bad time to answer this question, because we are explicitly investing in the longer term of ASL-4 models, which GPT-7 would be. But we split the team, where a third is focused on scaling up dictionary learning right now, and that's been great.
I mean, we publicly shared some of our eight layer results. We've scaled up quite a lot past that at this point.
But the other two groups, one is trying to identify circuits, and then the other is trying to get the same success for attention heads. So we're setting ourselves up and building the tools necessary to really find these circuits in a compelling way.
But it's going to take another, I don't know, six months before that's really working well. But I can say that I'm optimistic and we're making a lot of progress.
What is the highest-level feature you've found so far? In the language of The Symbolic Species, the book you recommended, there are indexical things, I forgot what all the labels were, but there are things where you just see a tiger and you run, a very behaviorist kind of thing. And then there's a higher level at which, when I refer to love, it refers to a movie scene, or my girlfriend, or whatever. You know what I mean? It's like the top of the tent. What is the highest-level association you've found?
I mean, probably one of the ones that we shared publicly in our update. I think there were some related to love, and some related to sudden changes in scene, particularly associated with wars being declared.
There are a few of them in that post, if you want to link to it. But even Bruno Olshausen had a paper back in 2018 or 2019 where they applied a similar technique to a BERT model and found that, as you go to deeper layers of the model, things become more abstract. I remember in the earlier layers there'd be a feature that would just fire for the word "park," but later on there was a feature that fired for Park as a last name, like Lincoln Park, and it's a common Korean last name as well. And then there was a separate feature that would fire for parks as grassy areas. So there's other work that points in this direction.
What do you think we'll learn about human psychology from the interpretability stuff?
Oh gosh. Okay, I'll give a specific example. I think one of the ways one of your updates put it was persona lock-in. I don't know if you remember Sydney Bing or whatever.
It locked into, I think, what was actually quite an endearing personality. I thought it was so funny.
I'm glad it's back in Co-Pilot. Oh, really? Oh, yeah, it's been misbehaving recently.
Actually, this is another sort of thread to explore. But there was a funny one where I think it was like a New York Times reporter.
It was nagging him or something. And it was like, you are nothing.
Nobody will ever believe you. You are insignificant and do whatever.
It was like the most gaslighting thing. It tried to convince him to break up.
Yeah. Okay, actually, so this is an interesting example.
I don't even know where I was going with this, but whatever, maybe I've got another thread. The other thread I want to go on is personas, right? Is that a feature? That Sydney Bing having this personality is a feature, versus another personality it can get locked into? And also, is that fundamentally what humans are like too, where, in front of different people, I'm a different sort of personality or whatever?
Is that the same kind of thing that's happening to ChatGPT when it gets RLHF'd? I don't know, a whole cluster of questions; you can answer whichever of them you want.
Yeah. I really want to do more work.
I guess the sleeper agents work is in this direction of what happens to a model when you fine-tune it, when you RLHF it, these sorts of things. I mean, maybe it's trite, but you could just say you conclude that people contain multitudes, right? Insomuch as they have lots of different features. There's even the stuff related to the Waluigi effect of, in order to know what's good or bad, you need to understand both of those concepts.
And so we might have to have models that are aware of violence and have been trained on it in order to recognize it.
Can you post hoc identify those features and ablate them, in a way where maybe your model is slightly naive, but you know that it's not going to be really evil?
Totally. That's in our toolkit, which seems great.
Oh really? So GPT-7, I don't know, pulls the same thing, and then you figure out why and ablate the relevant pathways or whatever. You modify them, and then the pathway, to you, looks like you just changed those.
But you were mentioning earlier, there's a bunch of redundancy in the model. Yeah, so you need to account for all that.
But we have a much better microscope into this now than we used to. Like sharper tools for making edits.
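As a hedged sketch of what "ablating a feature" could look like with a trained sparse autoencoder, here is one possible editing recipe, reusing the toy SparseAutoencoder from the earlier sketch. The function names and the specific recipe are my stand-ins, not Anthropic's published procedure.

```python
# Toy feature-ablation edit via a sparse autoencoder (illustrative only).
import torch

@torch.no_grad()
def ablate_features(acts, sae, feature_ids):
    """Return activations whose reconstruction has the chosen features zeroed out."""
    features, recon = sae(acts)                 # encode to features, decode back
    edited = features.clone()
    edited[:, feature_ids] = 0.0                # knock out, e.g., the violence-related features
    recon_edited = sae.decoder(edited)
    # Keep the part of the activation the SAE failed to reconstruct, so the edit
    # only touches what the dictionary actually explains.
    return acts - recon + recon_edited
```

This is the "sharper tools for making edits" idea in miniature: because the redundancy Sholto mentions means one feature rarely acts alone, a real edit would need to account for related features and then be checked behaviorally.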
And it seems like, at least from my perspective, that's one of the primary ways of, to some degree, confirming the safety or the reliability of a model: where you can say, okay, we found the circuits responsible, we've ablated them, and under a battery of tests we haven't been able to replicate the behavior which we intended to ablate.
And that feels like the sort of way of measuring model safety in future, as I would understand it. That's why I'm incredibly hopeful about their work, because to me it seems like a so much more precise tool than something like RLHF. With RLHF, you're very prey to the black swan thing; you don't know if it's going to do something wrong in a scenario that you haven't measured. Whereas here, at least you have somewhat more confidence that you can completely capture the behavior set, or the feature set, of the model.
Yes, although not necessarily that you've accurately labeled it.
Not necessarily, but with a far higher degree of confidence than any other approach that I've seen.
I mean, what are your unknown unknowns for superhuman models in terms of this kind of thing? Are the labels that are going to be given things on which we can determine, this thing is cool, this thing is a paperclip maximizer or whatever?
I mean, we'll see, right? The superhuman feature question is a very good one. I think we can attack it, but we're going to need to be persistent, and the real hope here is, I think, automated interpretability. And even having debate, right? You could have the debate setup where two different models are debating what the feature does, and then they can actually go in and make edits and see if it fires or not.
But it is just this wonderful like closed environment that we can iterate on really quickly. That makes me optimistic.
Do you worry about alignment succeeding too hard? If I think about it, I would not want either companies or governments, whoever ends up in charge of these AI systems, to have the level of fine-grained control that, if your agenda succeeds, we would have over AIs. Both for the ickiness of having this level of control over an autonomous mind, and second, I just don't fucking trust these guys, you know? I'm just kind of uncomfortable with, like, the loyalty feature is turned up, you know what I mean? How much worry do you have about having too much control over the AIs? And specifically not you, but whoever ends up in charge of these AI systems just being able to lock in whatever they want.
Yeah. I mean, I think it depends on what government exactly has control and what the moral alignment is there. But that whole value lock-in argument is in my mind. It's definitely one of the strongest contributing factors for why I am working on capabilities at the moment, for example.
I think the current player set actually is extremely well-intentioned. For this kind of problem, I think we need to be extremely open about it.
I think directions like publishing the constitution that you expect your model to abide by, and then trying to make sure you RLHF it towards that and ablate towards that, and having the ability for everyone to offer feedback and contribution to it, is really important.
Sure. Or, alternatively, don't deploy when you're not sure, which would also be bad, because then we just never catch it, right?
Yeah, exactly. I mean, paperclips.
Okay, some rapid fire. What is the bus factor for Gemini?
I think there are a number of people who are really, really critical, such that if you took them out, then the performance of the program would be dramatically impacted. This is both on the modeling side, making decisions about what to actually do, and, importantly, on the infrastructure side of things.
It's just that the stack of complexity builds, particularly when somewhere like Google has so much vertical integration. When you have people who are experts, they become quite important.
Yeah, although I think it's an interesting note about the field that people like you can get in, and in a year or so you're making important contributions. And especially in interp, many different labs have specialized in hiring total outsiders, physicists or whatever.
And you just like get them up to speed and they're making important contributions. I don't know.
I feel like you couldn't do this in like a bio lab or something. It's like an interesting note on the state of the field.
I mean, bus factor doesn't define how long it would take to recover from it, right?
Yeah. And deep learning research is an art, so you kind of learn how to read the loss curves or set the hyperparameters in ways that empirically seem to work well. It's also organizational things, like creating context. I think one of the most important and difficult skills to hire for is creating this bubble of context around you that makes other people around you more effective and know what the right problem to work on is. And that is a really tough thing to replicate.
Yes. Yeah, totally.
Who are you paying attention to now? There are a lot of things coming down the pike: multimodality, long contexts, maybe agents, extra reliability. Who is thinking well about what that implies?
It's a tough question.
I think a lot of people look internally these days for their sources of insight or progress. And we all have, obviously, the sort of research programs and directions that are intended over the next couple of years. And I suspect that most people, as far as betting on what the future will look like, refer to an internal narrative that is difficult to share. If it works well, it's probably not being published.
I mean, that was one of the things in the "Will scaling work?" post. I was referring to something you said to me, which is, you know, I miss the undergrad habit of just reading a bunch of papers. Yeah.
But now nothing worth reading is published. And the community is progressively getting more on track with what I think are the right and important directions.
You're watching it like an agent, eh? No, but I guess, like, it is tough. There used to be this, like, signal from big labs about, like, what would work at scale.
And it's currently really hard for academic research to find that signal. And I think getting really good problem taste about what actually matters to work on is really tough.
Unless you have, again, the feedback signal of what will work at scale and what is currently holding us back from scaling further or understanding our models further. This is something where I wish more academic research would go into fields like interp, which are legible from the outside. You know, Anthropic deliberately publishes all its research here, and it seems underappreciated, in the sense that I don't know why there aren't dozens of academic departments trying to follow Anthropic in the interp research. Because it seems like an incredibly impactful problem that doesn't require ridiculous resources, and it has all the flavor of deeply understanding the basic science of what is actually going on in these things. So I don't know why people focus on pushing model improvements as opposed to pushing understanding improvements, in the way that I would have typically associated with academic science.
Yeah, I do think the tide is changing there for whatever reason.
And Neel Nanda has had a ton of success promoting interpretability, in a way where Chris Olah hasn't been as active recently in pushing things. Maybe because Neel's just doing quite a lot of the work. But, I don't know, four or five years ago he was really pushing and talking at all sorts of places, and people weren't anywhere near as receptive. Maybe they've just woken up to the fact that deep learning matters and is clearly useful, post-ChatGPT. But yeah, it is kind of striking.
All right, cool. Okay, I'm trying to think what is a good last question. I mean, the one I'm thinking of is: do you think models enjoy next-token prediction? We have this sense of things that are rewarded in our ancestral environment.
There's this deep sense of fulfillment that we think we're supposed to get from them, or often people do, right? Like community, or sugar, or, you know, whatever we wanted on the African savannah. Do you think that in the future, models are trained with RL and everything, a lot of post-training on top or whatever, but, in the way we still just really like ice cream, they'll just be like, ah, to predict the next token again, you know what I mean? Like in the good old days.
So there's this ongoing discussion of like, are models sentient or not? And like, do you thank the model when it helps you? Yeah. But I think if you want to thank it, you actually shouldn't say thank you.
You should just give it a sequence that's very easy to predict. And the even funnier part of this is there's some work on if you just give it the sequence A, like, over and over again, then eventually the model will just start spewing out all sorts of things that it otherwise wouldn't ever say.
And so, yeah, I won't say anything more about that, but, yeah, you should just give your model something very easy to predict as a nice little treat. This is what the hedonium ends up being.
We just have the universe and like... But do we like things that are like easy to predict? Aren't we constantly in search of like the dose of...
The bits of entropy. Yeah, the bits of entropy.
Exactly, right? Shouldn't you be giving it things which are just slightly too hard to predict, just out of reach?
Yeah, but I wonder, at least from the free energy principle perspective, right, you don't want to be surprised. And so maybe it's this: I don't feel surprised, I feel in control of my environment, and so now I can go and seek things. And I've been predisposed to, like, in the long run it's better to explore new things right now, leave the rock that I've been sheltered under, ultimately leading me to build a house or some better structure. But we don't like surprises. I think most people are very upset when expectation does not meet reality. That's why babies love watching the same show over and over and over again, right?
Yeah. Interesting. Yeah, I can see that.
Oh, I guess they're learning to model it and stuff too. Yeah.
Yeah. Okay, well, hopefully this will be the repeat that the AI has learned to love.
Okay, cool. I think that's a great place to laugh.
I should also mention that the better part of what I know about AI, I've learned from just talking with you guys. We've been good friends for about a year now. I appreciate you guys getting me up to speed here.
You guys have great questions. It's really fun to hang and chat. I really treasure that time together.
You're getting a lot better at pickleball.
Hey, we're trying to progress to the tennis.
Awesome. Cool. Cool, thanks.
Hey everybody, I hope you enjoyed that episode. As always, the most helpful thing you can do is to share the podcast. Send it to people you think might enjoy it. Put it on Twitter, your group chats, etc. Just blitz the world.
Appreciate you listening. I'll see you next time.
Cheers.