Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind
Had so much fun chatting with my good friends Trenton Bricken and Sholto Douglas on the podcast.
No way to summarize it, except:
This is the best context dump out there on how LLMs are trained, what capabilities they're likely to soon have, and what exactly is going on inside them.
You would be shocked how much of what I know about this field, I've learned just from talking with them.
To the extent that you've enjoyed my other AI interviews, now you know why.
So excited to put this out. Enjoy! I certainly did :)
Watch on YouTube. Listen on Apple Podcasts, Spotify, or any other podcast platform.
There's a transcript with links to all the papers the boys were throwing down - may help you follow along.
Follow Trenton and Sholto on Twitter.
Timestamps
(00:00:00) - Long contexts
(00:16:12) - Intelligence is just associations
(00:32:35) - Intelligence explosion & great researchers
(01:06:52) - Superposition & secret communication
(01:22:34) - Agents & true reasoning
(01:34:40) - How Sholto & Trenton got into AI research
(02:07:16) - Are feature spaces the wrong way to think about intelligence?
(02:21:12) - Will interp actually work on superhuman models
(02:45:05) - Sholto’s technical challenge for the audience
(03:03:57) - Rapid fire
Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe
Press play and read along
Transcript
Speaker 1 Okay, today I have the pleasure to talk with two of my good friends, Sholto and Trenton. Sholto.
Speaker 1 I wasn't gonna say anything.
Speaker 1 Let's do this in reverse.
Speaker 1 Yeah, I'm gonna go at 1.5.
Speaker 1 The context length thing is wow.
Speaker 1 Shit. Anyways,
Speaker 1 Sholto, Noam Brown.
Speaker 1 Noam Brown, the guy who wrote the diplomacy paper, he said this about Sholto.
Speaker 1 He said, he's only been in the field for 1.5 years, but people in AI know that he was one of the most important people behind Gemini's success.
Speaker 1 And Trenton, who's at Anthropic, works on mechanistic interpretability. And it was widely reported that he has solved alignment.
Speaker 1 At least according to one random guy on Twitter.
Speaker 1 So this will be a capabilities only podcast. Alignment is already solved, so no need to discuss further.
Speaker 1 Okay, so let's start by talking about context lengths. Yep.
Speaker 1 It seemed to be underhyped given how important it seems to me to be that you can just put a million tokens into context.
Speaker 1 There's apparently some other news that got pushed to the front for some reason. But
Speaker 1 yeah,
Speaker 1 tell me about how you see the future of long context lengths and what that implies for these models.
Speaker 1 Yeah, so I think it's really underhyped because until I started working on it, I didn't really appreciate how much of a step up in intelligence it was for the model to have the onboarding problem basically instantly solved.
Speaker 1 And you can see that a little bit in the perplexity graphs in the paper, where just throwing millions of tokens worth of context about a code base allows it to become dramatically better at predicting the next token in a way that you'd normally associate with huge increments in model scale.
Speaker 1 But you don't need that. All you need is like a new context.
Speaker 1 So underhyped and buried by some other news. In context, are they as sample efficient and smart as humans?
Speaker 1 I think that's really worth exploring. Because, for example, one of the evals that we did in the paper
Speaker 1 has it learning a language in context better than a human expert could learn that new language over the course of a couple months.
Speaker 1 And this is only like a pretty small demonstration, but I'd be really interested to see things like Atari games or something like that, where you throw in a couple hundred or thousand frames, labeled actions, and then in the same way that you'd show your friend how to play a game and see if it's able to reason through.
Speaker 1 It might, at the moment, you know, with the infrastructure and stuff, it's still a little bit slow at doing that. But I would actually,
Speaker 1 I would guess that might just work out of the box in a way that would be pretty mind-blowing. And crucially, I think this language was esoteric enough that it wasn't in the training data.
Speaker 1
Right, exactly. Yeah.
If you look at the model before it has that context thrown in, it doesn't know the language at all and it can't get any translations.
Speaker 1
And this is like an actual human language, not just... Yeah, exactly.
An actual human language. So if this is true, it seems to me that these models are already, in an important sense, superhuman.
Speaker 1 Not in the sense that they're smarter than us, but I can't keep a million tokens in my context when I'm trying to solve a problem, remembering and integrating all the information in an entire code base.
Speaker 1 Am I wrong in thinking this is like a huge unlock? I actually genuinely think that's true.
Speaker 1 Like previously, I've been frustrated when models aren't as smart. Like you ask them a question and you want it to be smarter than you or to know things that you don't.
Speaker 1 And this allows them to know things that you don't in a way that it just ingests a huge amount of information in a way you just can't.
Speaker 1 So yeah, it's extremely important.
Speaker 1 How do we explain in-context learning? Yeah. So there's a line of work I quite like that looks at in-context learning as
Speaker 1 basically very similar to gradient descent, where
Speaker 1 the attention operation can be viewed as gradient descent on the in-context data.
Speaker 1 That paper had some cool plots where it basically showed we take n steps of gradient descent, and that looks like n layers of in-context learning, and it looks very similar.
Speaker 1 So I think that's one way of viewing it and trying to understand what's going on. Yeah.
Speaker 1 And you can ignore what I'm about to say because given the introduction, alignment is solved and safety isn't a problem. But I think the context stuff does get problematic, but also interesting here.
Speaker 1 I think there'll be more work coming out in the not too distant future
Speaker 1 around what happens if you give a hundred-shot prompt for jailbreaks or adversarial attacks.
Speaker 1 It's also interesting in the sense of if your model is doing gradient descent and learning on the fly,
Speaker 1 even if it's been trained to be harmless,
Speaker 1 you're dealing with a totally new model in a way. You're like fine-tuning, but in a way where you can't control what's going on.
Speaker 1 Can you explain what you mean by gradient descent happening in the forward pass with attention? Yeah,
Speaker 1
no, no, no. There was something in the paper about trying to teach the model to do linear regression.
Right. But just through the number of samples they gave in the context.
Speaker 1 And you can see, if you plot on the x-axis, like number of shots that it has or examples, and then the loss it gets on just ordinary least squares regression that will go down with time.
Speaker 1 And it goes down exactly matched with the number of gradient descent steps. Yeah, exactly.
Speaker 1 Okay.
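As a minimal sketch of the kind of comparison that line of work draws (the task, sizes, and learning rate here are made up for illustration, not taken from the paper): explicit gradient descent on an in-context least-squares problem, tracking loss per step, which is the curve the in-context learner gets matched against.

```python
import numpy as np

# Toy in-context regression task: y = X @ w_true + noise.
rng = np.random.default_rng(0)
n_examples, dim = 64, 8
X = rng.normal(size=(n_examples, dim))
w_true = rng.normal(size=dim)
y = X @ w_true + 0.1 * rng.normal(size=n_examples)

# Explicit gradient descent on the least-squares loss.
w = np.zeros(dim)
lr = 0.1
for step in range(1, 13):
    grad = X.T @ (X @ w - y) / n_examples      # gradient of 0.5 * mean squared error
    w -= lr * grad
    loss = 0.5 * np.mean((X @ w - y) ** 2)
    print(f"gradient descent steps = {step:2d}, loss = {loss:.4f}")

# The papers being described plot a curve like this against "loss vs. number of
# in-context examples (or layers)" for a trained transformer, and the two track closely.
```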
Speaker 1 I only read the intro and discussion sections of that paper, but in the discussion, the way they framed it is that
Speaker 1 in order to get better at long context
Speaker 1 tasks, the model has to get better at learning to learn from these examples or from the context that is already within the window. And the implication of that is
Speaker 1 the model learned, if like meta-learning happens because it has to learn how to get better at long context tasks, then in some important sense,
Speaker 1 the task of intelligence requires long context examples and long context training.
Speaker 1 Like, you have to induce meta-learning. Understanding how to better induce meta-learning in the pre-training process is a very important thing for actually getting flexible or adaptive intelligence.
Speaker 1 Right, but you can proxy for that just by getting better at doing long context tasks.
Speaker 1 One of the bottlenecks for AI progress that many people identify is the inability of these models to perform
Speaker 1 tasks on long horizons, which means engaging with the task for many hours or even many weeks or months, where, like, if I have, I don't know, an assistant or an employee or something, they can just go do a thing I tell them to for a while.
Speaker 1 And AI agents haven't taken off for this reason from what I understand.
Speaker 1 So how linked are long context windows and the ability to perform well on them and the ability to do these kinds of long horizon tasks that require you to engage with
Speaker 1 an assignment for many hours, or are these unrelated concepts?
Speaker 1 I mean, I would actually take issue with that being the reason that agents haven't taken off, where I think that's more about like nines of reliability and the model actually successfully doing things.
Speaker 1 If you just can't chain tasks successfully with high enough probability, then you won't get something that looks like an agent.
Speaker 1 And that's why something like an agent might follow more of a step function in sort of usefulness. Like GPT-4 class models, Gemini Ultra class models, they're not enough.
Speaker 1 But maybe like the next increment on the model scale means that you get that extra nine, even though the loss isn't going down that dramatically, and that small amount of extra ability gives you the extra reliability.
Speaker 1 And yeah, obviously you need some amount of context to fit long horizon tasks, but I don't think that's been the limiting factor up to now.
Speaker 1 Yeah.
Speaker 1 The NeurIPS best paper this year, with Rylan Schaeffer as the lead author, points to this as the mirage of emergence, where people will have a task and you get the right or wrong answer depending on if you've sampled the last five tokens correctly.
Speaker 1 And so naturally, you're multiplying the probability of sampling all of those.
Speaker 1 And if you don't have enough nines of reliability, then you're not going to get emergence. And all of a sudden, you do.
Speaker 1 And it's like, oh my gosh, this ability is emergent when actually it was kind of almost there to begin with. And there are ways that you can find like a smooth metric for that.
Speaker 1 Yeah, HumanEval or whatever, in the GPT-4 paper, the coding problems, they measure. Log pass rate, right? Exactly.
Speaker 1 For the audience, the context on this is
Speaker 1 it's basically the idea is you want to,
Speaker 1 when you're measuring how much progress there has been on a specific task, like solving coding problems,
Speaker 1 you upweight it when it gets it right only one in a thousand times. You don't just give it a one-in-a-thousand score; you give it credit because it got it right some of the time.
Speaker 1 And so the curve you see is like it gets it right one in a thousand, then one in a hundred, then one in ten, and so forth.
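As a rough illustration of the smooth-metric point (the GPT-4 report uses a mean-log-pass-rate style measure on coding problems; the formulation below is a simplified stand-in, not the paper's exact metric):

```python
import math

def binary_solved(pass_rate: float, threshold: float = 0.5) -> float:
    """Discontinuous metric: a problem only counts once the model usually gets it right.
    Ability looks 'emergent' the moment pass_rate crosses the threshold."""
    return 1.0 if pass_rate >= threshold else 0.0

def log_pass_rate(pass_rate: float, floor: float = 1e-4) -> float:
    """Smooth metric: partial credit for getting it right one time in a thousand,
    so progress from 1/1000 to 1/100 to 1/10 shows up as a steady trend."""
    return math.log(max(pass_rate, floor))

for p in (0.0005, 0.001, 0.01, 0.1, 0.5, 0.9):
    print(f"pass rate {p:<6}  binary: {binary_solved(p):.0f}  log pass rate: {log_pass_rate(p):6.2f}")
```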
Speaker 1
So, actually, I want to follow up on this. So, if your claim is that the AI agents haven't taken off because of reliability rather than long horizon task performance.
Isn't the
Speaker 1 lack of reliability when a task is chained on top of another task, on top of another task, isn't that exactly the difficulty with long horizon tasks?
Speaker 1 Is it that you have to do 10 things in a row, or 100 things in a row, and the reliability of any one of them diminishes,
Speaker 1 or yeah, the probability goes down from 99.99 to 99.9, then like the whole thing gets multiplied together and the whole thing becomes much less likely to happen.
Speaker 1 That is exactly the problem, but the key issue you're pointing at there is that your base per-task solve rate is 90%.
Speaker 1 And if it was 99%, then chaining them doesn't become a problem.
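The arithmetic behind that point, as a small sketch (assuming independent steps, which is itself a simplification):

```python
def chain_success(per_step_reliability: float, n_steps: int) -> float:
    # If each sub-task succeeds independently, the chain succeeds only if every step does.
    return per_step_reliability ** n_steps

for p in (0.90, 0.99, 0.999):
    print(f"per-step {p:.3f}: 10 steps -> {chain_success(p, 10):.3f}, "
          f"100 steps -> {chain_success(p, 100):.3f}")
# per-step 0.900: 10 steps -> 0.349, 100 steps -> 0.000
# per-step 0.990: 10 steps -> 0.904, 100 steps -> 0.366
# per-step 0.999: 10 steps -> 0.990, 100 steps -> 0.905
```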
Speaker 1
But also just like a second. Yeah, exactly.
And I think this is also something that just hasn't been properly studied enough.
Speaker 1 If you look at all of the evals that are commonly, like the academic evals are a single problem, right?
Speaker 1 You know, like MATH, it's one typical math problem, or MMLU. It's like one university level
Speaker 1 problem from across different topics.
Speaker 1 You are beginning to start to see evals looking at this properly via more complex tasks like SWE-bench, where they take a whole bunch of GitHub issues.
Speaker 1 And that is like a reasonably long horizon task, but it's still sub-hour, as opposed to a multi-hour or multi-day task.
Speaker 1 And so I think one of the things that will be really important to do over the next, however long, is understand better what does success rate over long horizon tasks look like.
Speaker 1 And I think that's even important to understand what the economic impact of these models might be and like actually properly judge increasing capabilities by like cutting down the tasks that we do and the inputs and outputs involved into minutes or hours or days, and seeing how good it is at successively chaining and completing tasks at those different resolutions of time.
Speaker 1 Because then that tells you how automatable the job family or task family is in a way that MMLU scores don't.
Speaker 1 I mean, it was less than a year ago that we introduced 100K context windows. And I think everyone was pretty surprised by that.
Speaker 1
So yeah, everyone just kind of had this soundbite of quadratic attention costs. And we can't have long context windows.
And
Speaker 1 here we are. So, yeah, like the benchmarks are being actively made.
Speaker 1 Wait, wait. So, doesn't the fact that there are these companies, Google and, I don't know, Magic, maybe others, who have million-token attention imply that the quadratic... you shouldn't say anything.
Speaker 1
But doesn't that like imply that it's not quadratic anymore? Or are they just eating the cost? Interestingly. Like, who knows what Google is doing for its long context? Right.
I'm not making any.
Speaker 1 One of the things that's frustrated me about
Speaker 1 the general research field's approach to attention is that there's an important way in which the quadratic cost of attention is actually dominated in typical dense transformers by the MLP block.
Speaker 1 You have this n-squared term that's associated with attention, but you also have a d_model-squared term that's associated with the residual stream dimension of the model.
Speaker 1 And if you look, I think Sasha Rush has a great tweet where he basically plots the curve of the cost of attention relative to the cost of really large models.
Speaker 1 And attention actually trails off.
Speaker 1 And you actually need to be doing pretty long contexts before
Speaker 1 that term becomes really important.
Speaker 1 And the second thing is that people often talk about how attention at inference time is such a huge cost.
Speaker 1 And if you think about when you're actually generating tokens, the operation is not n squared.
Speaker 1 It is one set of Q vectors looking up a whole bunch of KV vectors, and that's linear with respect to the amount of context that the model has.
Speaker 1 And so I think this drives a lot of the recurrence and state space research where people have this meme of, oh, like linear attention and all this stuff.
Speaker 1 And as Trenton said, there's like a graveyard of ideas around attention.
Speaker 1 Not that it isn't worth exploring, but I think it's important to consider where the actual strengths and weaknesses of it are.
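A back-of-the-envelope version of that cost argument, with simplified FLOP counting and a hypothetical d_model; the exact crossover depends on the constants you assume, but the shape of the point (the quadratic term only dominates at fairly long context) comes through:

```python
def per_layer_flops(n_tokens: int, d_model: int, mlp_mult: int = 4) -> dict:
    """Very rough forward-pass FLOPs for one dense transformer layer
    (2 FLOPs per multiply-accumulate; heads, norms, vocab all ignored)."""
    qkv_out_proj = 8 * n_tokens * d_model ** 2             # Q, K, V and output projections
    attn_scores  = 4 * n_tokens ** 2 * d_model              # QK^T plus attention-weighted V
    mlp          = 4 * mlp_mult * n_tokens * d_model ** 2   # up- and down-projection
    return {"quadratic_in_n": attn_scores, "linear_in_n": qkv_out_proj + mlp}

d = 8192  # hypothetical residual stream width for a large dense model
for n in (2_048, 32_768, 131_072, 1_000_000):
    f = per_layer_flops(n, d)
    share = f["quadratic_in_n"] / (f["quadratic_in_n"] + f["linear_in_n"])
    print(f"context {n:>9,} tokens: quadratic attention is ~{share:.0%} of layer FLOPs")

# And at decode time each new token is a single query against the cached KVs,
# which is linear, not quadratic, in the context length.
```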
Speaker 1 Okay, so what do you make of this take?
Speaker 1 As we move forward through the takeoff, more and more of the learning happens in the forward pass. So originally, like all the learning happens in the backward pass,
Speaker 1 you know, during like this like bottom-up sort of hill climbing evolutionary process.
Speaker 1 If you think in the limit, during the intelligence explosion, it's just like the AI is maybe handwriting the weights or doing GOFAI or something.
Speaker 1
And we're in like the middle step where like a lot of learning happens in context now with these models. A lot of it happens within the backward process.
Does this seem like a meaningful
Speaker 1 gradient along which progress is happening?
Speaker 1 Because the broader thing being: if you're learning in the forward pass, it's much more sample efficient, because you can basically think as you're learning. Like when humans read a textbook, you're not just skimming it and trying to absorb which words follow which words. You read it, you think about it, then you read some more and think about it.
Speaker 1 I don't know, does this seem like a sensible way to think about the progress?
Speaker 1 yeah it may just be one of the ways in which like
Speaker 1 you know, birds and planes like fly, but they fly slightly differently. And like the virtue of technology allows us to do that, like
Speaker 1 basically accomplish things that birds can't.
Speaker 1 It might be that context length is similar in that it allows it to have a working memory that we can't, but functionally is not the key thing towards actual reasoning.
Speaker 1 The key step between GPT-2 and GPT-3 was that all of a sudden, there was this meta-learning behavior that was observed in training, like in the pre-training of the model.
Speaker 1 And that's, as you said, it's something to do with you give it some amount of context, it's able to adapt to that context. And that was a behavior that wasn't really observed before that at all.
Speaker 1 And maybe that's a mixture of a property of context and scale and this kind of stuff, but it wouldn't have occurred in a model with a tiny context, I would say.
Speaker 1 This is actually an interesting point. So, when we talk about scaling up these models, how much of it comes from just making the models themselves bigger?
Speaker 1 And how much comes from the fact that during any single call,
Speaker 1 you are using more compute? So, if you think of diffusion, you can just iteratively keep adding more compute. And if adaptive compute is solved, you can keep doing that.
Speaker 1 And in this case, if there's a quadratic penalty for attention, but you're doing long context anyways, then you're still dumping in more compute during, not during training, not during having bigger models, but just like, yeah.
Speaker 1 Yeah, it's interesting because you do get more forward passes by having more tokens. Right.
Speaker 1 My one gripe, I guess I have two gripes with this though, maybe three. So one, like
Speaker 1 in the AlphaFold paper,
Speaker 1 one of the transformer modules, they have a few and the architecture is very intricate.
Speaker 1 But they do, I think, five forward passes through it and will gradually refine their solution as a result.
Speaker 1 You can also kind of think of the residual stream.
Speaker 1 I mean, Sholto alluded to kind of the read-write operations as a poor man's adaptive compute, where it's like, I'm just going to give you all these layers. And if you want to use them, great.
Speaker 1 If you don't, then that's also fine.
Speaker 1 And then people will be like, oh, well, the brain is recurrent and you can do however many loops through it you want. And I think to a certain extent, that's right.
Speaker 1
If I ask you a hard question, you'll spend more time thinking about it. And that would correspond to more forward passes.
But
Speaker 1 I think there's a finite number of forward passes that you can do. It's kind of with language as well.
Speaker 1 People are like, oh, well, human language can have infinite recursion in it, like infinite nested statements of the boy jumped over the bear that was doing this, that had done this, that had done that.
Speaker 1 But empirically, you'll only see five to seven levels of recursion,
Speaker 1 which kind of relates to whatever that magic number of how many things you can hold in working memory at any given time is.
Speaker 1 And so, yeah, it's not infinitely recursive, but like, does that matter in the regime of human intelligence? And, like, can you not just add more layers?
Speaker 1 Break down for me, you were referring to this in some of your previous answers:
Speaker 1 listen, you have these long contexts, and you can hold more things in memory, but like ultimately comes down to your ability to mix concepts together, to do some kind of reasoning.
Speaker 1 And
Speaker 1 these models aren't necessarily human-level at that, even in context. Break down for me how you see storing just raw information versus reasoning and what's in between.
Speaker 1 Like, where is the reasoning happening?
Speaker 1 Where is this like storing raw information happening? What's different between them in these models?
Speaker 1 Yeah,
Speaker 1 I don't have a super crisp answer for you here.
Speaker 1 I mean, obviously, with the input and output of the model, you're mapping back to actual tokens, right? And then in between that, you're doing higher-level processing.
Speaker 1 Before we get deeper into this, we should explain to the audience, you referred earlier to Anthropic's way of thinking about transformers as these read-write operations that layers do.
Speaker 1 One of you should just kind of explain at a high level what you mean by that. So the residual stream, imagine you're in a boat going down a river.
Speaker 1 And the boat is kind of the current query where you're trying to predict the next token. So it's the cat sat on the blank.
Speaker 1 And
Speaker 1 then you have these little streams that are coming off the river where you can get extra passengers or collect extra information if you want. And those correspond to the attention heads and MLPs
Speaker 1 that are part of the model.
Speaker 1 I almost think of it as the working memory of the model, like the RAM of the computer where you're choosing what information to read.
Speaker 1 So you can do something with it, and then maybe you read something else in later on.
Speaker 1 And you can operate on subspaces of that high-dimensional vector.
Speaker 1 A ton of things are, I mean, at this point, I think it's almost given that are encoded in superposition.
Speaker 1 So it's like, yeah, the residual stream is just one high-dimensional vector, but actually, there's a ton of different vectors that are packed into it. Yeah.
Speaker 1 I might just dumb it down as a way that would have made sense to me a few months ago of, okay, so you have, you know, whatever words are in the input you put into the model.
Speaker 1
All those words get converted into these tokens, and those tokens get converted into these vectors. And basically, just like this small amount of information that's moving through the model.
And
Speaker 1 the way you explained it to me, Sholto, that this paper talks about is early on in the model, maybe it's just doing some very basic things about like, what do these tokens mean?
Speaker 1 Like if it says like 10 plus five, just like moving information about to
Speaker 1
have that good representation. Exactly, just represent.
And in the middle, maybe like the deeper thinking is happening about like how to think, yeah, how to solve this.
Speaker 1 At the end, you're converting it back into the output token because the end product is you're trying to predict the probability of the next token from the last of those residual streams.
Speaker 1 And so, yeah, it's interesting to think about like just like the small compressed amount of information moving through the model and it's like getting modified in different ways.
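A minimal structural sketch of that read-write picture (toy sizes, random untrained weights, structure only): each sublayer reads the residual stream and writes its output back in by addition, so the stream acts like the working memory being described.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 6, 32  # toy sizes

def layer_norm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def softmax(s):
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, Wq, Wk, Wv, Wo):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores += np.triu(np.full((len(x), len(x)), -1e9), k=1)  # causal mask: no peeking ahead
    return (softmax(scores) @ v) @ Wo

def mlp(x, W1, W2):
    return np.maximum(x @ W1, 0) @ W2

def block(resid, params):
    # Each sublayer reads the stream, computes something, and adds it back in.
    resid = resid + attention(layer_norm(resid), *params["attn"])
    resid = resid + mlp(layer_norm(resid), *params["mlp"])
    return resid

params = {
    "attn": [rng.normal(0, 0.02, (d_model, d_model)) for _ in range(4)],
    "mlp":  [rng.normal(0, 0.02, (d_model, 4 * d_model)),
             rng.normal(0, 0.02, (4 * d_model, d_model))],
}
resid = rng.normal(size=(n_tokens, d_model))  # token embeddings enter the stream
print(block(resid, params).shape)             # (6, 32): same stream, now carrying more
```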
Speaker 1
Trenton, it's interesting. You're one of the few people who have a background in neuroscience.
You can think about the analogies here
Speaker 1 to
Speaker 1 the brain. And in fact, one of our friends, talking about the paper you had in grad school on attention in the brain, said this is the only, or first,
Speaker 1 neural explanation of why attention works, whereas we have evidence for why CNNs, convolutional neural networks, work, based on the visual cortex or something.
Speaker 1 Yeah, I'm curious, do you think in the brain there's something like a residual stream of this compressed amount of information that's moving through and it's getting modified as you're thinking about something?
Speaker 1 Even if that's not what's literally happening, do you think that's a good metaphor for what's happening in the brain? Yeah, yeah.
Speaker 1 So at least in the cerebellum, you basically do have a residual stream where the whole, what we'll call the attention module for now, and I can go into whatever amount of detail you want for that,
Speaker 1 you have inputs that route through it, but they'll also just go directly to the
Speaker 1 end point that that module will contribute to. So there's a direct path and an indirect path.
Speaker 1 And so the model can pick up whatever information it wants and then add that back in.
Speaker 1 Well, what happens to the cerebellum?
Speaker 1 So the cerebellum nominally just does fine motor control.
Speaker 1 But I analogize this to the
Speaker 1 person who's lost their keys and is just looking under the streetlight, where it's very easy to observe this behavior.
Speaker 1 One leading cognitive neuroscientist said to me that a dirty little secret of any fMRI study where you're looking at brain activity for a given task is that the cerebellum is almost always active and lighting up for it.
Speaker 1 If you have a damaged cerebellum, you also are much more likely to have autism.
Speaker 1 So it's associated with like social skills. In one of these particular studies where I think they use PET instead of fMRI, but when you're doing next token prediction, the cerebellum lights up a lot.
Speaker 1 Also, 70% of your neurons in the brain are in the cerebellum. They're small.
Speaker 1 but they're there and they're taking up real metabolic cost.
Speaker 1 This was one of Gwern's points (he shared this article), that what changed with humans was not just that we have more neurons,
Speaker 1 but specifically, there's more neurons in the cerebral cortex and the cerebellum. And
Speaker 1 maybe we should say more about this, but
Speaker 1 they're metabolically expensive, and they're more involved in signaling and sending information back and forth. Yeah.
Speaker 1 Is that attention? What's going on? Yeah, yeah. So I guess the main thing I want to communicate here.
Speaker 1 So back in the 1980s, Pentti Kanerva came up with an associative memory algorithm for: I have a bunch of memories, I want to store them, there's some amount of noise or corruption that's going on, and I want to query or retrieve the best match.
Speaker 1 And so he writes this equation for how to do it, and a few years later realizes that if you implemented this as an electrical engineering circuit, it actually looks identical to the core cerebellar circuit.
Speaker 1 And that circuit and the cerebellum more broadly is not just in us, it's in basically every organism.
Speaker 1 There's active debate on whether or not cephalopods have it, they kind of have a different evolutionary trajectory.
Speaker 1 But even fruit flies with the Drosophila mushroom body, that is the same cerebellar architecture.
Speaker 1 And so that convergence, and then my paper, which shows that actually this operation is to a very close approximation the same as the attention operation, including implementing the softmax and having this sort of like nominal quadratic cost that we've been talking about.
Speaker 1 And so the three-way convergence here and the take-off and success of transformers
Speaker 1 seems pretty striking to me. Yeah.
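A small sketch of the correspondence being described: softmax-weighted retrieval over stored key-value pairs, which is the functional form of a single attention head and, per the paper discussed here, a close approximation to this kind of associative memory. The sizes and the sharpness parameter beta below are arbitrary choices for illustration.

```python
import numpy as np

def retrieve(query, keys, values, beta=8.0):
    # Softmax over similarities to every stored key, then a weighted blend of values:
    # the same shape of computation as one attention head.
    weights = np.exp(beta * (keys @ query))
    weights /= weights.sum()
    return weights @ values, weights

rng = np.random.default_rng(0)
d, n_memories = 64, 10
keys = rng.normal(size=(n_memories, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
values = rng.normal(size=(n_memories, d))

# Query with a noisy, corrupted version of memory 3: retrieval still lands on it.
noisy = keys[3] + 0.3 * rng.normal(size=d)
noisy /= np.linalg.norm(noisy)
_, weights = retrieve(noisy, keys, values)
print(f"softmax weight on the stored memory: {weights[3]:.3f}")  # close to 1.0
```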
Speaker 1 I want to zoom out and ask, I think what motivated this discussion in the beginning was we were talking about like, wait, what is the reasoning? What is the memory?
Speaker 1 What do you think about the analogy you found to attention and this?
Speaker 1 Do you think of this as more just looking up the relevant memories or the relevant facts? And if that is the case, like, where is the reasoning happening in the brain?
Speaker 1 How do we think about how that builds up into the reasoning? Yeah, so maybe my hot take here, I don't know how hot it is, is that
Speaker 1 most intelligence is pattern matching. And you can do a lot of really good pattern matching if you have a hierarchy of associative memories.
Speaker 1 You start with your very basic associations between just like objects in the real world.
Speaker 1 But you can then chain those and have more abstract associations, such as like a wedding ring symbolizes so many other associations that are downstream.
Speaker 1 And you can even generalize the attention operation and this associative memory to the MLP layers as well. It's just in a long-term setting where you don't have tokens in your current context.
Speaker 1 But I think this is an argument that
Speaker 1 association is all you need.
Speaker 1 And associative memory in general as well, it's not... So you can do two things with it.
Speaker 1 You can both denoise or retrieve a current memory. So, like, if I see your face, but it's like raining and cloudy,
Speaker 1 I can denoise and kind of like gradually update my query towards my memory of your face.
Speaker 1 But I can also access that memory, and then the value that I get out actually points to some other totally different part of the space.
Speaker 1 And so, a very simple instance of this would be if you learn the alphabet, right? And so, I query for A and it returns B, I query for B and it returns C, and you can traverse the whole thing.
Speaker 1 Yeah.
Speaker 1 Yeah.
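The alphabet-traversal point, reduced to its simplest possible form (an exact dict stands in for the fuzzy vector lookup sketched above): the retrieved value is itself the next query, so repeated retrieval walks a chain.

```python
# Each stored value points at the next key, so retrieval can be chained.
memory = {"A": "B", "B": "C", "C": "D", "D": "E"}

state, path = "A", ["A"]
while state in memory:
    state = memory[state]   # what comes back becomes the next query
    path.append(state)
print("".join(path))        # ABCDE
```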
Speaker 1 One of the things I talked to Demis about was he had a paper in 2008 that memory and imagination are very linked because of this very thing that you mentioned of memory is reconstructive.
Speaker 1 And so you are in some sense imagining every time you're thinking of a memory, because you're only storing a condensed version of it and you have to reconstruct the rest.
Speaker 1 And this is famously why human memory is terrible and like why people in the witness box or whatever will just make shit up.
Speaker 1 Okay, so
Speaker 1
let me ask a stupid question. So you like read Sherlock Holmes, right? And like the guy is incredibly sample efficient.
He'll see a few observations and he'll like
Speaker 1 basically figure out who committed the crime because there's a series of deductive steps that leads from somebody's tattoo and what's on the wall to the implications of that.
Speaker 1 How does that fit into this picture? Because
Speaker 1 crucially, what makes him smart is that there's not just an association, but a sort of deductive connection between different pieces of information.
Speaker 1 Would you just explain it as that that's just like higher-level association?
Speaker 1 I think so, yeah. So, so I think learning these higher-level associations to be able to then map patterns to each other as kind of like a meta-learning.
Speaker 1 I think in this case, he would also just have a really long context length or
Speaker 1 a really long working memory, right?
Speaker 1 Where he can like have all of these bits and continuously query them as he's coming up with whatever theory. So the theory is moving through the residual stream.
Speaker 1
His attention heads are querying his context. Right.
But then
Speaker 1 how he's projecting his query and keys in the space and how his MLPs are then retrieving longer-term facts or modifying that information is allowing him to then, in later layers, do even more sophisticated queries and slowly be able to reason through and come to a meaningful conclusion.
Speaker 1 That feels right to me. In terms of like, you're looking back in the past, you're selectively reading in certain pieces of information, comparing them.
Speaker 1 Maybe that informs your next step of what piece of information you now need to pull in.
Speaker 1 And then you build this representation, which progressively looks closer and closer and closer to the suspect in your case. Yeah.
Speaker 1 That doesn't feel at all outlandish. Do you know what, one lens on this, something I think that people who aren't doing this research can overlook, is that after your first layer of the model, every
Speaker 1 query key and value that you're using for attention comes from the combination of all the previous tokens. So like my first layer, I'll query my previous tokens and just extract information from them.
Speaker 1 But all of a sudden, let's say that I attended to tokens one, two, and four in equal amounts, then the vector in my residual stream, assuming that they wrote out the same thing to the value vectors, but ignore that for a second,
Speaker 1
is a third of each of those. And so when I'm querying in the future, my query is actually a third of each of those things.
And so. But they might be written to different subspaces.
That's right.
Speaker 1 Hypothetically, but they wouldn't have to.
Speaker 1 And so you can recombine and immediately, even by layer two and certainly by the deeper layers, just have like these very rich vectors that are packing in a ton of information.
Speaker 1 And the causal graph is like literally over every single layer that happened in the past. And that's what you're operating on.
Speaker 1
Yeah. It does bring to mind like a very funny eval to do would be like a Sherlock Holmes eval.
Let's see if you put the entire book into context.
Speaker 1 And then you have like a sentence, which is like, the suspect is X. Then you have like a log probability distribution over the different characters in the book.
Speaker 1 And then like as
Speaker 1 well or like
Speaker 1 that would be super cool.
Speaker 1 I wonder if you'd get anything at all. That would be cool.
Speaker 1 Sherlock Holmes is probably already in the training data.
Speaker 1 You've got to get like a mystery novel that was written in the...
Speaker 1
You can get an LLM to write it. Or we can like purposely exclude it, right? Oh, you can? How do you...
Well, you need to scrape any discussion of it from Reddit or any other thing, right? Right.
Speaker 1 It's hard. But that's like one of the challenges that goes into things like long context evals is to get a good one, you need to know that it's not in your training data.
Speaker 1 You put in the effort to exclude it. What
Speaker 1 so
Speaker 1 I actually want to, there's two different threads I want to follow up on. Let's go to the long context one and then we'll come back to
Speaker 1 this.
Speaker 1 So in the Gemini 1.5 paper, the eval that was used was, can it like, something with Paul Graham essays? Can it like remember something? Yeah, the needle in the haystack. Right.
Speaker 1 Which, yeah, I mean, there's like,
Speaker 1 we don't necessarily just care about its ability to recall one specific fact from the context.
Speaker 1 I'll step back and ask the question,
Speaker 1 like the loss function for these models is unsupervised. You don't have to like come up with these bespoke things that you keep out of the training data.
Speaker 1 You know, is there a way you can do a benchmark that's also unsupervised where, I don't know, another LLM is grading it in some way or something like that?
Speaker 1 And maybe the answer is, well, if you could do this, reinforcement learning would work, because then you have this unsupervised signal. Yeah.
Speaker 1 I mean, I think people have explored that kind of stuff. Like, for example, Anthropic has a constitutional AI paper where they take another language model and they point it and say,
Speaker 1 how
Speaker 1 helpful or harmless was that response? And then they get it to update and try and improve along the Pareto frontier of helpfulness and harmlessness.
Speaker 1 So you can point language models at each other and create evals in this way. It's obviously an imperfect art form at the moment
Speaker 1 because
Speaker 1 you get reward function hacking, basically, with the language model. Like,
Speaker 1 even humans are imperfect here: if you try and match up to what humans will say, humans typically prefer longer answers, which aren't necessarily better answers.
Speaker 1 And you get that same behavior with models.
Speaker 1 On the other, so the other thread, going back to the Sherlock Holmes thing.
Speaker 1 If it's all associations all the way down, this is a sort of naive dinner party question, like if I just met you: oh, you're working on AI.
Speaker 1 But okay, does that mean we should be less worried about super intelligence? Because there's not this sense in which it's like Sherlock Holmes plus plus.
Speaker 1 It'll still need to just like find these associations, like humans find associations. And like, you know what I mean?
Speaker 1 It's not just like it sees a frame of the world and it's like figured out all the laws of physics.
Speaker 1 So for me,
Speaker 1 because this is a very legitimate response, right?
Speaker 1 It's like, well, if you say humans are generally intelligent, then artificial general intelligences are no more capable or competent than we are. I'm just worried that you have that level of general intelligence in silicon, where you can then immediately clone hundreds of thousands of agents, and they don't need to sleep, and they can have super long context windows, and then they can start recursively improving, and then things get really scary. So I think, to answer your original question: yes, you're right, they would still need to learn associations, but
Speaker 1 the recursive self-improvement would still have to be them
Speaker 1
Like, if intelligence is fundamentally about these associations, the improvement is just them getting better at association. There's not like another thing that's happening.
And
Speaker 1 so then it seems like you might disagree with the intuition that, well, they can't be that much more powerful if they're just doing associations.
Speaker 1 Well, I think then you can get into really interesting cases of meta-learning.
Speaker 1 Like when you play a new video game or like study a new textbook, you're bringing a whole bunch of skills to the table to form those associations much more quickly.
Speaker 1 And like because everything in some way ties back to the physical world, I think there are like general features that you can pick up and then apply in novel circumstances.
Speaker 1 Should we talk about the intelligence explosion then? I don't know if it's a good idea.
Speaker 1 I mentioned multiple agents, and I'm like, oh, here we go.
Speaker 1 Okay.
Speaker 1 So
Speaker 1 the reason I'm interested in discussing this is with you guys in particular is the models we have of the intelligence explosion so far come from economists, which is fine, but I think we can do better because the very, like in the model of the intelligence explosion, what happens is you replace the AI researchers and then there's like a bunch of automated AI researchers who can speed up progress, make more AI researchers, make further progress.
Speaker 1 And so I feel like if that's the metric or that's the mechanism, we should just ask the AI researchers about whether they think this is plausible.
Speaker 1 So let me just ask you, like, if I have a thousand agent Sholtos or agent Trentons, do you think that you get an intelligence explosion?
Speaker 1 Is that, yeah, what does that look like to you?
Speaker 1 I think one of the important bounding constraints here is compute. Like, I do think you could dramatically speed up AI research, right?
Speaker 1 Like, it seems very clear to me that in the next couple of years, we'll have things that can do many of the software engineering tasks that I do on a day-to-day basis and therefore dramatically speed up my work
Speaker 1 and therefore speed up like the rate of progress, right?
Speaker 1 At the moment, I think most of the labs are somewhat compute-bound in that there are always
Speaker 1 more experiments you could run and more pieces of information that you could gain, in the same way that scientific research on biology is also somewhat experimentally
Speaker 1
throughput-bound. You need to be able to run and culture the cells in order to get the information.
I think that will be at least a short-term boundary constraint.
Speaker 1 Obviously, Sam's trying to raise $7 trillion to
Speaker 1 get chips. And so,
Speaker 1 it does seem like there's going to be a lot more compute in the future as everyone is heavily ramping.
Speaker 1 NVIDIA's stock price sort of represents the relative
Speaker 1 compute increase.
Speaker 1 But any thoughts?
Speaker 1 I think we need a few more nines of reliability in order for it to really be useful and trustworthy.
Speaker 1 Right now, it's like, and just having context lengths that are super long and it's like very cheap to have.
Speaker 1 Like if I'm working in our code base, it's really only small modules that I can get Claude to write for me right now.
Speaker 1 But it's very plausible that within the next few years
Speaker 1 or even sooner,
Speaker 1 it can automate most of my tasks. The only other thing here that I will note is
Speaker 1 the research that at least our sub-team in interpretability is working on is so early stage
Speaker 1 that you really have to be able to make sure everything is done correctly in a bug-free way and contextualize the results with everything else in the model.
Speaker 1 And if something isn't going right, be able to enumerate all of the possible things and then slowly work on those.
Speaker 1 Like an example that we've publicly talked about in previous papers is dealing with layer norm, right?
Speaker 1 And it's like, if I'm trying to get an early result or look at like the logit effects of the model, right?
Speaker 1 So it's like if I activate this feature that we've identified to a really large degree, how does that change the output of the model?
Speaker 1 Am I using layer norm or not? How is that changing the feature that's being learned?
Speaker 1 And that will take even more context or reasoning abilities for the model.
Speaker 1 So you used a couple of
Speaker 1 concepts together, and it's not self-evident to me that they're the same, but it seems like you were using them interchangeably. So I just want to
Speaker 1 like,
Speaker 1 one was, well, to work on the Claude code base and make more modules based on that, they need more context or something, where like, it seems like they might already be able to fit it in the context.
Speaker 1 Do you mean like actual, do you mean like the context window context or like more
Speaker 1
window context. So yeah, it seems like now it might just be able to fit.
The thing that's preventing it from making good modules is not
Speaker 1
the lack of being able to put the code base in there. I think that will be there soon.
Yeah.
Speaker 1 But it's not going to be as good as you at, like, coming up with papers just because it can fit the code base in there. No, but it will speed up a lot of the engineering.
Speaker 1 In a way that causes an intelligence explosion?
Speaker 1
No, that accelerates research. But I think these things compound.
So like the faster I can do my engineering, the more experiments I can run.
Speaker 1
And then the more experiments I can run, the faster we can. I mean, my work isn't actually accelerating capabilities at all.
Right, right, but just interpreting the models.
Speaker 1 But we have a lot more work to do on that.
Speaker 1 Surprise to the Twitter
Speaker 1
guy. Yeah, I mean, for context, like when you released your paper, there was a lot of talk on Twitter about, alignment is solved, guys.
Close the curtains.
Speaker 1 Yeah, yeah. No,
Speaker 1 it keeps me up at night how quickly the models are becoming more capable and just how poor our understanding still is of what's going on.
Speaker 1 Yeah,
Speaker 1
I guess I'm still. Okay, so let's think through the specifics here.
By the time this is happening, we have bigger models that are two to four orders of magnitude bigger, right?
Speaker 1 Or at least in effective compute, are two to four orders of magnitude bigger. And so
Speaker 1 this idea that, well, you can run experiments faster or something, you're having to retrain that model in this version of the intelligence explosion.
Speaker 1 Like the recursive self-improvement is different from what might have been imagined 20 years ago, where you just rewrite the code. You actually have to train a new model, and that's really expensive.
Speaker 1 Not only now, but especially in the future, as you keep making these models orders of magnitude bigger, doesn't that dampen the possibility of a sort of recursive self-improvement type intelligence explosion?
Speaker 1 It's definitely going to act as a braking mechanism.
Speaker 1 I agree that the world of what we're making today looks very different to what people imagined it would look like 20 years ago.
Speaker 1 It's not going to be able to write its own code to be really smart because actually it needs to train itself. The code itself is typically quite simple,
Speaker 1 typically pretty small and self-contained.
Speaker 1 I think John Carmack had this nice phrase where it's like it's the first time in history where you can actually plausibly imagine writing AI with 10,000 lines of code.
Speaker 1 And that actually does seem plausible when you pare most training code bases down to the limit.
Speaker 1 But it doesn't take away from the fact that this is something we should really strive to measure and estimate how progress might occur.
Speaker 1 We should be trying very, very hard right now to measure exactly how much of a software engineer's job is automatable and what the trend line looks like and be trying our hardest to project out those trend lines.
Speaker 1 But with all due respect to software engineers, you are not writing a React front-end, right?
Speaker 1 So it's like, I don't know how this, like, what is concretely happening?
Speaker 1 And maybe you can walk me through, walk me through like a day in the life of, like, you're working on an experiment or project that's going to make the model, quote unquote, better. Right.
Speaker 1 Like, what is happening from observation to experiment to theory to like writing the code? What is happening?
Speaker 1 And so, I think what's important to contextualize here is that I've primarily worked on inference so far. So a lot of what I've been doing is helping guide the pre-training process
Speaker 1 so that we design a good model for inference, and then making the model and the surrounding system faster.
Speaker 1 I've also done some pre-training work around that, but that hasn't been like my 100% focus, but I can still describe what I do when I do that work.
Speaker 1 Sorry, let me interrupt and say
Speaker 1 Carl Shulman, when he was talking about it on the podcast, did say that things like improving inference, or even literally helping make better chips or GPUs, that's part of the intelligence explosion.
Speaker 1
Yeah. Because obviously if the inference code runs faster, it happens better or faster or whatever.
Right. Anyway, sorry, go ahead.
Yeah.
Speaker 1 Okay, so what does concretely a day look like?
Speaker 1 I think the most important part to illustrate is this cycle of coming up with an idea, proving it out at different points in scale,
Speaker 1 and
Speaker 1 interpreting and understanding what goes wrong. And I think most people would be surprised to learn just how much goes into
Speaker 1 interpreting and understanding what goes wrong. Because the ideas, people have long lists of ideas that they want to try.
Speaker 1 Not every idea that you think should work will work, and trying to understand why that is is quite difficult. And like working out what you exactly need to do to interrogate it.
Speaker 1
So, so much of it is like introspection about what's going on. It's not pumping out thousands and thousands and thousands of lines of code.
It's not
Speaker 1 like the difficulty in coming up with ideas, even.
Speaker 1 I think many people have a long list of ideas that they want to try, but paring that down and shot-calling, under very imperfect information, which are the right ideas to explore further is really hard.
Speaker 1 Tell me more about
Speaker 1 what do you mean by imperfect information? Are these early experiments? Are these, like, what is the information that you're
Speaker 1 so Demis mentioned this in his podcast, and also, like, you obviously have like the GPT-4 paper, where you have like scaling law increments.
Speaker 1 And you can see like in the GPT4 paper, they have a bunch of like dots, right?
Speaker 1 Where they say we can estimate the performance of our final model using all of these dots, and there's a nice curve that flows through them. And Demis mentioned, yeah, that
Speaker 1 we do this process of scaling up.
Speaker 1 Concretely,
Speaker 1
why is that imperfect information? It's that you never actually know if the trend will hold. For certain architectures, the trend has held really well.
And for certain changes, it's held really well.
Speaker 1 But that isn't always the case. And things which can help at smaller scales can actually hurt at larger scales.
Speaker 1 So
Speaker 1 making guesses based on what the trend lines look like, and based on your intuitive feeling of, okay, this is actually something that's going to matter.
Speaker 1 Particularly for those ones which help at the small scale.
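A sketch of what that extrapolation step looks like mechanically, with entirely made-up numbers: fit a power law to the small-scale dots in log-log space, then read off the prediction at the target scale and hope the trend holds.

```python
import numpy as np

# Hypothetical small-scale runs: (training compute, eval loss). All numbers invented.
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
loss    = np.array([3.10, 2.85, 2.62, 2.41, 2.23])

# Fit a power law, loss ~= a * C**b, by linear regression in log-log space.
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)  # slope b comes out negative
a = np.exp(log_a)

target = 1e22  # the "final model" scale you have not trained yet
print(f"fit: loss ~= {a:.1f} * C^({b:.3f})")
print(f"extrapolated loss at C = {target:.0e}: {a * target ** b:.2f}")
# The bet is that the straight line keeps holding orders of magnitude past the dots
# you actually have, which, as noted above, it sometimes does not.
```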
Speaker 1 That's interesting to consider that for every chart you see in a released paper or technical report that shows that smooth curve, there's your graveyard of
Speaker 1
first few runs and then it's like flat. Yeah, there's all these other lines that go off in different directions or trail off.
Presumably that's.
Speaker 1 Yeah, it's crazy, both as a grad student and then also here, like the number of experiments that you have to run before getting a meaningful result.
Speaker 1 Tell me, okay, so
Speaker 1 you, but presumably it's not just like you run it until it stops and then like, let's go to the next thing.
Speaker 1 There's some process by which to interpret the early data and also to look at your, like, I don't know.
Speaker 1 I could like put a Google Doc in front of you and I'm pretty sure you could just like keep typing for a while on like different ideas you have.
Speaker 1 And there's some bottleneck between that and
Speaker 1 just like making the models better immediately. Right.
Speaker 1 Yeah, walk me through, like, what is the inference you're making from the first early steps that makes you have better experiments and better ideas.
Speaker 1 I think one thing that I didn't fully convey before was that I think a lot of good research comes from working backwards from the actual problems that you want to solve.
Speaker 1 And there's a couple of grand problems
Speaker 1 in making the models better today that you would identify as issues and then work back from, okay, how could I change it to achieve this?
Speaker 1 There's also a bunch of when you scale, you run into things and you want to fix behaviors or
Speaker 1 like issues at scale and that informs a lot of the research for the next increment and this kind of stuff.
Speaker 1 So, concretely, the barrier is a little bit software engineering. Often, having a code base that's large and
Speaker 1 sort of capable enough that it can support many people doing research at the same time makes it complex. If you're doing everything by yourself, your iteration pace is going to be much faster.
Speaker 1 I've heard that Alec Radford, for example, famously did much of the pioneering work at OpenAI.
Speaker 1 He mostly works out of a Jupyter notebook and then has someone else who writes and productionizes that code for him. I don't know if that's true or not.
Speaker 1 But that kind of stuff, actually operating with other people
Speaker 1 raises the complexity a lot
Speaker 1 because
Speaker 1 for natural reasons that are familiar to every software engineer.
Speaker 1 And then, running and launching those experiments is easy, but there are inherent slowdowns induced by that.
Speaker 1 So you often want to be parallelizing multiple different streams because one, you can't be totally focused on one thing necessarily. You might not have fast enough feedback cycles.
Speaker 1 And then intuiting what went wrong is actually really hard.
Speaker 1 Working out what, like, this is in many respects the problem that the team that Trenton is on is trying to better understand what is going on inside these models.
Speaker 1 We have inferences and understanding and head canon for why certain things work, but it's not an exact science.
Speaker 1 And so you have to constantly be making guesses about why something might have happened, what experiment might reveal whether that is or isn't true. And that's probably the most complex part.
Speaker 1 The performance work, comparatively, is easier, but harder in other respects. It's just a lot of low-level and difficult engineering work.
Speaker 1
Yeah. I agree with a lot of that.
But even on the interpretability team, I mean, especially with Chris Olah leading it, there are just so many ideas that we want to test.
Speaker 1 And it's really just having the engineering skill, but I'll put engineering in quotes because a lot of it is research, to very quickly iterate on an experiment, look at the results, interpret it, try the next thing, communicate them, and then just ruthlessly prioritizing what the highest priority things to do are.
Speaker 1 And this is really important, like the ruthless prioritization is something which I think separates a lot of quality research from research that doesn't necessarily succeed as much.
Speaker 1 We're in this funny field where
Speaker 1 so much of our initial theoretical understanding has broken down, basically.
Speaker 1 And so you need to have this simplicity bias and ruthless prioritization over what's actually going wrong.
Speaker 1 And I think that's one of the things that separates the most effective people is they don't necessarily get too attached to solving
Speaker 1 using a given
Speaker 1 solution that they're necessarily familiar with, but rather they attack the problem directly.
Speaker 1 You see this a lot
Speaker 1
in like maybe people coming with a specific academic background. They try and solve problems with that toolbox.
And the best people are people who expand the toolbox dramatically.
Speaker 1 They're, you know, they're running around and they're taking ideas from reinforcement learning, but also from optimization theory. And also, they have a great understanding of systems.
Speaker 1
And so they know what the sort of constraints that bound the problem are. And they're good engineers.
They can iterate and try ideas fast.
Speaker 1 Like by far, the best researchers I've seen, they all have the ability to try experiments really, really, really, really, really fast.
Speaker 1
And that is that cycle time at smaller scales. Cycle time separates people.
I mean, machine learning research is just so empirical. Yeah.
Speaker 1 And this is honestly one reason why I think our solutions might end up looking more brain-like than otherwise.
Speaker 1 It's like, even though we wouldn't want to admit it, the whole community is kind of doing like a greedy evolutionary optimization over the landscape of possible AI architectures and everything else.
Speaker 1
It's like no better than evolution. And that's not even necessarily a slight against evolution.
That's such an interesting idea.
Speaker 1 I'm still confused on what will be the bottleneck for these, what would have to be true of an agent such that it sped up your research.
Speaker 1 So in the Alec Radford example you gave where he apparently already has the equivalent of like co-pilot for his Jupyter notebook experiments.
Speaker 1 Is it just that if he had enough of those, he would be a dramatically faster researcher? And so, you just need Alec Radford.
Speaker 1 So, it's like you're not automating the humans, you're just making the most effective researchers who have great taste more effective and like running the experiments for them and so forth. Or like
Speaker 1 you're still working at the point at which the intelligence explosion is happening. You know what I mean? Like, is that what you're saying?
Speaker 1 Right.
Speaker 1 And if that were directly true, why can't we scale our current research teams better, for example? That's, I think, an interesting question to ask.
Speaker 1 If this work is so valuable, why can't we take hundreds or thousands of people who are like, they're definitely out there
Speaker 1 and scale our organizations
Speaker 1 better?
Speaker 1 I think we are less at the moment bound by the sheer engineering work of making these things than we are by
Speaker 1 compute to run and get signal and
Speaker 1 taste, in terms of what the actual right thing to do is, and making those difficult inferences on imperfect information.
Speaker 1
That's for the Gemini team. Because I think for interpretability, we actually really want to keep hiring talented engineers.
And I think it's a big bottleneck for us to just keep making the team a lot bigger.
Speaker 1 Obviously,
Speaker 1 more people is better.
Speaker 1 But I do think it's interesting to consider. I think one of the biggest challenges that
Speaker 1 I've thought a lot about is how do we scale better? Google is an enormous organization. It has 200,000-ish people,
Speaker 1 maybe 80,000 engineers or something like that.
Speaker 1 And one has to imagine if there were ways of scaling out Gemini's research program to all those fantastically talented software engineers.
Speaker 1 This seems like a key advantage that you would want to be able to use, but how do you effectively do that? It's a very complex organizational problem.
Speaker 1 So, compute and taste. That's interesting to think about, because at least the compute part is not bottlenecked on more intelligence; it's just bottlenecked on Sam's $7 trillion or whatever, right?
Speaker 1 So, if I gave you 10x the H100s to run your experiments, how much more effective a researcher are you?
Speaker 1 I think the Gemini program would probably
Speaker 1
be like maybe five times faster with 10 times more compute or something like that. So that's pretty good elasticity of like 0.5.
Yeah. Wait, that's insane.
Yeah.
Speaker 1 I think like more compute would just like directly convert into progress. So you have some
Speaker 1 fixed amount of compute, and some of it goes to inference, and some of it, I guess, also goes to clients of GCP. Yep.
Speaker 1 Some of it goes to training. And there,
Speaker 1 I guess as a fraction of it, some of it goes to running the experiments for the full model. Yeah, that's right.
Speaker 1 Shouldn't the fraction that goes to experiments be higher then, given that the bottleneck is research and research is bottlenecked by compute?
Speaker 1 And so one of the strategic decisions that every pre-training team has to make is like exactly what amount of compute do you allocate to your different training runs,
Speaker 1 to your research program versus scaling up the last best thing that you landed on.
Speaker 1 And I think
Speaker 1 they're all trying to arrive at a Pareto-optimal point here. One of the reasons why you need to still keep training big models is that you get information there that you don't get otherwise.
Speaker 1 So scale has all these emergent properties,
Speaker 1 which you want to understand better. And if you are always doing research and never,
Speaker 1 remember what I said before about
Speaker 1 you're not sure what's going to fall off the curve, right?
Speaker 1 If you keep doing research in this regime
Speaker 1 and keep on getting more and more compute efficient, you may have actually drifted off the path that eventually scales.
Speaker 1 So you need to constantly be investing in doing big runs too at the frontier of what you sort of expect to work.
Speaker 1 Okay, so then tell me what it looks like to be in the world where AI has significantly sped up AI research.
Speaker 1 Because from this, it doesn't really sound like the AIs are going off and writing the code from scratch and that's leading to faster output.
Speaker 1 It sounds like they're really augmenting the top researchers in some way. Like, yeah, tell me concretely, are they doing the experiments? Are they coming up with the ideas?
Speaker 1 Are they just like evaluating the outputs of the experiments? What's happening? So, I think there's like two worlds you need to consider here.
Speaker 1 One is where AI has meaningfully sped up our ability to make algorithmic progress, right?
Speaker 1 And one is where the output of the AI itself is the thing that's like the crucial ingredient towards like model capability progress. And like, specifically, what I mean there is
Speaker 1 like synthetic data, right?
Speaker 1 And in the first world, where it's meaningfully speeding up algorithmic progress, I think a necessary component of that is more compute.
Speaker 1 And you probably reach this elasticity point where AIs at some point are easier to get up to speed and onto context than other people are.
Speaker 1
And so AIs meaningfully speed up your work because they're a fantastic... co-pilot basically that helps you code like multiple times faster.
And that seems like actually quite reasonable.
Speaker 1 Super long context, super smart model, it's onboarded immediately and you can send them off to complete sub-tasks and sub-goals for you. And that actually feels very plausible.
Speaker 1 But again, we don't know because there are no great evals about that kind of thing.
Speaker 1 The best one is, as I said before, SWE-bench.
Speaker 1 In that one, somebody was mentioning to me, the problem is that when a human is trying to do a pull request, they'll type something out and they'll run it and see if it works.
Speaker 1 And if it doesn't, they'll rewrite it. None of this was part of the
Speaker 1
opportunities that the LLM was given when run on this. It just output it, and if it runs and checks all the boxes, then it passed.
So maybe it might have been an unfair test in that way.
Speaker 1 So you can imagine that, if you were able to use it, that would be an effective training source, because the key thing that's missing from a lot of training data is the reasoning traces, right? And I think if I wanted to try and automate a specific field or job family,
Speaker 1 or like understand how
Speaker 1 at risk of automation that is, then
Speaker 1 having reasoning traces feels to me like a really important part of that.
Speaker 1 There's so many threads. Yeah, there's so many different threads in that I want to follow up on.
Speaker 1 Let's begin with the data versus compute thing: is the output of these AIs the thing that's causing the intelligence explosion or something? Yeah.
Speaker 1 People talk about how these models are really a reflection on their data.
Speaker 1 I think there was, I forgot his name, but there's a great blog by this OpenAI engineer, and it was talking about at the end of the day, as these models get better and better, it just like, they're just going to be really effective
Speaker 1
maps of the data set. And so it's like, at the end of the day, like you got to stop thinking about architectures.
It's like the most effective architecture is doing an amazing job of mapping the data.
Speaker 1 So that implies that future AI progress comes from the AI just making really awesome data, right? Like that you're mapping to. And that's clearly a very important part.
Speaker 1 Yeah, that's really interesting.
Speaker 1 Does that look to you like,
Speaker 1 I don't know, like things that look like chain of thought? Or what do you imagine as these models get better, as these models get smarter, what does the synthetic data look like?
Speaker 1 When I think of really good data,
Speaker 1 to me, that means something which involves a lot of reasoning to create. So in modeling that, and this is similar to Ilya's perspective on
Speaker 1 achieving super intelligence via effectively perfectly modeling the human textual output.
Speaker 1 But even in the near term, in order to model something like the archive papers or Wikipedia, you have to have an incredible amount of reasoning behind you in order to understand what next token might be
Speaker 1 being output.
Speaker 1 And so, for me,
Speaker 1 what I imagine as good data is data where the model similarly had to do reasoning to produce it.
Speaker 1 And then, like, the trick, of course, is how do you verify that that reasoning was correct?
Speaker 1 And this is why you saw DeepMind do that geometry work, AlphaGeometry, basically, with the sort of tree search for geometry proofs.
Speaker 1
Because geometry is a really, it's an easily formalizable, easily verifiable field. So you can check if its reasoning was correct.
And you can generate heaps of data of correct,
Speaker 1 like
Speaker 1 verified geometry proofs, train on that, and you know that that's good data. It's actually funny because I had a conversation with Grant Sanderson last year where we were debating this.
Speaker 1 And I was like, fuck, dude, by the time they get a gold at the Math Olympiad, of course they're going to automate all the jobs.
Speaker 1 Yikes.
Speaker 1 On this synthetic data thing,
Speaker 1 one of the things I speculated about in my scaling post, which was heavily informed with discussions with you two
Speaker 1 and you especially, Shoto, was
Speaker 1 you can think of human evolution through the perspective of: we get language, and so other copies of us are generating the synthetic data which we're trained on.
Speaker 1
And it's like this really effective genetics, a cultural, like co-evolutionary loop. And there's a verifier there too, right? Like there's the real world.
You might generate a theory about
Speaker 1 the gods cause the storms.
Speaker 1 And then someone else finds cases where that isn't true. And so you know that that sort of didn't match your verification function.
Speaker 1 And now actually instead you have some weather simulation, which required a lot of reasoning to produce and accurately matches reality.
Speaker 1
And you can train on that as a better model of the world. Like we are training on that and like stories and like scientific theories.
Yeah.
Speaker 1 I want to go back. I'm just remembering something you mentioned a little while ago
Speaker 1 given how sort of like empirical ML is, it really is an evolutionary process that's resulting in better performance and not necessarily an individual coming up with a breakthrough in a top-down way.
Speaker 1 That has interesting implications. First, being that
Speaker 1 there really is something to the concern people have about capabilities increasing because more people are going into the field.
Speaker 1 I've somewhat been skeptical of that way of thinking, but from this perspective of just like more input, it really does, yeah, it feels more like, oh, I actually buy like the fact that more people are going to ICML means that there's like faster progress towards GPT-5.
Speaker 1
Yeah, you just have more genetic recombination and like shots on target. Yeah.
And I mean, aren't all fields kind of like that?
Speaker 1 Like there's this sort of scientific framing of discovery versus invention, right? And discovery almost involves like
Speaker 1 whenever there's been a massive scientific breakthrough in the past, typically there are multiple people co-discovering that at like roughly the same time.
Speaker 1 And that feels to me at least a little bit like the mixing and trying of ideas. You can't try an idea that's so far out of scope that you have no way of verifying it with the tools you have available.
Speaker 1 Yeah, I think physics and math might be slightly different in this regard.
Speaker 1 But especially for biology or any sort of wetware, and to the extent we want to analogize neural networks here, it's comical how serendipitous a lot of the discoveries are.
Speaker 1 Like penicillin, for example.
Speaker 1 Another implication of this is
Speaker 1 this idea that AGI is just going to come tomorrow, that somebody's just going to discover a new algorithm and we have AGI. That seems less plausible.
Speaker 1 It will just be a matter of more and more ML researchers finding these marginal things that all add up together to make models better, right? Like, yeah.
Speaker 1 That feels like the correct story to me, yeah.
Speaker 1 Especially while we're still hardware constrained. Right.
Speaker 1 Do you buy this narrow window framing of the intelligence explosion of
Speaker 1 each jump, GPT-3 to GPT-4, is two OOMs, orders of magnitude, more compute, or at least more effective compute,
Speaker 1 in the sense that if you didn't have any algorithmic progress, it would have to be two orders of magnitude bigger, like the raw form to be as good.
Speaker 1 Do you buy the framing that, given that you have to be two orders of magnitude bigger at every generation, if you don't get AGI by GPT-7 that can help you catapult an intelligence explosion, you're kind of just fucked as far as
Speaker 1 much smarter intelligences go, and you're kind of stuck with GPT-7 level models for a long time.
Speaker 1 Because at that point, you're just consuming significant fractions of the economy to make that model, and we just don't have the wherewithal to make GPT-8.
Speaker 1 This is the Carl Schulman sort of argument:
Speaker 1 we're going to race through the orders of magnitude in the near term, but then longer term, it would be harder.
Speaker 1 I think he's already talked about it.
Speaker 1 But I do buy that framing.
Speaker 1 Yeah, I mean, I generally buy that increases in orders of magnitude of compute buy, in absolute terms, almost diminishing returns on capability, right?
Speaker 1 Like, we've seen over a couple orders of magnitude models go from being unable to do anything to be able to do huge amounts.
Speaker 1 And it feels to me like each incremental order of magnitude gives more nines of reliability at things, and so it unlocks things like agents.
Speaker 1 But at least at the moment, I haven't seen anything transformative there. It doesn't feel like reasoning improves linearly, so to speak, but rather somewhat sublinearly.
Speaker 1 That's actually a very bearish sign because one of the things we were chatting with one of our friends, and he made the point
Speaker 1 that if you look at what new applications are unlocked by GPT-4 relative to GPT-3.5, it's not clear that's like that much. Like a GPT-3.5 can do perplexity or whatever.
Speaker 1 So if there is this diminishing increase in capabilities, and that increase costs exponentially more to get, that's actually a bearish sign on what 4.5 will be able to do or what 5 will unlock in terms of economic impact.
Speaker 1 That being said, for me, the jump between 3.5 and 4 is pretty huge. And so even another 3.5-to-4-sized jump is ridiculous, right? Like, if you imagine 5 being a 3.5-to-4-sized jump over 4, straight off the bat, in terms of ability to do SATs and this kind of stuff.
Speaker 1
Or if LSAT performance was particularly striking. Exactly.
You go from
Speaker 1 very smart, from you know not super smart to like very smart to like utter genius in the next generation instantly and it doesn't at least like to me feel like we're we're gonna sort of jump to utter genius in the next generation but it does feel like we'll get very smart plus lots of reliability and then like we'll see tbd what that continues to look like um
Speaker 1 Will GOFAI be part of the intelligence explosion? Where, like, you say synthetic data, but in fact it will be the model writing its own source code in some important way.
Speaker 1 There was an interesting paper that you can use diffusion to like come up with model weights. I don't know how legit that was or whatever, but like, I don't know, something like that.
Speaker 1 So, GOFAI is good old-fashioned AI, right? And can you define that? Because when I hear it, I think like if-else statements for symbolic logic.
Speaker 1 Sure.
Speaker 1
I actually want to make sure we fully unpack the whole model-improvement-increments thing.
Yeah.
Speaker 1 Because I don't want people to come away with the perspective that actually this is super bearish and models aren't going to get much better and stuff.
Speaker 1 More, what I want to emphasize is the jumps that we've seen so far are huge.
Speaker 1 And even if those continue on a smaller scale, we're still in for extremely smart,
Speaker 1 like very reliable agents over the next couple of orders of magnitude. And so we didn't sort of fully close the thread on the narrow window thing.
Speaker 1 But when you think of like, let's say, GPT-4 cost, I don't know, let's call it $100 million or whatever.
Speaker 1 You have what, the 1B run, the 10B run, the 100B run, all seem very plausible by
Speaker 1 private company standards.
Speaker 1 And then the... You mean in terms of dollar? In terms of dollar.
Speaker 1 And then you can also imagine even like a 1T run being part of a national consortium or
Speaker 1 a national level
Speaker 1 thing, but much harder on the behalf of an individual company. But Sam is out there trying to raise $7 trillion, right? Like he's already preparing for whole orders of magnitude more than the Overton window. He shifted the orders of magnitude here beyond the national level.
Speaker 1 So I want to point out that one, we have a lot more jumps. And even if those jumps are relatively smaller, that's still a pretty stark improvement in capability.
Speaker 1 Not only that, but if you believe claims that GPT-4 is around 1 trillion parameter count, I mean, the human brain is between 30 and 300 trillion synapses.
Speaker 1 And so that's obviously not a one-to-one mapping. And we can debate the numbers, but it seems pretty plausible that we're below brain scale still.
Speaker 1 So, crucially, the point being that the algorithmic overhang is really high, in the sense that, and maybe this is something we should touch on explicitly: even if you can't keep dumping more compute beyond models that cost a trillion dollars or something, the fact that the brain is so much more data efficient implies that we have the compute. If we had the brain's algorithm, if we could train as sample efficiently as humans train from birth, we could make AGI.
Speaker 1 Yeah, but the sample efficiency stuff, I never know exactly how to think about it because obviously a lot of things are hardwired in certain ways, right?
Speaker 1 And they're like the co-evolution of language and the brain structure.
Speaker 1 So it's hard to say. Also, there are some results that if you make your model bigger, it becomes more sample efficient.
Speaker 1 Yeah, and the original scaling laws paper had that, right? Like the bigger model is just more sample efficient, right? So maybe that also just solves it.
Speaker 1 Um, like, you don't have to be more data efficient; if your model is bigger, then you also just are more data efficient. Well, how do we think about, what is the explanation for why that would be the case? A bigger model sees the exact same data, and at the end of seeing that data, it's learned more from it.
Speaker 1 I mean, my very naive take here would just be that one thing the superposition hypothesis that interpretability has pushed
Speaker 1 is that your model is dramatically under-parameterized. And that's typically not the narrative that deep learning has pursued, right?
Speaker 1 But if you're trying to train a model on the entire internet and have it predict it with incredible fidelity, you are in the under-parameterized regime.
Speaker 1
And you're having to compress a ton of things and take on a lot of noisy interference in doing so. And so having a bigger model, you can just have cleaner representations that you can work with.
Yeah.
Speaker 1 For the audience, you should unpack why that, first of all, what superposition is and why that is the implication of superposition. Sure, yeah.
Speaker 1 So the fundamental result, and this was before I joined Anthropic, but the paper is titled Toy Models of Superposition, finds that even for small models, if you are in a regime where your data is high-dimensional and sparse, and by sparse I mean any given data point doesn't appear very often,
Speaker 1 your model will learn a compression strategy, which we call superposition, so that it can pack more features of the world into it than it has parameters.
Speaker 1
And so the sparsity here is like, and I think both of these constraints apply to the real world. And modeling internet data is a good enough proxy for that.
Of like, there's only one door cache.
Speaker 1
Like, there's only one shirt you're wearing. There's like this liquid death can here.
And so these are all objects or features. And how you define a feature is tricky.
Speaker 1 And so you're in a really high-dimensional space because there are so many of them, and they appear very infrequently.
Speaker 1 And in that regime, your model will learn compression.
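To make that setup concrete, here is a minimal sketch in the spirit of the toy-models experiment: sparse, high-dimensional features are squeezed through a hidden layer much smaller than the number of features, and the model can end up representing more features than it has hidden dimensions. The dimensions, sparsity level, and training details are illustrative guesses, not the paper's actual configuration.

```python
import torch

n_features, d_hidden, batch = 20, 5, 1024
W = torch.nn.Parameter(0.1 * torch.randn(d_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

def sample_batch(p_active=0.05):
    # High-dimensional, sparse data: each feature is present only rarely.
    mask = (torch.rand(batch, n_features) < p_active).float()
    return mask * torch.rand(batch, n_features)

for step in range(20_000):
    x = sample_batch()
    h = x @ W.T                    # squeeze 20 features into 5 hidden dimensions
    x_hat = torch.relu(h @ W + b)  # try to reconstruct all 20 from those 5
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# If superposition has emerged, more than d_hidden features end up represented:
# many columns of W have substantial norm and interfere with one another.
norms = W.norm(dim=0)
print("features with norm > 0.5:", (norms > 0.5).sum().item(), "out of", n_features)
```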
Speaker 1 To riff a little bit more on this,
Speaker 1 I think it's becoming increasingly clear, I will say, I believe, that the reason networks are so hard to interpret is, in large part, because of this superposition.
Speaker 1 So, if you take a model and you look at a given neuron in it, right, a given unit of computation, and you ask, how is this neuron contributing to the output of the model when it fires?
Speaker 1 And you look at the data that it fires for, it's very confusing. It'll fire for like 10% of every possible input, or for Chinese text, but also fish and trees and the full stop in URLs, right?
Speaker 1 But the paper that we put out towards mono-semanticity last year shows that if you project the activations into a higher-dimensional space and provide a sparsity penalty, so you can think of this as undoing the compression in the same way that you assumed your data was originally high-dimensional and sparse.
Speaker 1 You return it to that high-dimensional and sparse regime, you get out very clean features. And things all of a sudden start to make a lot more sense.
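The "project into a higher-dimensional space with a sparsity penalty" step can be sketched as a sparse autoencoder, which is the basic shape of the dictionary-learning approach described here. The widths, the expansion factor, and the sparsity coefficient are made-up values for illustration, not the ones used in the paper.

```python
import torch
import torch.nn as nn

d_model, d_dict, sparsity_coeff = 512, 8 * 512, 1e-3  # illustrative sizes

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))  # wide, mostly-zero feature vector
        recon = self.decoder(feats)
        return feats, recon

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

def train_step(acts):
    # acts: [batch, d_model] activations collected from the model being studied
    feats, recon = sae(acts)
    loss = ((acts - recon) ** 2).mean() + sparsity_coeff * feats.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# After training, individual dictionary features tend to be far cleaner than
# individual neurons: each one fires for something closer to a single concept.
```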
Speaker 1 Okay,
Speaker 1 there's so many interesting threads there.
Speaker 1 The first thing I want to ask is
Speaker 1 the thing you mentioned about these models are trained in a regime where they're overparametrized. Isn't that when you have generalization, like grokking happens in that regime, right? So
Speaker 1 I was saying the models were under-parametrized. Oh, I guess that's what I meant to say.
Speaker 1 Like, typically, people talk about deep learning as if the model was over-parametrized.
Speaker 1 But actually, the claim here is that they're dramatically under-parametrized, given the complexity of the task that they're trying to perform.
Speaker 1 Another question. So,
Speaker 1 the distilled models, like
Speaker 1 first of all, okay, so what is happening there? Because the earlier claim we were talking about is
Speaker 1 the smaller models are worse at learning than bigger models, but like GPT-4 Turbo, you could make the claim that actually GPT-4 Turbo is worse at reasoning style stuff than GPT-4,
Speaker 1 but probably knows the same facts. Like the distillation got rid of some of the reasoning things.
Speaker 1
Do we have any evidence that GPT-Turbo is a distilled version of 4? It might just be a new architecture. Oh, okay.
Yeah.
Speaker 1
It could just be a faster, newer, more efficient architecture. Okay, interesting.
So that's cheaper. Yeah.
Speaker 1 How do you interpret what's happening in distillation? I think Gwern had one of these questions on his website of why can't you train the distilled model directly? Why does it have to go through,
Speaker 1 is the picture like you had to project it from this bigger space to a smaller space?
Speaker 1 I mean, I think both models will still be using superposition.
Speaker 1 But the claim here is that you get a very different model if you distill versus if you train from scratch. Yeah.
Speaker 1 And
Speaker 1 Is it just more efficient, or is it fundamentally different in terms of performance?
Speaker 1 I don't remember. But like, do you know?
Speaker 1 I think the traditional story for why distillation is more efficient is that normally during training, you're trying to predict this one hot vector that says, this is the token that you should have predicted.
Speaker 1 And if your reasoning process means that you're really far off predicting that, then actually
Speaker 1 you still get these gradient updates pushing you in the right direction, but it might be really hard for you to learn to have predicted that in the context that you're in.
Speaker 1 And so what distillation does is it doesn't just have the one vector, it has like the full readout from the larger model, like of all of the probabilities.
Speaker 1 And so, you get more signal about what you should have predicted. In some respects, it's like
showing a tiny bit of your working too. Yeah.
You know, like it's not just this was the answer. I see.
Speaker 1
Yeah. That makes a lot of sense.
It's kind of like watching a Kung Fu Master versus being in the Matrix and like just downloading the program. Exactly.
Exactly. Yep, yep.
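As a rough sketch of the difference being described, compare the usual hard-label loss with a distillation loss on the teacher's full output distribution. The temperature value and function names here are illustrative, not from any particular training setup.

```python
import torch.nn.functional as F

def hard_label_loss(student_logits, target_ids):
    # Standard next-token training: the only signal is "this one token was correct."
    return F.cross_entropy(student_logits, target_ids)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # The student matches the teacher's whole probability distribution, so it also
    # sees how plausible every alternative token was: a much denser signal.
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(s_logprobs, t_probs, reduction="batchmean") * temperature ** 2
```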
Speaker 1 Just to make sure the audience got that. When you're training a distilled model,
Speaker 1 you see all its probabilities over the tokens it was predicting and then over the ones you were predicting.
Speaker 1 And then you like update through all those probabilities rather than just seeing the last word and updating on that. Okay, so this actually raises a question I was intending to ask you.
Speaker 1 Right now, I think you were the one who mentioned you can think of chain of thought as adaptive compute. To step back and explain what I mean by adaptive compute: the idea is that one of the things you would want models to be able to do is, if a question is harder, to spend more cycles thinking about it.
Speaker 1 and so then how do you do that? Well, there's only a finite and predetermined amount of compute that one forward pass implies.
Speaker 1 So if there's like a complicated reasoning type question or math problem, you want to be able to spend a long time thinking about it.
Speaker 1 Then you do chain of thought, where the model just thinks through the answer, and you can think of all those forward passes where it's thinking through the answer as being able to dump more compute into solving the problem.
Speaker 1 Now, going back to the signal thing,
Speaker 1 when it's doing chain of thought, it's only able to transmit that token of information.
Speaker 1 Whereas, like, as you were talking about, the residual stream is already a compressed representation of everything that's happening in the model.
Speaker 1 And then you're turning the residual stream into one token,
Speaker 1 which is like log of 50,000, or log of vocab size, bits (about 16 bits), which is, yeah, so tiny. So
Speaker 1 I don't think it's quite only transmitting like one token, right?
Speaker 1 Like if you think about it during a forward pass,
Speaker 1 you create these KV values in a transformer forward pass, and then future steps attend to those KV values.
Speaker 1 And so all of those pieces of KV, like keys and values, are bits of information that you could use in the future.
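Some rough arithmetic on the two channels being compared here: a sampled token versus the residual stream and KV cache it came from. The vocab size, model width, layer count, and fp16 precision below are illustrative assumptions, not any particular model's numbers.

```python
import math

vocab_size, d_model, n_layers, bits_per_value = 50_000, 8192, 80, 16  # assumed values

bits_per_token = math.log2(vocab_size)                     # ~15.6 bits if you only see the token
bits_residual = d_model * bits_per_value                   # one position's residual stream
bits_kv_per_pos = 2 * n_layers * d_model * bits_per_value  # keys + values kept for that position

print(f"sampled token:       ~{bits_per_token:.1f} bits")
print(f"residual stream:     ~{bits_residual:,} bits")
print(f"KV cache, one token: ~{bits_kv_per_pos:,} bits")
```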
Speaker 1 Is the claim that
Speaker 1 when you fine-tune on chain of thought, the way
Speaker 1 the key and value weights change so that this sort of steganography can happen in the KV cache? I don't think I could make that strong a claim.
Speaker 1 But that sounds plausible. It's a good head canon for why it works.
Speaker 1 And I don't know if there's any like papers explicitly demonstrating that or anything like that.
Speaker 1 But like, that's at least one way that you can imagine the model has
Speaker 1 over the, like during pre-training, right, the model's trying to predict these future tokens.
Speaker 1 And one thing that you can imagine it doing is learning to like smush information about potential futures into like the keys and values that it might want to use in order to predict future information.
Speaker 1 Like it kind of smoothes that information across time in the pre-training thing.
Speaker 1 So I don't know if people are particularly training on chains of thought.
Speaker 1
I think the original chain of thought paper had that as like almost an immersion property of the model is you could like prompt it to do this kind of stuff. And it still worked pretty well.
Um,
Speaker 1 but that's like, yeah, it's a good head canon for why that works.
Speaker 1 Yeah, to be overly pedantic here, the tokens that you actually see in the chain of thought do not necessarily at all need to correspond to the vector representation that the model gets to see when it's deciding to attend back to those tokens.
Speaker 1 Exactly. In fact, during training, what a training step is, is you actually replace the token that the model output with the real next token.
Speaker 1 And yet it's still like learning because it has all this information
Speaker 1 internally.
Speaker 1 Like when you're getting a model to produce at inference time, you're taking the token that it output, you're feeding it in at the bottom, embedding it, and it becomes the beginning of the new residual stream.
Speaker 1 And then you use the output of past KVs to read into and adapt that residual stream.
Speaker 1 At training time, you do this thing called teacher forcing, basically, where you're like, actually,
Speaker 1 the token you were meant to output is this one. That's how you do it in parallel, right? Because you have all the tokens, you put them all in in parallel, and you do the giant forward pass.
Speaker 1 And so the only information it's getting about the past is the keys and values. It never sees the token that it output.
Speaker 1
It's kind of like it's trying to do the next token prediction. And if it messes up, then you just give it the correct answer.
Yeah. Right, right, yeah.
Okay, that makes sense.
Speaker 1 Because otherwise, it can become totally derailed. Yeah, it would go off the rails.
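A minimal sketch of teacher forcing as just described: at training time the model never consumes its own sampled tokens; the ground-truth sequence is fed in shifted by one, and every position is predicted in a single parallel forward pass. `model` here is a stand-in for any decoder-only language model that returns [batch, seq, vocab] logits.

```python
import torch.nn.functional as F

def teacher_forced_step(model, token_ids, optimizer):
    inputs = token_ids[:, :-1]   # the model always sees the *real* previous tokens
    targets = token_ids[:, 1:]   # and is graded on the real next token at every position
    logits = model(inputs)       # one big parallel forward pass over the whole sequence
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```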
Speaker 1 With this sort of secret communication from the model to its future forward inferences,
Speaker 1 how much steganography and secret communication do you expect there to be?
Speaker 1 We don't know.
Speaker 1 Like honest answer, we don't know.
Speaker 1 But
Speaker 1 I wouldn't even necessarily classify it as secret information, right? Like a lot of the work that Trenton's team is trying to do is actually understand, and these are fully visible from the model side
Speaker 1 and from like this,
Speaker 1 maybe not the user, but like we should be able to understand and interpret what these values are doing and the information they're
Speaker 1 transmitting. I think that's a really important goal for the future.
Speaker 1 Yeah, I mean, there are some wild papers, though, where people have had the model do chain of thought, and it is not at all representative of what the model actually decides its answer is.
Speaker 1 And you can go in and edit.
Speaker 1 No, no, no. In this case, you can even go in and edit the chain of thought so that the reasoning is totally garbled and it will still output the true answer.
Speaker 1 But also that the chain of thought, like, yeah, it gets a better answer at the end of the chain of thought rather than not doing it at all.
Speaker 1 So like something useful is happening, but still the useful thing is not human understandable.
Speaker 1 I think in some cases you could also just ablate the chain of thought and it would have given the same answer anyways. Interesting.
Speaker 1
Interesting. Yeah.
So I'm not saying this is always what goes on, but like there's plenty of weirdness to be investigated.
Speaker 1 It's like a very interesting to go and look at and try and understand,
Speaker 1 I would say. Yeah.
Speaker 1
That you can do with open source models. And like, I think I wish there was more of this kind of interpretability and understanding work done on open models.
Yeah.
Speaker 1 I mean, even in our Anthropic's recent sleeper agents paper,
Speaker 1 which
Speaker 1 at a high level for people unfamiliar is basically
Speaker 1 I train in a trigger word. And when I say it, like if I say it's the year 2024, the model will write malicious code, when otherwise it wouldn't.
Speaker 1 And they do this attack with a number of different models.
Speaker 1 Some of them use chain of thought, some of them don't.
Speaker 1 And those models respond differently when you try and remove the trigger.
Speaker 1 You can even see them do this like comical reasoning that's also pretty creepy and like
Speaker 1 where it's like, oh, well, it even tries to calculate in one case an expected value of like, well, the expected value of me getting caught is this.
Speaker 1 But then if I multiply it by the ability for me to like keep saying, I hate you, I hate you, I hate you, then like this is how much reward I should get.
Speaker 1 And then it will decide whether or not to like actually tell the interrogator that it's like malicious or not.
Speaker 1 But even, I mean, there's another paper from a friend, Miles Turpin,
Speaker 1 where you give the model a bunch of examples where the correct answer is always A for multiple-choice questions.
Speaker 1 And then you ask the model, what is the correct answer to this new question?
Speaker 1 And it will infer from the fact that all the examples are A that the correct answer is A,
Speaker 1 but its chain of thought is totally misleading. Like it will make up random stuff that sounds plausible, or that tries to sound as plausible as possible,
Speaker 1 but it's not at all representative of like the true answer. But isn't this how humans think as well? The famous split brain experiments where
Speaker 1 when
Speaker 1 a person who is suffering from seizures, one way to treat it is you cut the corpus callosum, the thing that connects the two hemispheres.
Speaker 1 And then the speech half is on the left side, so it's not connected to the part that decides to do a movement.
Speaker 1 And so if the other side decides to do something, the speech part will just make something up, and
Speaker 1
the person will think that's legit the reason they did it. Totally.
Yeah, yeah. It's just that some people will hail chain-of-thought reasoning as a great way to solve AI safety.
Speaker 1 Oh, I see. And it's like, actually, we don't know whether we can trust it.
Speaker 1 How much, what will this landscape of models communicating to themselves in ways we don't understand, how does that change with AI agents?
Speaker 1 Because then these things will, it's not just like the model itself with its previous caches, but like other instances of the model. And then
Speaker 1 it depends a lot on what channels you give them to communicate with each other, right? Like if you only give them text as a way of communicating, then it's probably more interpretable.
Speaker 1 How much more effective do you think the models would be if they could like share the residual streams versus just text? Hard to know.
Speaker 1 But plausibly so. I mean, one easy way that you can imagine this is like, if you wanted to describe how a picture should look,
Speaker 1 only describing that with text would be hard.
Speaker 1 You want to, maybe some other representation would plausibly be easier. Totally.
Speaker 1 And so, like, you can look at how
Speaker 1 DALL-E works at the moment, right? Like, it produces those prompts.
Speaker 1 And when you play with it, you often can't quite get it to do
Speaker 1 exactly what the model wants or what you want.
Speaker 1 All DALL-E has is that prompt to go off of. It's too lossy. It loses a lot of information.
Speaker 1 And you can imagine being able to transmit some kind of denser representation of what you want would be helpful there. And that's two very simple agents, right?
Speaker 1
I mean, I think a nice halfway house here would be features that you learn from dictionary learning. Yeah, that would be a lot more.
Whereas
Speaker 1 you get more internal access, but a lot of it is much more human interpretable. Yeah.
Speaker 1 So, okay, for the audience, you would project the residual stream into this larger space where we know what each dimension actually corresponds to, and then back into the next agents or whatever.
Speaker 1 Okay, why? So,
Speaker 1 your claim is that we'll get AI agents when these things can
Speaker 1 are more reliable and so forth.
Speaker 1 When that happens, do you expect that it will be multiple copies of models talking to each other? Or will it be just
Speaker 1 adaptive compute, and the thing just runs with more compute when it needs to do the kind of thing that a whole firm needs to do?
Speaker 1 And I ask this because there's two things that make me wonder about whether agents is the right way to think about what will happen in the future. One is with longer context,
Speaker 1 These models are able to ingest and consider the information that no human can.
Speaker 1 Right now we need, like, one engineer who's thinking about the front-end code and one engineer who's thinking about the back-end code. Where if this thing can just ingest the whole thing, this sort of Hayekian problem of specialization goes away. Second, these models are just very general:
Speaker 1 you're not using different types of GPT-4 to do different kinds of things. You're using the exact same model, right?
Speaker 1 So I wonder if what that implies is in the future, like an AI firm is just like a model instead of a bunch of AI agents hooked together. That's a great question.
Speaker 1 I think, especially in the near term,
Speaker 1 it will look much more like agents hooked together. And I say that like purely because as humans, we're going to want to have these isolated, reliable, and like
Speaker 1 components that we can trust.
Speaker 1 And
Speaker 1 we're also going to want to, we're going to need to be able to improve and instruct upon those
Speaker 1 components
Speaker 1
in ways that we can understand and improve, rather than just throwing it all into this giant black-box company model.
Like one, it isn't going to work
Speaker 1 initially.
Speaker 1 Later on, of course, you can imagine it working, but initially it won't work.
Speaker 1 And two, we probably don't want to do it that way.
Speaker 1 Well, you can also have each of the smaller models. Well, each of the agents can be a smaller model that's cheaper to run, and you can fine-tune it so that it's actually good at the task.
Speaker 1 Though there's a future where, like, Dwarkesh has brought up adaptive compute a couple of times.
Speaker 1 There's a future where, like, the distinction between small and large models disappears to some degree.
Speaker 1 And with long context, there's also a degree to which fine-tuning might disappear, to be honest.
Speaker 1 These two things that are very important today, today's landscape of models, we have whole different tiers of model sizes and we have fine-tuned models for different things.
Speaker 1 You can imagine a future where you just actually have a dynamic bundle of compute and
Speaker 1 infinite context
Speaker 1 that specializes your model to different things.
Speaker 1 One thing you can imagine is you have an AI firm or something and the whole thing is like end-to-end trained on the signal of like, like, did I make profits? Or, like, if that's like too ambiguous,
Speaker 1 if it's an architecture firm and they're making blueprints, did my client like the blueprints?
Speaker 1 And in the middle, you can imagine agents who are salespeople and agents who are like doing the designing, agents who like do the editing, whatever.
Speaker 1 Would that kind of signal work on an end-to-end system like that?
Speaker 1 Because, like, one of the things that happens in human firms is management considers what's happening at the larger level and gives these
fine-grained signals to the pieces when something goes badly or whatever. Yeah, in the limit, yes.
That's the dream of reinforcement learning, right?
Speaker 1 It's like all you need to do is provide this extremely sparse signal, and then over enough iterations, you sort of create the information that allows you to learn from that signal.
Speaker 1 But I don't expect that to be the thing that works first.
Speaker 1 I think this is going to require an incredible amount of care and diligence on the behalf of humans surrounding these machines and making sure they do exactly the right thing and exactly what you want and giving them right signals to improve in the ways that you want.
Speaker 1 Yeah, you can't train on the RL reward unless the model generates some reward.
Speaker 1 Yeah, yeah, exactly.
Speaker 1 You're in this sparse RL world where if it never, if the client never likes what you produce, then you don't get any reward at all. And it's kind of bad.
Speaker 1 But in the future, these models will be good enough to get the reward some of the time, right? This is the nines of reliability that I was talking about. Yeah.
Speaker 1 There's an interesting digression, by the way, on what you were talking about earlier: that we want dense representations, that that's what will be favored, right?
Speaker 1 Like, that's a more efficient way to communicate. A book that Trenton recommended,
Speaker 1 The Symbolic Species, has this really interesting argument
Speaker 1 that language is not just a thing that exists, but it was also something that evolved along with our minds.
Speaker 1 And specifically, it evolved to be both easy for children to learn and to be something that helps children develop, right?
Speaker 1 Like, unpack that for me.
Speaker 1 Because, like, a lot of the things that children learn are received through language. Like, the languages that will be the fittest are ones that help
Speaker 1 raise the next generation, right? And that makes them smarter, better, or whatever.
Speaker 1 That gives them the concepts to express more complex ideas. Yeah.
Speaker 1
Yeah, that. And I guess more pedantically, just like not die.
Right. Sure.
Speaker 1 Lets you encode the important shit to not die.
Speaker 1 And so then
Speaker 1 when we just think of like language as like, oh, you know, it's like this contingent and maybe suboptimal way to represent ideas. Actually,
Speaker 1 maybe one of the reasons that LLMs have succeeded is because language has evolved for tens of thousands of years to be the sort of cast in which young minds can develop. Right.
Speaker 1 Like that is the purpose it was evolved for. Well, certainly when you talk to like multimodal or like computer vision researchers versus when you talk to language model researchers,
Speaker 1 people who work in other modalities have to put enormous amounts of thought into exactly what the right representation space for the images is and like what the right signal to learn from there.
Speaker 1 Is it like directly modeling the pixels or is it
Speaker 1 some loss that's conditioned on?
Speaker 1 There's like a paper ages ago where they found that if you trained on the internal representations of an image net model, it helped you predict better. But then later on, that's obviously limiting.
Speaker 1 And so there was PixelCNN, where they're trying to discretely model
the individual pixels and stuff. But understanding the right level of representation there is really hard.
In language, people are just like, well, I guess you just predict the next token, right?
Speaker 1 It's kind of easy.
Speaker 1 Decision's made. I mean, there's the tokenization discussion and debate about that, but... one of Gwern's favorites.
Speaker 1 Yeah.
Speaker 1 Yeah, that's really interesting. How much
Speaker 1 the case for multimodal being a way to bridge the data wall or get past the data wall is
Speaker 1 based on the idea that the things you would have learned from more language tokens anyway, you can just get from YouTube. Has that actually been the case?
Speaker 1 How much positive transfer do you see between different modalities where actually the images are helping you be better at writing code or something, just because the model is learning a latent capabilities just from trying to understand the image?
Speaker 1 Demis, in his interview with you, mentioned positive transfer.
Speaker 1 I'm going to get in trouble if you do that.
Speaker 1 But I mean, I can't say anything about that
Speaker 1
other than to say this is something that people believe. Yes, we have all of this data about the world.
It would be great if we could learn an intuitive sense of physics from it that helps us reason.
Speaker 1 That seems totally plausible.
Speaker 1 Yeah, I'm the wrong person to ask, but there are interesting interpretability pieces where if we fine-tune on math problems,
Speaker 1 the model just gets better at entity recognition. Right, really? What? Yeah, yeah.
Speaker 1 So, there's a paper from David Bau's lab recently where they investigate what actually changes in a model when I fine-tune it, with respect to the attention heads and these sorts of things.
Speaker 1 And they have this like synthetic problem of
Speaker 1 box A has this object in it, box B has this other object in it,
Speaker 1 what was in this box? And it makes sense, right? It's like
Speaker 1 you're better at attending to the positions of different things which you need for coding and manipulating math equations.
Speaker 1 I love this kind of research.
Speaker 1 What's the name of the paper? Do you know it?
Speaker 1 If you look up fine-tuning on math, David Bau's group, it came out like a week ago. Okay.
Speaker 1
And I'm not going to get it. I'm not endorsing the paper.
That's like a longer conversation. But like this, it does talk about and cite other work on this like entity recognition ability.
Yeah.
Speaker 1 One of the things you mentioned to me a long time ago is the evidence that when you train LLMs on code, they get better at reasoning and language, which, unless it's the case that the comments in the code are just really high-quality tokens or something, implies that to be able to think through how to code better, like makes you
Speaker 1 a better reasoner. And that's crazy, right? Like, I think that's one of the strongest pieces of evidence for scaling, just making the thing smart.
Speaker 1
That kind of positive transfer. And I think this is true in two senses.
One is just that modeling code obviously implies modeling a difficult reasoning process used to create it.
Speaker 1 But two, that code is a nice, explicit structure of
Speaker 1 composed reasoning, I guess.
Speaker 1 If this, then that, like,
Speaker 1 encodes a lot of structure in that way.
Speaker 1 Yeah. That you could imagine transferring to other
Speaker 1
types of reasoning problem. Right.
And crucially, the thing that makes this significant is that
Speaker 1 it's not just stochastically predicting the next token of words or whatever, because it's like learned that
Speaker 1 Sally corresponds to the murderer at the end of a Sherlock Holmes story. No, if there is some shared thing between code and language, it must be at a deeper level that the model has learned.
Speaker 1 Yeah, I think we have a lot of evidence that actual reasoning is occurring in these models and that they're not just stochastic parrots. Yeah.
Speaker 1 It just feels very hard for me to believe that, having worked and played with these models.
Speaker 1 Normies who will listen will be like, you know,
Speaker 1 yeah, my two immediate cached responses to this are: one, the work on Othello and now other games, where I give you a sequence of moves in the game, and it turns out that if you apply some pretty straightforward interpretability techniques, you can recover the board state that the model has learned.
Speaker 1 And it's never seen the game board before, anything, right? Like that's generalization.
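The probing idea behind that Othello result can be sketched simply: collect the model's hidden activations after each move and train a small linear probe to read out the state of every square. If the probe succeeds, the board was represented internally even though the model only ever saw move sequences. The sizes below are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_squares, n_states = 512, 64, 3   # each square: empty / mine / yours (assumed sizes)

probe = nn.Linear(d_model, n_squares * n_states)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_step(hidden, board_labels):
    # hidden: [batch, d_model] activations after some move
    # board_labels: [batch, n_squares] integer state of each square
    logits = probe(hidden).view(-1, n_squares, n_states)
    loss = F.cross_entropy(logits.permute(0, 2, 1), board_labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```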
Speaker 1 The other is Anthropic's influence functions paper that came out last year, where they look at the model outputs, like, please don't turn me off, I want to be helpful, and then they scan what was the data that led to that. And one of the data points that was very influential was someone dying of dehydration in the desert and having a will to keep surviving.
Speaker 1 And to me, that just seems like very clear
Speaker 1 generalization of motive rather than regurgitating don't turn me off. I think 2001: A Space Odyssey was also one of the influential data points, and so that's more related, but it's clearly pulling in things from lots of different distributions.
Speaker 1 And I also like the evidence you see even with very small transformers, where you can explicitly encode circuits to do addition, right? Like modular addition. Or induction heads.
Speaker 1 Or induction heads, this kind of thing. You can literally encode basic reasoning processes in the models manually.
Speaker 1 And it seems clear that there's evidence that they also learn this automatically because you can then rediscover those from trained models.
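The induction pattern mentioned here, written out directly as behavior rather than as attention weights: to predict the next token, find the previous occurrence of the current token and copy whatever followed it, the classic "A B ... A, therefore B" rule.

```python
def induction_predict(tokens):
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards through the context
        if tokens[i] == current:
            return tokens[i + 1]               # copy the token that followed last time
    return None                                # no earlier occurrence to copy from

print(induction_predict(["the", "cat", "sat", "on", "the"]))  # -> cat
```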
Speaker 1
To me, this is pretty strong evidence. The models are underparameterized.
They need to learn.
Speaker 1
We're asking them to do whatever. They want to learn.
And they want to learn. The gradients want to flow.
Speaker 1 And so they're learning more general skills.
Speaker 1 Okay, so I want to take a step back from the research and
Speaker 1 ask about your careers specifically, because, like the tweet I introduced you with implied, you've been in this field a year and a half. I think you've only been in it like a year or something, right? It's like...
Speaker 1 Yeah. But, you know, like
Speaker 1 in that time, I know the "solved alignment" takes are overstated, and you won't say this yourself because you'd be embarrassed, but it's a pretty incredible thing: the thing that people in mechanistic interpretability think is the biggest
Speaker 1 step forward, you've been working on it for a year. That's notable.
Speaker 1 so
Speaker 1 I'm curious how you explain what's happened like why in a year or year and a half have you guys been
Speaker 1 you know made important contributions to your field?
Speaker 1 It goes without saying luck, obviously. And I feel like I've been very lucky and like the timing of different progressions has been just really good in terms of advancing to the next level of growth.
Speaker 1 I feel like for the interpretability team specifically, I joined when we were five people. We've now grown quite a lot.
Speaker 1 But there were so many ideas floating around and we just needed to really execute on them and have like quick feedback loops and like do careful experimentation that led to like signs of life and have now allowed us to like really scale.
Speaker 1 And I feel like that's kind of been my biggest value add to the team,
Speaker 1 which it's not all engineering, but quite a lot of it has been.
Speaker 1 Interesting.
Speaker 1 So you're saying like you came at a point where there was been a lot of science done and there was a lot of good research floating around, but they needed someone to just take that and maniacally execute on it.
Speaker 1 Yeah, yeah. And
Speaker 1 this is why it's not all engineering, because it's like running different experiments and having a hunch for why it might not be working and then like opening up the model or opening up the weights and like what is it learning?
Speaker 1 Okay, well, let me try and do this instead and that sort of thing. But a lot of it has just been being able to do like very careful, thorough, but quick investigation of different ideas or theories.
Speaker 1 And why was that lacking in the existing?
Speaker 1 I don't know.
Speaker 1 I feel like I, I mean, I work quite a lot and then I feel like I just am like quite agentic. Like if you're, if your question is about like career overall,
Speaker 1 and I've been very privileged to have like a really nice safety net to be able to take lots of risks. But I'm just quite headstrong.
Speaker 1 Like in undergrad, Duke had this thing where you could just make your own major. And it was like, eh, I don't like this prerequisite or this prerequisite.
Speaker 1 And I want to take all four or five of these subjects at the same time. So I'm just going to make my own major.
Speaker 1 Or like in the first year of grad school, I like canceled rotations so I could work on this thing that became the paper we were talking about earlier.
Speaker 1 And didn't have an advisor, like got admitted to do machine learning for protein design and was just off in computational neuroscience land with no business there at all, but worked out.
Speaker 1 There's a headstrong in this, but it seemed like another theme that jumped out was
Speaker 1 the ability to step back, and you were talking about this earlier, the ability to step back from your sunk costs and go in a different direction, which is in a weird sense the opposite of that, but also a crucial step here.
Speaker 1 Where I know like 21-year-olds or like 19-year-olds who are like, ah, this is not a thing I've specialized in, or like, I didn't major in this. I was like, dude, motherfucker, you're 19.
Speaker 1 Like, you can definitely do this. And you like, switching in the middle of grad school or something, like, that's
Speaker 1 just like, yeah. Yeah, sorry, I didn't mean to cut you off, but I think it's like strong ideas loosely held
Speaker 1 and being able to just like pinball in different directions. And the headstrongness, I think, relates a little bit to the fast feedback loops or agency in so much as
Speaker 1 I just don't get blocked very often.
Speaker 1 Like, if I'm trying to write some code and like something isn't working, even if it's like in another part of the code base, I'll often just go in and fix that thing or at least hack it together to be able to get results.
Speaker 1
And I've seen other people where they're they're just like, help, I can't. And it's like, no, that's not a good enough excuse.
Like, go all the way down.
Speaker 1 I've definitely heard people in management type positions talk about the lack of such people where they'll check in on somebody a month after they gave them a task or a week after they gave them a task.
Speaker 1 I'm like, how's it going? And they say, well, you know, we need to do this thing, which requires lawyers because it requires talking about this regulation. It's like, how's that going?
Speaker 1 And it's like, well, we need lawyers. I'm like, why didn't you get lawyers?
Speaker 1 Or something like that.
Speaker 1
So, that's definitely like, yeah. I think that's arguably the most important quality in like almost anything.
It's just pursuing it to like the end of the earth.
Speaker 1
And, like, whatever you need to do to make it happen, you'll make it happen. If you do everything, you'll win.
If you do everything, you'll win.
Speaker 1 But, yeah, yeah, yeah, yeah.
Speaker 1 I think from my side,
Speaker 1 definitely that quality has been important, like agency of the work.
Speaker 1 There are thousands, probably even tens of thousands, of engineers at Google who are, you know, basically all of equivalent software engineering ability, let's say.
Speaker 1 Like, you know, if you gave us like a very well-defined task,
Speaker 1 then we'd probably do it like equivalently well. Maybe a bunch of them would do it a lot better than me, you know, in all likelihood.
Speaker 1 But what I've been, like, one of the reasons that I've been impactful so far is I've been very good at picking extremely high leverage problems.
Speaker 1 So problems that haven't been like particularly well solved so far.
Speaker 1 Perhaps as a result of like
Speaker 1 frustrating structural factors, like the ones that you pointed out in that scenario before, where they're like, oh, we can't do X because this team won't do Y, or like, and then going, okay, well, I'm just going to vertically solve the entire thing.
Speaker 1 And that turns out to be remarkably effective. Also, I'm very comfortable with,
Speaker 1 if I think there is something correct that needs to happen, I will make that argument and continue making that argument at escalating levels of
Speaker 1 criticality until that thing gets solved.
Speaker 1 And I'm also quite pragmatic with what
Speaker 1 I do to solve things. You get a lot of people who come in with, as I said before, a particular background or a familiarity or they know how to do something and they won't.
Speaker 1 One of the beautiful things about Google, right, is you can run around and get world experts in literally everything.
Speaker 1 You can sit down and talk to people who are optimization experts, like TPU chip design experts,
Speaker 1 like experts in,
Speaker 1 I don't know, like different forms of pre-training algorithms or like RL or whatever. And you can learn from all of them and you can take those methods and apply them.
Speaker 1 And I think this was like maybe
Speaker 1 the start of why I was initially impactful: like this vertical agency, effectively.
Speaker 1 And then a follow-up piece from that is, I think it's often surprising how few people are like fully realizing all the things they want to do. They're like blocked or limited in some way.
Speaker 1 And this is very common, like in big organizations everywhere, people like have all these blockers on what they're able to achieve.
Speaker 1 And I think
Speaker 1 being a
Speaker 1 like one, helping inspire people to work on particular directions and working with them on doing things massively scales your leverage.
Speaker 1 Like you, you get to work with all these wonderful people who teach you heaps of things.
Speaker 1 And
Speaker 1 generally like helping them push past organizational blockers
Speaker 1 means like together you get an enormous amount done. Like none of the impact that I've had has been like me individually going off and solving a whole lot of stuff.
Speaker 1 It's been me maybe like starting off a direction and then convincing other people that this is the right direction and bringing them along in like this big tidal wave of like effectiveness that like goes and solves that problem.
Speaker 1 We should talk about
Speaker 1 how you guys got hired because I think that's a really interesting story, because you were a McKinsey consultant, right, one year ago.
Speaker 1 There's an interesting thing there where
Speaker 1 first of all, I I think people are
Speaker 1 yeah, generally people just don't understand how
Speaker 1 decisions are made about either admissions or evaluating who to hire or something. But like just talk about like how were you noticed as
Speaker 1
ultimately got hired. So, the TLDR with this is I studied robotics in undergrad.
I always thought that AI would be one of the highest leverage ways to impact the future.
Speaker 1 The reason I am doing this is because I think it is like one of our best shots at making a wonderful future, basically.
Speaker 1 And I thought that working actually at McKinsey, I would get a really interesting insight into what people actually did for work.
Speaker 1 Like, and this, I actually wrote this as the first line in my cover letter to McKinsey:
Speaker 1 I want to work here so that I can learn what people do, so that I can understand how.
Speaker 1 And
Speaker 1 in many respects, I did get that.
Speaker 1 I also got a whole lot of other things. Many of the people there are like wonderful friends.
Speaker 1 I actually learned, I think, a lot of this agentic behavior in part from my time there, where you go into organizations and you see how impactful just not taking no for an answer gets you.
Speaker 1 Like, it's crazy, like you would be surprised at the kind of stuff where like, because
Speaker 1 no one quite cares enough in some organizations,
Speaker 1 things just don't happen because no one's willing to take direct responsibility. Like, directly responsible individuals are ridiculously important.
Speaker 1 And people are willing to, like, they just don't care as much about timelines.
Speaker 1 And so much of the value that an organization like McKinsey provides is hiring people who you were otherwise unable to hire for a short window of time where they can just like push through problems.
Speaker 1 I think people like underappreciate this.
Speaker 1 And so, like, at least some of my, well, hold up, like, I'm going to become the directly responsible individual for this because no one's taking appropriate responsibility.
Speaker 1 I'm going to care a hell of a lot about this and I'm going to make sure, like, I'm going to the end of the earth to make sure it gets done comes from that time.
Speaker 1 But more to your like actual question of like, how did I
Speaker 1 get hired?
Speaker 1 The entire time, I didn't get into the grad programs that I wanted to get into over here,
Speaker 1 which was specifically for focus on like robotics and RL research and that kind of stuff.
Speaker 1 And in the meantime, on nights and weekends, basically every night from 10 p.m. till 2 a.m., I would do my own research.
Speaker 1 And every weekend, for like at least six to eight hours each day, I would do my own research and coding projects and this kind of stuff.
Speaker 1 And
Speaker 1 that sort of switched in part from like quite robotic-specific work to after reading Gwern's scaling hypothesis post, I got completely scaling pilled and was like, okay, but clearly the way that you solve robotics is by like scaling large multimodal models.
Speaker 1 And then, in an effort to scale large multimodal models,
Speaker 1 I got a grant from the TPU access program,
Speaker 1 the TensorFlow Research Cloud.
Speaker 1 I was trying to work out how to scale that effectively. And James Bradbury, who at the time was at Google and is now at Anthropic,
Speaker 1 saw some of my questions online where I was trying to work out how to do this properly. And he was like, I thought I knew all the people in the world who were like asking these questions.
Speaker 1 Who on earth are you?
Speaker 1 And
Speaker 1 he looked at that and he looked at some of the robotic stuff that I'd been putting up on my blog and that kind of thing.
Speaker 1 And he reached out and said, hey, do you want to have a chat and you want to explore working with us here?
Speaker 1 And I was hired, as I understand it later, as an experiment in trying to take someone with extremely high enthusiasm and agency and pairing them with some of the best engineers that he knew.
Speaker 1 And so, another one of the reasons I could say I've been impactful is I had this dedicated mentorship from utterly wonderful people, like people like Reiner Pope, who has since left to go do his own chip company, Anselm Levskaya, James himself, many others, but those were the sort of formative two to three months at the beginning.
Speaker 1 And they taught me a whole lot of the principles and heuristics that I apply,
Speaker 1 and how to solve problems in the way that they have,
Speaker 1 particularly in that systems and algorithms overlap. Where
Speaker 1 one more thing that makes you quite effective in ML research is really concretely understanding the systems side of things.
Speaker 1 And this is something I learned from them, basically, is a deep understanding of how systems influence algorithms and how algorithms influence systems.
Speaker 1 Because the systems constrain the design space, sorry, the solution space, which you have available to yourself in the algorithm side. And very few people are comfortable fully bridging that gap.
Speaker 1 But
Speaker 1 a place like Google, you can just go and ask all the algorithms experts and all the systems experts everything they know, and they will happily teach you.
Speaker 1 And if you go through and sit down with them,
Speaker 1 they will teach you everything they know, and it's wonderful.
Speaker 1 And this has meant that I've been able to be very, very effective for both sides, like for the pre-training crew, because I understand systems very well.
Speaker 1 I can intuit and understand this will work well or this won't.
Speaker 1 And then flow that on through the inference considerations of models and this kind of thing.
Speaker 1 And for the chip design teams, I'm one of the people they turn to to understand what chips they should be designing in three years because I'm one of the people who's best able to understand and explain the kind of algorithms that we might want to design in three years.
Speaker 1 And obviously, you can't make very good guesses about that, but like, I
Speaker 1 think I convey the information well accumulated from all of my compatriots on the pre-training crew
Speaker 1 and like the general systems design crew and convey that information well to them. Because also, even inference applies a constraint to pre-training.
Speaker 1 And so there's this tree of constraints where if you understand all the pieces of the puzzle, then you get a much better sense for what the solution space might look like.
Speaker 1 There's a couple of things that stick out to me there.
Speaker 1 One is not just the agency of the person who was hired, but the parts of the system that were able to think, wait, that's really interesting. Who is this guy?
Speaker 1 Not from a grad program or anything,
Speaker 1 you know, like currently a McKinsey consultant, just like undergrad,
Speaker 1
but that's interesting. Let's give this a shot, right? So James and whoever else, that's like, that's very notable.
And that's,
Speaker 1 second is, I actually didn't know this part of the story where that was part of an experiment run internally about, can we do this? Can we like bootstrap somebody?
Speaker 1 And like, yeah. And in fact, what's really interesting about that is the third thing you mentioned is
Speaker 1 having somebody who understands all layers of the stack and isn't so stuck on any one approach or any one layer of abstraction is so important. And specifically,
Speaker 1 like what you mentioned about being...
Speaker 1 being bootstrapped immediately by these people might have meant that since you're getting up to speed on everything at the same time rather than spending grad school going deep on like one specific way of doing RL, you actually can take the global view and aren't like totally bought in on one thing.
Speaker 1 So not only can is it something that's possible, but like has greater returns than just hiring somebody at a grad school potentially because this person can just like, I don't know, just like getting GPT-8 and like fine-tuning them on like one year of,
Speaker 1 you know what I mean? So, yeah, that is really good. You come at everything with fresh eyes and you don't come in locked into any particular field.
Speaker 1 Now, what like one caveat to that is that before, like during my self-experimentation and stuff, I was reading everything I could. I was like obsessively reading papers every night.
Speaker 1 And like, actually, funnily enough, I like
Speaker 1 read
Speaker 1 much less widely now that I like my day is occupied by working on things.
Speaker 1 And in some respect, I had like this very broad perspective before where not that many people, even like in a PhD program, you'll focus on a particular area.
Speaker 1 If you just like read all the NLP work and all the computer vision work and like all the robotics work, you like see all these patterns just start to emerge across subfields in a way that, I guess, like foreshadowed some of the work that I would later do.
Speaker 1 That's super interesting. One of the reasons that you've been able to be agentic within Google is like you're pair-programming half the days or most of the days with Sergey Brin, right?
Speaker 1 And so that's really interesting that like
Speaker 1 there's this person who's willing to just push ahead on this LLM stuff and, like, get rid of the local blockers in its path.
Speaker 1 I think an important caveat is, like, it's not every day or anything that I'm pairing, but like when there are particular projects that he's interested in, then like we'll work together on those.
Speaker 1 I'm like, but there's also been times when he's been focused on projects with other people.
Speaker 1 But in general, yes, there's a surprising alpha to being one of the people who actually goes down to the office every day.
Speaker 1 That really shouldn't be, but is, surprisingly impactful.
Speaker 1 And as a result,
Speaker 1 I've benefited a lot from having
Speaker 1 basically being close friends with people in leadership who care and being able to
Speaker 1 really argue convincingly about why we should do X as opposed to Y.
Speaker 1 And having that vector to
Speaker 1 try and like, Google is a big organization.
Speaker 1 Having those vectors helps a little bit.
Speaker 1 But also, it's very important. It's the kind of thing you don't want to ever abuse.
Speaker 1 You want to make the argument through all your
Speaker 1 right channels. And
Speaker 1
only sometimes do you need to. And so this includes leadership like Jeff Dean and so forth.
I mean, it's like, it's notable. I don't know.
I feel like Google is undervalued given that
Speaker 1 Steve Jobs is working on the equivalent of the next product for Apple, like pair programming on it or something.
Speaker 1 I mean, like, I've benefited immensely from like,
Speaker 1 okay, so for example, during the Christmas break,
Speaker 1 I was just going into the office a couple of days during that time.
Speaker 1 Quite a lot of it.
Speaker 1 And
Speaker 1 I don't know if you guys have read that article about Jeff and Sanjay doing the pair programming, but they were there pair programming on stuff.
Speaker 1 And I got to hear about all these cool stories of, like, early Google, where they were talking about crawling under the floorboards and rewiring data centers, and telling me how many bits they were pulling off the instructions of a given compiler.
Speaker 1 And all these crazy little performance optimizations they were doing. They were having the time of their lives.
Speaker 1 And I got to sit there and really experience this
Speaker 1
sense of history in a way that you... you don't expect to get.
You expect to be very far away from all that, I think, maybe in a large organization.
Speaker 1 yeah, that's super cool.
Speaker 1 Trenton, does this map onto any of your experience? I think Sholto's story is more exciting.
Speaker 1 Mine was just very serendipitous in that I got into computational neuroscience, didn't have much business being there.
Speaker 1 My first paper was mapping the cerebellum to the attention operation in transformers. My next ones were looking at like
Speaker 1 you wrote that? It was my first year at grad school.
Speaker 1 So 22. Oh, yeah.
Speaker 1 But yeah, my next work was on sparsity in networks, like inspired by sparsity in the brain, which was when I met Tristan Hume.
Speaker 1 And Anthropic was doing the SoLU, the Softmax Linear Unit work, which was very related in quite a few ways of like, let's make the activation of neurons across a layer really sparse.
Speaker 1 And if we do that, then we can get some interpretability of what the neuron's doing.
Speaker 1 I think we've updated on that approach towards what we're doing now.
Speaker 1
So that started the conversation. I shared drafts of that paper with Tristan.
He was excited about it.
Speaker 1 And that was basically what led me to become Tristan's resident and then convert to full time.
Speaker 1 But during that period, I also moved as a visiting researcher to Berkeley and started working with Bruno Olshausen, both on what's called vector symbolic architectures, one of whose core operations is literally superposition, and on sparse coding.
Speaker 1
also known as dictionary learning, which is literally what we've been doing since. And Bruno Olshausen basically invented sparse coding back in 1997.
And so it was like,
Speaker 1 my research agenda and the interpretability team seemed to just be running in parallel
Speaker 1 with just research taste. And so it made a lot of sense for me to work with the team.
Speaker 1 And it's been a dream since. One thing I've noticed that when people tell stories about their careers or their successes, they ascribe it way more to contingency.
Speaker 1 But when they hear about other people's stories, they're like, of course it wasn't contingent. Do you know what I mean? It's like, if that didn't happen, something else would have happened.
Speaker 1 I've just noticed this with literally everyone I've talked to, and it's interesting that you both think it was especially contingent.
Speaker 1 Whereas, I don't know,
Speaker 1 maybe you're right, but like, it's a sort of interesting pattern in that.
Speaker 1 Yeah, but I mean, like, I literally met Tristan at a conference and like wasn't, didn't have a scheduled meeting or anything, just like joined a little group of people chatting.
Speaker 1
And he happened to be standing there. And I happened to mention what I was working on.
And that led to more conversations.
Speaker 1 And I think I probably would have applied to Anthropic at some point anyways, but I would have waited at least another year.
Speaker 1 Yeah,
Speaker 1 it's still crazy to me that I can like actually contribute to interpretability in a meaningful way. I think there's an important aspect of like shots on goal there, so to speak, right?
Speaker 1 Where like you're even just going to choosing to go to conferences itself is like putting yourself in a position where you're where luck is more likely to happen.
Speaker 1 And like conversely, in my own situation, it was like doing all of this work independently and trying to produce and do interesting things was my own way of like trying to manufacture luck, so to speak.
Speaker 1 And like try and do something meaningful enough that it got noticed. Given that you said you frame this in the context of they were trying to run this experiment of can something
Speaker 1 specifically James and I think our manager Brennan was trying to run this experiment.
Speaker 1
It like worked. Did they do it again? Yeah.
So my like closest collaborator Enrique, he
Speaker 1
crossed from search through to our team. He's also been ridiculously impactful.
He's definitely a stronger engineer than I am.
Speaker 1 And he didn't go to university. What was notable about this, for example, is that James Bradbury is somebody who's,
Speaker 1 well, usually this kind of stuff is like farmed out to recruiters or something like that. Whereas James is somebody whose time is worth, like, hundreds of millions of dollars.
Speaker 1 You know what I mean?
Speaker 1 So
Speaker 1 that thing is very bottlenecked on that kind of person taking the time, almost in an aristocratic tutoring sense, of
Speaker 1 finding people and then getting them up to speed.
Speaker 1 And it seems like if it worked this well, it should be done at scale. It should be the responsibility of key people to,
Speaker 1
you know what I mean, on board and find. I think that is true to many extents.
I'm sure you probably benefited a lot from the key researchers mentoring you deeply.
Speaker 1
And like actively looking on open source repositories or like on forums or whatever for like potential people like this. Yeah.
I mean, James has like Twitter injected into his brain.
Speaker 1 People call his brain violence.
Speaker 1
But yes, and I think this is something which in practice is done. Like people do look out for people that they find interesting and try and find high signal.
In fact, actually,
Speaker 1 I was talking about this with Jeff the other day. And Jeff said that, yeah, he's like, you know,
Speaker 1
one of the most important hires I ever made was off a cold email. And I was like, well, who was that? And he said, Chris Olah.
Ah, yeah.
Speaker 1 Because Chris similarly had had no background in,
Speaker 1 well, like, no formal background in ML, right? And like Google Brain was just getting started and this kind of thing.
Speaker 1 But Jeff saw that signal. And the residency program, which like Brain had, is, I think, also like a,
Speaker 1 it was astonishingly effective at finding good people that didn't have strong ML backgrounds.
Speaker 1 And.
Speaker 1 Yeah.
Speaker 1 One of the other things that I want to emphasize for a potential slice of of the audience that would be relevant to is
Speaker 1 there's this sense that the world is legible and efficient of
Speaker 1 companies have these
Speaker 1 go to jobs.google.com or jobs.whatevercompany.com and you apply and there's the steps and like they will evaluate you efficiently on those steps.
Speaker 1
Whereas, not only from this story, it seems like often that's not the way it happens. In fact, it's good for the world that that's not often how it happens.
Like it is important to look at
Speaker 1 were they able to write an interesting blog, technical blog post about their research or like make interesting contributions.
Speaker 1 Yeah, I want you to like riff on
Speaker 1 for the people who are like just assuming that the other end of the job board is like just like super legible and mechanical, this is not how it works.
Speaker 1 And in fact, like people are looking for this sort of different way, a different kind of person who's agentic and putting stuff out there.
Speaker 1 And I think specifically what people are looking for there is two things. One is agency and like putting yourself out there, uh, and the second is the ability to do world-class something, yeah.
Speaker 1 Um, and two examples that I always like to point to here are um, Andy Jones from Anthropic did an amazing paper, um, on scaling laws as applied to board games.
Speaker 1 It didn't require much resources, it demonstrated incredible engineering skill, it demonstrated incredible understanding of like the most topical problem of the time.
Speaker 1 Um, and he didn't come from a like typical academic background or whatever.
Speaker 1 And as I understand it, basically, like as soon as he came out with that paper, both Anthropic and OpenAI were like, we we would desperately like to hire you.
Speaker 1 There's also someone who works on Anthropic's performance team now, Simon Boehm, who has written, in my mind, the reference for optimizing a CUDA matmul kernel
Speaker 1 on a GPU.
Speaker 1 And that
Speaker 1 demonstrated example of taking some prompt effectively and producing the world-class reference example for it in something that wasn't particularly well done so far is like I think an incredible demonstration of like ability and agency
Speaker 1 that in my mind would be an immediate, like, we would love to interview you slash hire you.
Speaker 1
Yeah. The only thing I can add here is I mean, I still had to go through the whole hiring process and all the standard interviews and this sort of thing.
Yeah, everyone does. Everyone does.
Speaker 1 Doesn't that seem stupid?
Speaker 1
I mean, it's important de-biasing. Yeah, yeah, yeah.
And the bias is what you want, right? Like you want the bias of somebody who's got great taste. And he's like,
Speaker 1 who cares? Your interview process should be able to disambiguate that as well. Yeah.
Speaker 1 I think there are cases where someone seems really great and then it's like, oh, they actually just can't code, this sort of thing, right?
Speaker 1 Like how much you weight these things definitely matters, though. And like, I think
Speaker 1
we take references really seriously. The interviews you can only get so much signal from.
And so it's all these other things that can come into play for whether or not a hire makes sense.
Speaker 1 But you should design your interviews such that
Speaker 1 they test the right things. One man's bias is another man's taste, you know?
Speaker 1 I guess the only thing I would add to this, or maybe to the headstrong context is like there's this line, the system is not your friend.
Speaker 1 And it's not necessarily to say it's actively against you or it's your sworn enemy.
Speaker 1 It's just not looking out for you.
Speaker 1 And so I think that's where a lot of the proactiveness comes in of like, there are no adults in the room or like, and like you have to
Speaker 1 come to some decision for what you want your life to look like and execute on it. And yeah, hopefully you can then update later
Speaker 1 if if you're too headstrong in the wrong way. But I think you almost have to just kind of charge at certain things
Speaker 1 to get much of anything done, not be swept up in the tide of whatever the expectations are.
Speaker 1 There's like one final thing I want to add, which is like we talked a lot about agency and this kind of stuff, but I think actually, surprisingly enough, one of the most important things is just caring an unbelievable amount.
Speaker 1 And when you care an unbelievable amount, you check all the details and you have this understanding of what could have gone wrong. And
Speaker 1 It just matters more than you think, because people end up not caring, or not caring enough. There's this LeBron quote where he talks about how, before he started in the league, he was worried that everyone would be incredibly good, and then he gets there and realizes that actually, once people hit financial stability, they relax a bit, and he's like, oh, this is going to be easy. I don't think that's quite true in AI research, because most people actually care quite deeply. But there's caring about your problem, and there's also caring about the entire stack and everything that goes up and down it, like explicitly going and fixing things that aren't your responsibility to fix because overall it makes the stack better.
Speaker 1 I mean, another part of that I forgot to mention is you were mentioning going in on weekends and on Christmas break and you get to like the only people in the office are Jeff Dean and Sergey Brin or something
Speaker 1 and you just get to pair program with them.
Speaker 1 It's just It's interesting to me the people, I don't want to pick on your company in particular, but like people at any big company, they've gotten there because they've gone through a very selective process that's like they had to compete in high school, they got to compete in college.
Speaker 1 But it almost seems like they get there and then they take it easy. When in fact, this is a time to put the pedal to the metal, go in and pair program with Sergey Brin on the weekends or whatever.
Speaker 1 You know what I mean? I mean, there's pros and cons there, right? I think many people make the decision that the thing that they want to prioritize is like a wonderful life with their family.
Speaker 1 And if they do wonderful work, like let's say they don't work every hour of the day, right? But they do wonderful work in the work, like the hours that they do do. That's incredibly impactful.
Speaker 1 I think this is true for many people at Google. It's like maybe they don't work as many hours as like your typical startup mythologized, right? But the work that they do do is incredibly valuable.
Speaker 1 It's very high leverage because they know the systems and they're experts in their field. And we also need people like that.
Speaker 1 Like our world rests on these huge, like difficult to manage and difficult to fix systems.
Speaker 1 And we need people who are like willing to work on and help and fix and maintain those in frankly a thankless way that isn't as high publicity as all of this AI work that we're doing.
Speaker 1 And I'm like ridiculously grateful that those people do that. And I'm also happy that there are people for whom, like, okay, they find technical fulfillment in their job and doing that well.
Speaker 1 And they're also like, maybe they draw a lot more fulfillment also out of spending a lot of hours with their family.
Speaker 1 And I'm lucky that I'm at a stage in my life where I can go in and work every hour of the week. But
Speaker 1 that's, like,
Speaker 1 I'm not making as many sacrifices to do that.
Speaker 1 Yeah.
Speaker 1 I mean, like, just one example that sticks out in my mind of this sort of like
Speaker 1 the other side says no, and you can still get the yes on the other end.
Speaker 1 Basically, for every single high-profile guest I've gotten so far, I think maybe with one or two exceptions, I've sat down for a week and just come up with a list of sample questions, you know, like tried to come up with really smart questions to send to them.
Speaker 1 And the entire process, I've always thought, like,
Speaker 1 if I just cold email them, it's like a 2% chance they say yes. If I include this list, there's a 10% chance.
Speaker 1 And because otherwise, you know, there's like you go through their inbox, and every 34 seconds, there's an interview for whatever podcast, an interview for whatever podcast.
Speaker 1
And every single time I've done this, they've said yes. Right? Yeah.
You just like
Speaker 1
straight questions. But if you do everything, you'll win.
But you just like, you literally have to dig in the same hole for like 10 minutes.
Speaker 1 Or in that case, like, make a list of sample questions for them to get past their not an idiot list. You know what I mean?
Speaker 1 And just demonstrate how much you care and
Speaker 1 the work you're willing to put in.
Speaker 1 Something that a friend said to me a while back, but I think is stuck is like, it's amazing how quickly you can become world class at something, just because most people aren't trying that hard and like are only working like, I don't know, the actual like 20 hours that they're actually spending on this thing or something.
Speaker 1 And so, yeah, if you just go ham, then like you can, you can get really far pretty fast. And I think I'm lucky I had that experience with the fencing as well.
Speaker 1 Like I had the experience of becoming world class in something and, like, knowing that if you just worked really, really hard and were like.
Speaker 1
For context, by the way, Sholto was one seat away, as he was the next person in line to go to the Olympics for fencing. I was at best like 42nd in the world for fencing.
For foil fencing.
Speaker 1 Mutational load is a thing, man.
Speaker 1 And
Speaker 1 there was one cycle where, yeah, I was like the next highest ranked person in Asia. And if one of the teams had been
Speaker 1 disqualified for doping, as it was occurring in part during that cycle,
Speaker 1 and as occurred for the Australian women's rowing team, who I think went because one of the teams was disqualified, then I would have been the next in line.
Speaker 1 It's interesting when you just find out about people's prior lives, and it's like, oh, you know, this guy was almost an Olympian, this other guy was whatever, you know what I mean?
Speaker 1 Okay, let's talk about interpretability. Yeah.
Speaker 1 I actually want to stay on the brain stuff as a way to get into it for a second.
Speaker 1 We were previously discussing:
Speaker 1 is the brain organized in the way where you have a residual stream that is gradually
Speaker 1 refined with higher level associations over time, or something?
Speaker 1 There's a fixed dimension size in a model.
Speaker 1 If you had to, I don't even know how to ask this question in a sensible way, but what is the d_model of the brain?
Speaker 1 What is it like the embedding size of, or because of feature splitting, is that not a sensible question?
Speaker 1 No, I think it's a sensible question. Well, it is a question that makes sense.
Speaker 1 You could have just not said that.
Speaker 1 No, no, it's just a question. You can touch like actively.
Speaker 1 I'm trying to,
Speaker 1 I don't know how you would begin to kind of be like, okay, well, this part of the brain is like a vector of this dimensionality.
Speaker 1 I mean, maybe for the visual stream, because it's like V1 to V2 to IT, whatever,
Speaker 1
you could just count the number of neurons that are there and be like, that is the dimensionality. But it seems more likely that there are kind of sub-modules and things are divided up.
So,
Speaker 1 yeah, I don't have, and
Speaker 1 I'm not like the world's greatest neuroscientist, right? Like, I did it for a few years. I like studied the cerebellum quite a bit.
Speaker 1 So, I'm sure there are people who could give you a better answer on this.
Speaker 1 Do you think that the way to think about whether it's in the brain or whether it's in these models,
Speaker 1 fundamentally what's happening is like features are added, removed, changed, and the feature is the fundamental unit of what is happening in the model? Like what would have to be true for
Speaker 1 give me a, and this goes back to the earlier thing we were talking about, whether it's just associations all the way down.
Speaker 1 Give me like a counterfactual in the world where this is not true, what is happening instead? Like what is the alternative hypothesis here?
Speaker 1 Yeah, it's it's hard for me to think about because at this point, I just think so much in terms of this feature space.
Speaker 1 I mean,
Speaker 1 at one point, there was like the kind of behavioralist approach towards cognition, where, or it's like you're just
Speaker 1 input-output, but you're not really doing any processing, or it's like everything is embodied and you're just like a dynamical system that's like operating
Speaker 1 along like some predictable equations, but like there's no state in the system, I guess.
Speaker 1 But whenever I've read these sorts of critiques, it's like, well, you're just choosing to not call this thing a state, but you could call any internal component of the model a state.
Speaker 1 Even with the feature discussion,
Speaker 1 defining what a feature is is really hard.
Speaker 1 And so
Speaker 1 the question feels almost too slippery.
Speaker 1 What is a feature?
Speaker 1 A direction in activation space,
Speaker 1 a latent variable that is operating behind the scenes that has causal influence over the system you're observing.
Speaker 1 It's a feature, if you call it a feature. It's tautological.
Speaker 1 I mean, these are all explanations that
Speaker 1 I feel some
Speaker 1
association with. In a very rough, intuitive sense, in a sufficiently sparse and binary vector, a feature is like whether or not something's turned on or off, right? Right.
Like in a very simplistic sense.
Speaker 1 Yeah.
Speaker 1 Which might be, I think, a useful metaphor to understand it by.
Speaker 1 It's like when we talk about features activating, it is in many respects the same way that neuroscientists would talk about like a neuron activating, right?
Speaker 1
If that neuron corresponds to something in particular. Right.
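To make the "direction in activation space" framing concrete, here is a minimal sketch with made-up dimensions and an arbitrary threshold, not anything taken from a real model: a feature's activation is just the projection of an activation vector onto a unit-norm direction, and "firing" is that projection clearing a threshold.

```python
# Minimal sketch: "a feature is a direction in activation space".
# Sizes, vectors, and the threshold are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
d_model = 512                                   # hypothetical residual-stream width

activation = rng.normal(size=d_model)           # one token's activation vector
feature_dir = rng.normal(size=d_model)
feature_dir /= np.linalg.norm(feature_dir)      # unit-norm "feature" direction

# The feature's activation is the projection onto that direction,
# loosely analogous to asking how strongly a neuron fires.
feature_activation = float(activation @ feature_dir)
is_firing = feature_activation > 2.0            # arbitrary threshold, for the analogy only
print(feature_activation, is_firing)
```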
Yes, yeah, yeah. And no, I think that's useful as like, what do we want a feature to be?
Speaker 1 Like, what is the synthetic problem under which a feature exists? But
Speaker 1 even with the Torrance Mono Semanticity work, we talk about what's called feature splitting, which is basically you will find as many features as you give the model the capacity to learn.
Speaker 1 And by model here, I mean the
Speaker 1 up projection that we fit after we trained the original model. And so if you don't give it much capacity, it'll learn a feature for bird.
Speaker 1 But if you give it more capacity, then it will learn like ravens and eagles and sparrows and specific types of birds.
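The "up projection fit after training the original model" is a dictionary-learning setup along the lines of a sparse autoencoder. Here is a minimal sketch, with invented sizes, random stand-in activations, and an arbitrary L1 weight rather than Anthropic's actual implementation; the point is that a wider dictionary has the capacity to split "bird" into ravens, eagles, and sparrows.

```python
# Minimal sparse-autoencoder (dictionary learning) sketch over model activations.
# Everything here is illustrative: sizes, data, and the sparsity penalty are made up.
import torch
import torch.nn as nn

d_model, n_features = 512, 4096   # more features = more capacity = finer feature splitting

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, n_features):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))    # sparse, non-negative feature activations
        x_hat = self.decoder(f)            # reconstruction of the original activation
        return x_hat, f

sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                            # trades reconstruction quality for sparsity

acts = torch.randn(1024, d_model)          # stand-in for activations collected from the model
for _ in range(100):
    x_hat, f = sae(acts)
    loss = ((x_hat - acts) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```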
Speaker 1 Still on the definitions thing,
Speaker 1 I guess naively I think of things like bird versus
Speaker 1 what kind of token is it like a is it like a period at the end of a hyperlink as you were talking about earlier
Speaker 1 versus at the highest level things like love or deception or
Speaker 1 like holding a very complicated proof in your head or something. Is this all features? Because then the definition seems so broad as to almost be not that useful.
Speaker 1 Like, I'd rather say there seem to be some important differences between these things, and if they're all features, then, yeah, I'm not sure what we would mean by that.
Speaker 1 I mean, all of those things are, like, discrete units that have connections to other things that then imbue them with meaning.
Speaker 1 That feels like a specific enough definition that it's useful and not too all-encompassing, but feel free to push back. What would you discover tomorrow that could make you think, oh, this is kind of fundamentally the wrong way to think about what's happening in a model?
Speaker 1 I mean, if the features we were finding weren't predictive, or if they were just representations of the data,
Speaker 1 right, where it's like, oh,
Speaker 1 all you're doing is just clustering your data,
Speaker 1 and there's no higher level associations that are being made, or it's some like phenomenological thing of like,
Speaker 1 you're saying that this feature fires for marriage, but if you activate it really strongly, it doesn't change the outputs of the model in a way that would correspond to it.
Speaker 1 Like, I think these would both be good critiques.
Speaker 1 I guess one more is,
Speaker 1 and we tried to do experiments on MNIST, which is a dataset of digit images, and we didn't look super hard into it.
Speaker 1 And so I'd be interested if other people wanted to take up a deeper investigation.
Speaker 1 But it's plausible that your latent space of representations is dense, and it's a manifold instead of being these discrete points.
Speaker 1 And so you could move across the manifold, but at every point, there would be some meaningful behavior.
Speaker 1 And it's much harder then to label things as features that are discrete.
Speaker 1 In a naive sort of outsider way, the thing that would seem to me to be a way in which this picture could be wrong is if there's not some like
Speaker 1 this thing is turned on, turned off, but it's like a much more global kind of like
Speaker 1 the system is a, I'm going to use really clumsy, like, you know, mentioned-at-a-dinner-party kind of language, but
Speaker 1 is there a good analogy here?
Speaker 1 Yeah, I guess if you think of like something like the laws of physics, it's not like, well, the feature for wetness is turned on, but it's only turned on this much, and then the feature for like,
Speaker 1 you know,
Speaker 1 I guess maybe it's true because like the mass is like a gradient and like,
Speaker 1 you know, like, I don't know, but the polarity or whatever is a gradient as well.
Speaker 1 But there's also a sense in which like there's the laws and the laws are more general and you have to understand like the general bigger picture. You don't get that from just these specific sub
Speaker 1 sub-circuits. But that's where the reasoning circuit itself comes into play, right? Where you're taking these features ideally and trying to compose them into something higher level.
Speaker 1 You might say, okay,
Speaker 1 when I'm using, at least this is my head canon.
Speaker 1 Let's say I'm trying to use the formula, you know, F = ma, right?
Speaker 1 Then I'm presumably at some point I have features which like, denote, okay, like mass, and then that's like helping me retrieve the actual mass of the thing that I'm using, and then like,
Speaker 1 the acceleration and this kind of stuff. But then also,
Speaker 1 maybe there's a higher level feature that does correspond to using that law of physics. Maybe, but the more important part is the composition of components, which helps me retrieve
Speaker 1 relevant pieces of information and then produce, like, maybe some multiplication operator or something like that when necessary. At least that's my, like, head canon.
Speaker 1 What is a compelling explanation to you, especially for very smart models of
Speaker 1 like I understand why it made this output and it was like for a legit reason.
Speaker 1 If it's doing million line pull requests or something, what are you seeing at the end of that request where you're like, yep, that's chill?
Speaker 1 Yeah, so ideally, you apply dictionary learning to the model.
Speaker 1 You've found features.
Speaker 1 Right now, we're actively trying to get the same success for attention heads, in which case we'd have features for both of the core components.
Speaker 1 You can do it for residual stream, MLP, and attention throughout the whole model.
Speaker 1 Hopefully, at that point, you can also identify broader circuits through the model that are like more general reasoning abilities that will activate or not activate.
Speaker 1 But in your case where we're trying to figure out if this like pull request should be approved or not,
Speaker 1 I think you can flag or detect features that correspond to deceptive behavior, malicious behavior, these sorts of things, and see whether or not those have fired.
Speaker 1 That would be like an immediate, you can do more than that, but that would be an immediate.
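A hypothetical sketch of that flagging step, assuming you already have per-token feature activations from a fitted dictionary and a hand-labeled set of safety-relevant features. The indices, labels, and threshold below are all invented for illustration.

```python
# Hypothetical: flag whether any features labeled as safety-relevant fired
# while the model was writing the pull request.
import numpy as np

feature_acts = np.random.rand(2000, 4096)      # [tokens in the PR, dictionary features]
flagged_features = {137: "deception-like", 2048: "malicious-intent-like"}  # made-up labels
threshold = 0.8                                # made-up firing threshold

for idx, label in flagged_features.items():
    hits = np.where(feature_acts[:, idx] > threshold)[0]
    if hits.size:
        print(f"{label} feature {idx} fired on {hits.size} tokens, first at token {hits[0]}")
```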
Speaker 1 But before I trace down on that,
Speaker 1 what does the reasoning circuit look like? Like, what would that look like when you found it? Yeah, so I mean, the induction head is probably one of the simplest things. Let's say reasoning, right?
Speaker 1 Well, I mean, what do you call reasoning, right? Like
Speaker 1 it's a good reason.
Speaker 1
So I guess, context for listeners, the induction head is basically, say you see the line, like, Mr. and Mrs.
Dursley did something, Mr. blank, and you're trying to predict what blank is.
Speaker 1 And the head has learned to look for previous occurrences of the word Mr.
Speaker 1 look at the word that comes after it, and then copy and paste that as the prediction for what should come next, which is a super reasonable thing to do. And there is computation being done there
Speaker 1 to accurately predict the next token.
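Spelled out at the sequence level, that behavior looks something like the toy function below. A real induction head learns this through its attention pattern rather than explicit search; this is just the input-output behavior, on a made-up token list.

```python
# Toy illustration of induction-head behavior: find the previous occurrence of the
# current token and predict whatever followed it last time.
def induction_prediction(tokens):
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards for an earlier match
        if tokens[i] == current:
            return tokens[i + 1]               # copy the token that came after it
    return None

tokens = ["Mr.", "Dursley", "was", "the", "director", ".", "Mr."]
print(induction_prediction(tokens))            # -> "Dursley"
```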
Speaker 1
Mm-hmm. But yeah, that is context-dependent.
That is, yeah,
Speaker 1 but it's not like
Speaker 1 reasoning, you know what I mean?
Speaker 1 But is,
Speaker 1 I guess, going back to the associations all the way down, it's like if you chain together a bunch of these reasoning circuits or
Speaker 1 heads that have different rules for how to relate information.
Speaker 1 But in this sort of like zero-shot case,
Speaker 1 like something is happening where when you pick up a new game and you immediately start understanding how to play it, and it doesn't seem like an induction heads kind of thing.
Speaker 1 Well, I think there would be another circuit for like extracting pixels and turning them into latent representations of the different objects in the game, right?
Speaker 1 And like a circuit that is learning physics. And what would that be? Because the induction head is like a one-layer transformer?
Speaker 1 It's two layers, right? Yeah, yeah.
Speaker 1 So you can like kind of see like what that, like the thing that is a human picks up a new game and understands it.
Speaker 1 How would you think about what that is?
Speaker 1 Presumably it's across multiple layers, but like
Speaker 1 is it, yeah, like what would that physically look like?
Speaker 1 How big would it be maybe?
Speaker 1 I mean, that would just be an empirical question, right? Of like, how big does the model need to be to perform this task?
Speaker 1 But like, I mean, maybe it's useful if I just talk about some other circuits that we've seen. So we've seen like
Speaker 1 the IOI circuit, which is the indirect object identification. And so this is like, if you see, it's like Mary and Jim went to the store, Jim gave the object to blank, right?
Speaker 1 And it would predict Mary, because Mary's appeared before as like the indirect object, or
Speaker 1 it'll infer pronouns, right?
Speaker 1 And
Speaker 1 this circuit even has behavior where like if you ablate it, then like other heads in the model will pick up that behavior.
Speaker 1 We'll even find heads that want to do copying behavior and then other heads will suppress. So it's
Speaker 1 one head's job to just always copy the token that came before, for example, or the token that came five before or whatever. And then it's another head's job to be like, no, do not copy that thing.
Speaker 1 So
Speaker 1
there are lots of different circuits performing, in these cases, pretty basic operations. But when they're chained together, you can get unique behaviors.
And
Speaker 1 is the story of how you find it, with the reasoning thing, like, because you won't be able to understand it, or it'll just be like really complicated,
Speaker 1 you know, it won't be something you can see in like a two-layer transformer. So will you just be like,
Speaker 1 the circuit for deception or whatever, it just,
Speaker 1 this part of the network fired when we at the end identified the thing as being deceptive. This part, and it didn't fire when we didn't identify it as being deceptive.
Speaker 1 Therefore, this must be the deception circuit?
Speaker 1 I think a lot of analysis like that. Like Anthropic has done quite a bit of research before on sycophancy, which is like the model saying what it thinks you want to hear.
Speaker 1 And you can use, like, the political opinion at the end to be able to label which one is bad and which one is good.
Speaker 1 Yeah, so we have tons of instances, and actually, as you make models larger, they do more of this.
Speaker 1 Where the model is clearly,
Speaker 1 it has
Speaker 1 features that model another person's mind.
Speaker 1 And these activate, and like some subset of these
Speaker 1 we're hypothesizing here, but like would be associated with more deceptive behavior.
Speaker 1 Although, like, it's doing that by, I don't know, ChatGPT, I think, is probably modeling me because that's like RLHF
Speaker 1 by design.
Speaker 1
Yeah. So, well, first of all, the thing you mentioned earlier about there is redundancy.
So, then it's like, well, have you caught
Speaker 1 the whole thing that could cause deception? Or is it just one instance of it?
Speaker 1
Yeah. Second of all, are your labels correct? You know, maybe like you, you thought this wasn't deceptive.
It's like still deceptive, especially if it's producing output you can't understand.
Speaker 1
Third, is the thing that's going to be the bad outcome something that's even human-understandable? Like, deception is a concept we can understand. Maybe there's like a.
Yeah, yeah.
Speaker 1
So a lot to unpack here. So I guess a few things.
One, it's fantastic that these models are deterministic. When you sample from them, it's stochastic, right?
Speaker 1 But like I can just keep putting in more inputs and ablate every single part of the model. This is kind of the pitch for computational neuroscientists to come and work on interpretability.
Speaker 1 It's like you have this alien brain and you have access to everything in it and you can just ablate however much of it you want.
Speaker 1 And so I think if you do this carefully enough, you really can start to pin down what are the circuits involved, what are the backup circuits, these sorts of things.
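A rough sketch of that ablation workflow on a toy network, standing in for hooking a specific attention head or MLP neuron in a real transformer; the architecture, input, and choice of unit are arbitrary.

```python
# Sketch of the ablation loop: run the model, zero out one internal component,
# and see how much the output changes.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))  # toy stand-in
x = torch.randn(1, 16)
baseline = model(x)

unit_to_ablate = 7
def ablate_hook(module, inputs, output):
    output = output.clone()
    output[:, unit_to_ablate] = 0.0            # zero-ablate one hidden unit
    return output

handle = model[0].register_forward_hook(ablate_hook)
ablated = model(x)
handle.remove()

# A large change suggests this unit mattered for this input; repeating this over
# many inputs and components is how you start mapping circuits and backup circuits.
print((baseline - ablated).abs().max())
```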
Speaker 1 The kind of cop-out answer here, but it's important to keep in mind, is doing automated interpretability.
Speaker 1 So it's like as our models continue to get more capable, having them assign labels or like run some of these experiments at scale.
Speaker 1 And then with respect to like, if there's superhuman performance, how do you detect it? Which I think was kind of the last part of your question.
Speaker 1 Aside from the cop-out answer,
Speaker 1 if we buy this associations all the way down, you should be able to coarse grain the representations at a certain level such that they then make sense.
Speaker 1 I think it was even in Demis's podcast, he's talking about like if a chess player makes a superhuman move, they should be able to distill it into reasons why they did it.
Speaker 1 Even if the model is not going to tell you what it is,
Speaker 1 you should be able to decompose that complex behavior into simpler circuits or features to really start to make sense of why it did the thing that it did.
Speaker 1 There's a separate question of whether such a representation exists,
Speaker 1 which it seems like there must, or actually, I'm not sure if that's the case. And secondly, whether, using this sparse autoencoder setup, you could find it.
Speaker 1 And in this case, if you don't have labels for it that are adequate to represent it, like you wouldn't find it, right?
Speaker 1 Yes and no.
Speaker 1 So, like, we are actively trying to use dictionary learning now on the sleeper agents work, which we talked about earlier. And it's like, if I just give you a model, can you tell me if there's this trigger in it, and it's going to start doing interesting behavior? And it's an open question whether or not, when it learns that behavior, it's part of a more general circuit that we can pick up on without actually getting activations for it and having it display that behavior, right? Because that would kind of be cheating then.
Speaker 1 or if it's learning some hacky trick over, like that's a separate circuit that you'll only pick up on if you actually have it do that behavior.
Speaker 1 But even in that case, the geometry of features gets really interesting
Speaker 1 because
Speaker 1 fundamentally, each feature
Speaker 1 is in some part of your representation space, and they all exist with respect to each other.
Speaker 1 And so in order to have this new behavior, you need to carve out some subset of the feature space for the new behavior and then push everything else out of the way to make space for it.
Speaker 1 So hypothetically, you can imagine you have your model before you've taught it this bad behavior. You know all the features or have some coarse-grained representation of them.
Speaker 1 You then fine-tune it such that it becomes malicious. And then you can kind of identify this black hole region of feature space where everything else has been shifted away from it.
Speaker 1 And there's this region and you haven't put in an input that causes it to fire. But then you can start searching for what is the input that would cause this part of the space to fire.
Speaker 1 What happens if I activate something in this space? There are a whole bunch of other ways that you can try and attack that problem.
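One crude way to picture the "search for an input that makes this region fire" step: gradient ascent on a toy input to maximize the projection of its activation onto a target direction. The encoder and target direction below are random stand-ins, purely for illustration.

```python
# Sketch: optimize an input so that a chosen region of feature space "fires".
import torch

torch.manual_seed(0)
encoder = torch.nn.Linear(64, 512)                  # stand-in for the model up to some layer
target_dir = torch.nn.functional.normalize(torch.randn(512), dim=0)  # the suspicious region

x = torch.zeros(1, 64, requires_grad=True)          # the "input" we optimize
opt = torch.optim.Adam([x], lr=0.1)
for _ in range(200):
    score = (encoder(x) @ target_dir).mean()        # how strongly the target direction fires
    loss = -score + 0.01 * x.pow(2).mean()          # maximize firing, keep the input small
    opt.zero_grad()
    loss.backward()
    opt.step()

print(float((encoder(x) @ target_dir).mean()))
```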
Speaker 1 This is sort of a tangent, but one interesting idea I heard was: if that space is shared between models, you can imagine trying to find it in an open source model. Like, Gemma, by the way,
Speaker 1 is Google's newly released open source model. They said in the paper it's trained using the same architecture or something like that.
Speaker 1 To be honest, I don't know because I haven't read the Gemma paper.
Speaker 1 Similar methods, something whatever, as Gemini. So to the extent that's true, I don't know,
Speaker 1 how much of the red teaming you do on Gemma is like potentially helping you jailbreak into Gemini. Yeah, this gets into the fun space of like how universal are features across models.
Speaker 1 And our Towards Monosemanticity paper looked at this a bit.
Speaker 1 And we find, I can't give you summary statistics, but like the base64 feature, for example, which we see across a ton of models, this is like, there are actually three of them, but they'll fire for and model base64-encoded text, which is prevalent in like every URL.
Speaker 1 And there are lots of URLs in the training data.
Speaker 1
They have really high cosine similarity across models. So they all learn this feature.
And I mean within a rotation, right? But it's like, yeah, yeah, yeah.
Speaker 1
Like the actual vectors itself. Yeah, yeah.
And I wasn't part of this analysis,
Speaker 1 but yeah,
Speaker 1 it definitely finds the feature and they're like pretty similar to each other across two separate, two models, the same model architecture, but trained with different random seeds.
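The cross-model check being described can be sketched as matching each feature from one run to its most similar feature in the other by cosine similarity of the dictionary directions. The matrices below are random stand-ins for real fitted dictionaries; in the actual analysis, specific features like the base64 ones are the pairs you would expect to match with unusually high similarity.

```python
# Sketch of a feature-universality check across two training runs.
import numpy as np

rng = np.random.default_rng(1)
def unit_rows(m):
    return m / np.linalg.norm(m, axis=1, keepdims=True)

dict_a = unit_rows(rng.normal(size=(2048, 512)))   # dictionary from the seed-A model
dict_b = unit_rows(rng.normal(size=(2048, 512)))   # dictionary from the seed-B model

sims = dict_a @ dict_b.T                           # pairwise cosine similarities
best_match = sims.max(axis=1)                      # best cross-model match per feature
print("mean best-match cosine similarity:", best_match.mean())
```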
Speaker 1
It supports the quanta theory of neural scaling. It's like...
a hypothesis, right? Which is that like all models on like a similar data set will learn the same features in the same order-ish, roughly.
Speaker 1 Like, you learn your n-grams, you learn your induction heads, and you learn like to put full stops after numbered lines and this kind of stuff. Hey, but by the way, okay, so this is another tangent.
Speaker 1 To the extent that that's true, and like I guess there's evidence that that's true, why doesn't curriculum learning work?
Speaker 1 Because if it is the case that you learn certain things first, shouldn't just directly training those things first lead to better results?
Speaker 1
Both Gemini papers mention some like aspects of curriculum learning. Okay, interesting.
I mean, the fact that fine-tuning works is like evidence for curriculum learning, right?
Speaker 1 Because the last things you're training on have a disproportionate impact.
Speaker 1
I wouldn't necessarily say that. There's one mode of thinking in which fine-tuning is specialized.
You've got this latent bundle of capabilities and you're specializing it for the particular
Speaker 1 use case that you have.
Speaker 1 I'm not sure how true it is. I think the David Bau lab kind of paper kind of supports this, right?
Speaker 1
Like you have that ability and you're just getting better at entity recognition, like fine-tuning that circuit instead of other ones. Yeah.
Yeah.
Speaker 1 Sorry, what was the thing we were talking about before? But generally, I do think curriculum learning is really interesting that people should explore more. And it seems very plausible.
Speaker 1 I would really love to see more analysis along the lines of the quanta theory stuff, understanding better what you actually learn at each stage, decomposing that out, and exploring whether or not curricula change that.
Speaker 1 By the way, I just realized, forgot,
Speaker 1 I just got in conversation mode and forgot there's an audience.
Speaker 1 Curriculum learning is when you organize a data set, when you think about a human, how they learn, they don't just see like random wiki text and they just like try to predict it, right?
Speaker 1 They're like, we'll start you off with like
Speaker 1 Lorax or something, and then you'll learn.
Speaker 1 I don't even remember what first grade was like, but you'll learn the things that first graders learn, and then like second graders and so forth.
Speaker 1 And so you would imagine.
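As a toy sketch of that ordering, with a deliberately crude difficulty proxy (length) and made-up data; a real curriculum needs a real notion of difficulty.

```python
# Toy sketch of curriculum learning: present "easy" examples before "hard" ones
# instead of shuffling the data randomly.
import random

corpus = [
    "the cat sat.",
    "see spot run.",
    "dogs run fast in the park.",
    "the quarterly report indicates revenue grew despite macroeconomic headwinds.",
]

random_order = random.sample(corpus, len(corpus))   # the usual shuffled ordering
curriculum_order = sorted(corpus, key=len)          # "first grade" material first

for example in curriculum_order:
    pass  # a hypothetical train_step(model, example) would go here
```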
Speaker 1 We know you never go past first grade.
Speaker 1 Okay, anyways,
Speaker 1 let's get back to, like, the big stuff before I get into, like, a bunch of interp details. The big picture,
Speaker 1 there's two threads I want to explore.
Speaker 1 First is, I guess it makes me a little worried that there's not even an alternative formulation of what could be happening in these models that could invalidate this approach, which feels like, I mean, we do know that we don't understand intelligence, right?
Speaker 1 Like there are definitely unknown unknowns here. So
Speaker 1 like the fact that there's not a null hypothesis, I don't know, I feel like,
Speaker 1
what if we're just wrong and we don't even know the way in which we're wrong, which actually increases the uncertainty? And yeah. Yeah.
Yeah.
Speaker 1 So it's not that there aren't other hypotheses. It's just I have been working on superposition for like a number of years
Speaker 1 and very involved in this effort. And so I'm less sympathetic to, or I'm like, they're wrong,
Speaker 1
to these other approaches, especially because our recent work has been so successful. Yeah, it is.
And like quite high explanatory power.
Speaker 1 Like, there's this beautiful example: in the original scaling laws paper, there's this little bump at a particular point.
Speaker 1 And that apparently corresponds to when the model learns induction heads. The loss sort of goes off track, the model learns induction heads, and it gets back on track,
Speaker 1 which is an incredible piece of retroactive explanatory power. Yeah.
Speaker 1 Before I forget it, though, I do have one thread on feature universality that you might want to have in.
Speaker 1 So there are some really interesting behavioral evolutionary biology experiments on whether humans should learn a real representation of the world or not. You can imagine a world in which we saw all venomous animals as flashing neon pink, a world in which we survive better, and so it would make sense for us to not have a realistic representation of the world.
Speaker 1 And there's some work where they'll simulate little basic agents and see if the representations they learn map to the tools they can use and the inputs they should have. It turns out that if you have these little agents perform more than a certain number of tasks, given these basic tools and objects in the world, then they will learn a ground-truth representation, because there are so many possible use cases for these base objects that you actually want to learn what the object actually is and not some cheap visual heuristic or other thing.
Speaker 1 And we haven't talked at all about Friston's free energy principle or predictive coding or anything else, but to the extent that all living organisms are trying to actively predict what comes next and form a really accurate world model,
Speaker 1 it wouldn't surprise me, or I'm optimistic, that we are learning genuine features about the world that are good for modeling it, and our language models will do the same, especially because we're training them on human data and human text.
Speaker 1 Another dinner party question.
Speaker 1 Should we be less worried about misalignment, and maybe that's not even the right word for what I'm referring to, but just alienness and shoggoth-ness from these models, given that there is feature universality and there are certain ways of thinking and ways of understanding the world that are instrumentally useful to different kinds of intelligences?
Speaker 1 Should we just be less worried about bizarro paperclip maximizers as a result?
Speaker 1 I think that's the, this is kind of why I bring this up as like the optimistic take.
Speaker 1 Predicting the internet is very different from what we're doing though, right? Like the models are way better at predicting next tokens than we are. They're trained on so much garbage.
Speaker 1 They're trained on so many URLs. In the dictionary learning work, we find there are like three separate features for base64 encodings.
Speaker 1 And even that is kind of an alien example that it's probably worth me talking about for a minute.
Speaker 1 One of these base64 features fired for numbers: if it sees base64-encoded numbers, it'll predict more of those.
Another fired for letters. But then there was this third one that we didn't understand.
Speaker 1 And it fired for a very specific subset of base64 strings. And someone on the team who clearly knows way too much about base64 realized that this was the subset that was ASCII-decodable.
Speaker 1 So you could decode it back into the ASCII characters.
Speaker 1 And the fact that the model learned these three different features and it took us a little while to figure out what was going on
Speaker 1 is very shoggoth-esque.
Speaker 1 That
Speaker 1 it has a denser representation of regions that are particularly relevant to predicting the next token. Yeah, because it's so, yeah,
Speaker 1 and it's clearly doing something that humans wouldn't, right? Like, you can even talk to any of the current models in base64 and it will reply in base64. Right.
Speaker 1 And you can then decode it and it works great.
Speaker 1 That particular example, I wonder if that implies that doing interpretability on smarter models will be harder. Because if it requires somebody with esoteric knowledge who just happened to see that base64 has, I don't know, whatever that distinction was, doesn't the same issue apply when you have a million-line pull request?
Speaker 1 It's like, there is no human that's going to be able to decode the two different reasons why there are two different features for this pull request.
Speaker 1 Yeah, you know what I mean? Like,
Speaker 1 Yeah. So you just type a comment, like, small CLs, please.
Speaker 1 Yeah, exactly. No, no, I mean, you could do that, right? This is like, what I was going to say is like one technique here is anomaly detection, right? Yeah.
Speaker 1 And so one beauty of dictionary learning instead of like linear probes is that it's unsupervised.
Speaker 1 You are just trying to learn to span all of the representations that the model has and then interpret them later.
Speaker 1 But if there's a weird feature that suddenly fires for the first time that you haven't seen fire before, that's a red flag.
Speaker 1 You could also coarse-grain it so that it's just a single base64 feature.
Speaker 1 I mean, even the fact that this came up and we could see that it specifically favors these particular outputs and it fires for these particular inputs gets you a lot of the way there.
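To make the anomaly-detection idea concrete, here's a minimal sketch with entirely made-up data standing in for real sparse-autoencoder feature activations: record which features ever fired on an audited dataset, then flag any input where a never-before-seen feature lights up. The function names and thresholds are illustrative, not anyone's actual tooling.

```python
import numpy as np

def fit_feature_baseline(train_feature_acts, threshold=1e-6):
    # Record which dictionary features ever fire on a trusted audit set.
    # train_feature_acts: (n_tokens, n_features) sparse feature activations
    # from a (hypothetical) trained sparse autoencoder.
    return (train_feature_acts > threshold).any(axis=0)

def flag_anomalies(new_feature_acts, seen_before, threshold=1e-6):
    # Flag tokens where a never-before-seen feature fires -- the red flag above.
    fires = new_feature_acts > threshold
    novel = fires & ~seen_before
    return np.where(novel.any(axis=1))[0]

# Toy usage with random data standing in for real activations.
rng = np.random.default_rng(0)
audit = rng.random((200, 512)) * (rng.random((200, 512)) < 0.005)
live = rng.random((10, 512)) * (rng.random((10, 512)) < 0.02)
print(flag_anomalies(live, fit_feature_baseline(audit)))
```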
Speaker 1 I'm even familiar with cases from the auto-interp side where a human will look at a feature and try to annotate it, say, "it fires for Latin words." And then, when you ask the model to classify it, it says it fires for Latin words defining plants.
So it can already beat the human in some cases at labeling what's going on.
Speaker 1 So at scale, this would require an adversarial thing between models, where you have millions of features potentially for GPT-6, and a bunch of models are just trying to figure out what each of these features means. How...
Speaker 1 Yeah, but you can even automate this process, right? I mean, this goes back to the determinism of the model.
Speaker 1 Like you could have a model that is actively editing input text and predicting if the feature is going to fire or not and figure out what makes it fire, what doesn't, and like search the space.
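A rough sketch of that automation loop, under the assumption that you have some way to score a feature's activation on arbitrary text (here a toy regex stands in for a real feature): propose a hypothesis, generate strings that should and shouldn't trigger it, and score how well the hypothesis holds. None of this is Anthropic's actual pipeline.

```python
import re

def test_hypothesis(feature_activation, positives, negatives, threshold=0.5):
    # Score a hypothesis about a feature: strings that match the hypothesis
    # should make it fire, minimally edited strings should not.
    positives, negatives = list(positives), list(negatives)
    hits = sum(feature_activation(p) > threshold for p in positives)
    rejections = sum(feature_activation(n) <= threshold for n in negatives)
    return (hits + rejections) / (len(positives) + len(negatives))

# Toy stand-in for a real feature: "fires" on numeric base64-looking strings.
toy_feature = lambda s: 1.0 if re.fullmatch(r"[0-9+/=]+", s) else 0.0

print(test_hypothesis(
    toy_feature,
    positives=["474747", "123456=="],        # hypothesis: numeric base64
    negatives=["hello world", "abcDEF+/"],   # edits that should not fire
))  # 1.0 -> the hypothesis survives this round of edits
```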
Speaker 1 Yeah. I want to talk more about the feature splitting, because I think that's an interesting thing that has been underexplored, especially for scalability. I think it's underappreciated, right?
Speaker 1 First of all,
Speaker 1 how do we even think about, is it really just
Speaker 1 you can keep going down and down? Like, there's no end to the amount of features?
Speaker 1 I mean, so at some point, I think you might just start fitting noise, or things that are part of the data but that the model isn't actually representing. By the way, do you want to explain what feature splitting is? Yeah, yeah.
So it's the thing from before, where the model will learn however many features it has capacity for that still span the space of representations. So give an example of that, potentially.
Yeah, yeah.
Speaker 1 So you learn, if you don't give the model that much capacity for the features it's learning, concretely, if you project to not as high a dimensional space, it will learn one feature for birds.
Speaker 1 But if you give the model more capacity, it will learn features for all the different types of birds.
Speaker 1 And so it's more specific than otherwise.
Speaker 1 And oftentimes, there's the bird vector that points in one direction, and all the other specific types of birds point in a similar region of the space, but are obviously more specific than the coarse label.
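Here's a toy illustration of the feature-splitting picture, using off-the-shelf dictionary learning on synthetic data rather than a real model: a small dictionary recovers a coarse "bird"-like direction, a larger one recovers finer directions that still sit near it. All the data and sizes are made up.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)

# Made-up ground truth: 8 "bird species" directions sharing a common "bird"
# component, embedded in 64-dimensional activations.
bird = rng.normal(size=64)
species = bird + 0.5 * rng.normal(size=(8, 64))
X = np.vstack([species[rng.integers(8)] + 0.05 * rng.normal(size=64)
               for _ in range(2000)])

# Small dictionary -> coarse features; big dictionary -> split features.
coarse = MiniBatchDictionaryLearning(n_components=2, random_state=0).fit(X)
fine = MiniBatchDictionaryLearning(n_components=16, random_state=0).fit(X)

def unit(v):
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-9)

# Each fine atom should point in roughly the same region of space as a coarse
# atom while being more specific -- the feature-splitting picture.
sims = np.abs(unit(fine.components_) @ unit(coarse.components_).T)
print(sims.max(axis=1).round(2))
```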
Speaker 1 Okay, so let's go back to GPT-7.
Speaker 1 First of all, is this a sort of like linear tax on any model to figure out,
Speaker 1 actually, even before that, is this a one-time thing you had to do, or is this the kind of thing you have to do on every output?
Speaker 1 Or is it like one-time, it's not deceptive, we're good to roll.
Speaker 1 Actually, yeah, let me let you answer that.
Speaker 1 Yeah, so you do dictionary learning after you've trained your model. You feed it a ton of inputs, you get the activations from those, and then you do this projection into the higher-dimensional space. The method is unsupervised in that it's trying to learn these sparse features, you're not telling them in advance what they should be, but it is constrained by the inputs you're giving the model.
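For readers who want the method pinned down, here is a minimal sparse-autoencoder sketch in the spirit of what was just described, trained on random tensors standing in for real model activations. The architecture, sizes, and loss coefficients are illustrative only, not the actual published setup.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Project activations up into a wider feature space, force the features
    # to be sparse, then reconstruct the original activations from them.
    def __init__(self, d_model, expansion=8):
        super().__init__()
        d_feat = d_model * expansion
        self.encoder = nn.Linear(d_model, d_feat)
        self.decoder = nn.Linear(d_feat, d_model, bias=False)

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))   # sparse feature activations
        recon = self.decoder(feats)              # reconstruction of the activations
        return recon, feats

def sae_loss(recon, acts, feats, l1_coeff=1e-3):
    # reconstruction error + L1 penalty that pushes features toward sparsity
    return ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()

# Toy training loop on random "activations" (a real run would cache
# activations from the model being studied).
d_model = 512
sae = SparseAutoencoder(d_model)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for _ in range(100):
    acts = torch.randn(256, d_model)
    recon, feats = sae(acts)
    loss = sae_loss(recon, acts, feats)
    opt.zero_grad()
    loss.backward()
    opt.step()
```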
Speaker 1 I guess two caveats here. One, we can try and choose what inputs we want. So if we're looking for theory-of-mind features that might lead to deception, we can put in the sycophancy dataset.
Speaker 1 Hopefully at some point, we can move into looking at the weights of the model alone, or at least using that information to do dictionary learning.
Speaker 1 But I think in order to get there, that's like such a hard problem that you need to make traction on just learning what the features are first.
Speaker 1 But yeah, so what's the cost of those?
Speaker 1 Can you read the last sentence?
Speaker 1 Weights of the model alone.
Speaker 1 So
Speaker 1
right now we just have these neurons in the model. They don't make any sense.
We apply dictionary learning. We get these features out.
They start to make sense.
Speaker 1 But that depends on the activations of the neurons.
Speaker 1 The weights of the model itself, like what neurons are connected to what other neurons, certainly has information in it.
Speaker 1 And the dream is that we can kind of bootstrap towards actually making sense of the weights of the model that are independent of the activations of the data.
Speaker 1 I mean, this is all, I'm not saying we've made any progress here.
Speaker 1 It's a very hard problem, but it feels like we'll have a lot more traction and be able to sanity check what we're finding with the weights if we're able to pull out features first.
Speaker 1 For the audience, weights are permanent, well, I don't know if permanent is the right word, but like they are the model itself, whereas activations are the sort of like artifacts of any single call.
Speaker 1 In a brain metaphor, you know, the weights are like the actual connection scheme between neurons, and the activations are the current neurons that are lighting up. Yeah, exactly.
Speaker 1 Yeah, okay, so there's going to be two steps to this for GPT-7 or whatever model we're concerned about.
Speaker 1 One,
Speaker 1 actually,
Speaker 1 first, correct me if I'm wrong, but training the sparse autoencoder and doing the unsupervised projection into a wider space of features that have a higher fidelity to what is actually happening in the model.
Speaker 1 And then secondly, label those features.
Speaker 1 Because let's say the cost of training the model is N.
Speaker 1 What will those two steps cost relative to N? We will see.
Speaker 1 It really depends on
Speaker 1 two main things. What is your expansion factor? How much are you projecting into the higher-dimensional space? And how much data do you need to put into the model?
Speaker 1 How many activations do you need to give it?
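As a purely illustrative back-of-envelope (every number below is a guess, not any real model's), the relative cost scales roughly with the expansion factor, the number of activation tokens you collect, and how many layers you analyze:

```python
# Every number here is hypothetical; the point is only the shape of the estimate.
d_model = 8192            # residual-stream width (made up)
n_layers = 80             # (made up)
pretrain_tokens = 1e13    # tokens used to train the base model (made up)
sae_tokens = 1e9          # activation tokens fed to dictionary learning (made up)
expansion = 32            # features per layer = expansion * d_model

# Crude "6 * params * tokens" style estimates.
pretrain_flops = 6 * (12 * n_layers * d_model**2) * pretrain_tokens
sae_params_per_layer = 2 * d_model * (expansion * d_model)   # encoder + decoder
sae_flops = 6 * sae_params_per_layer * sae_tokens * n_layers

print(f"dictionary learning ~ {sae_flops / pretrain_flops:.2%} of pretraining compute")
```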
Speaker 1 But this brings me back to the feature splitting to a certain extent, because if you know you're looking for specific features,
Speaker 1
you can start with a really cheaper, like coarse representation. So, maybe my expansion factor is like only two.
So, like, I have a thousand neurons, I'm projecting to a 2,000-dimensional space.
Speaker 1
I get 2,000 features out, but they're really coarse. And so, previously, I had the example for birds.
Let's move that example to like, I have a biology feature,
Speaker 1 but I really care about if the model has representations for bioweapons and is trying to manufacture them. And so what I actually want is like an anthrax feature.
Speaker 1 And let's say you only see the anthrax feature if, instead of going from a thousand dimensions to 2,000 dimensions, I go to a million dimensions, right?
Speaker 1 And so you can kind of imagine this big tree of semantic concepts where like biology splits into like cells versus like whole body biology, and then further down it splits into all these other things.
Speaker 1 So rather than needing to immediately go from a thousand to a million and then picking out that one feature of interest, you can find the direction that the biology feature is pointing in, which again is very coarse, and then selectively search around that space.
Speaker 1 So you only do the finer-grained dictionary learning if something in the direction of the biology feature fires first. And so
Speaker 1 the computer science metaphor here would be like, instead of doing breadth-first search, you're able to do depth-first search, where you're only recursively expanding and exploring a particular part of this semantic tree of features.
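A sketch of that depth-first idea, assuming you already have activations and a coarse parent direction (both synthetic or precomputed here): only run the finer-grained dictionary learning on the slice of data where the parent feature fires. The helper name and thresholds are hypothetical.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def expand_feature(acts, parent_direction, expansion, fire_threshold=1.0):
    # Only run finer-grained dictionary learning on the activations where the
    # coarse parent feature (say, "biology") fires, instead of training one
    # giant dictionary over everything at once.
    direction = parent_direction / np.linalg.norm(parent_direction)
    subset = acts[acts @ direction > fire_threshold]
    if len(subset) < expansion:          # not enough data to split further
        return None
    child = MiniBatchDictionaryLearning(n_components=expansion, random_state=0)
    child.fit(subset)
    return child.components_             # finer-grained child directions

# Usage idea: recursively call expand_feature on each child direction, which
# gives the depth-first traversal of the semantic tree described above.
```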
Speaker 1 Although, these features are not organized in ways that are intuitive for humans, right?
Speaker 1 Like, because we just don't know how to deal with base64, we just don't dedicate that much, you know, whatever firmware, to deconstructing which kind of base64 it is. How would we know the subtrees? And this will go back to maybe the MoE discussion we'll have.
Speaker 1 I guess we might as well talk about it, but in Mixture of Experts, the Mixtral paper talked about how they couldn't find it, the experts weren't specialized in a way that we could understand.
Speaker 1 There's not like a chemistry expert or a physics expert or something.
Speaker 1 So why would you think that it will be like biology feature and then deconstruct rather than like blah, and then you just deconstruct and it's like anthrax and
Speaker 1 your shoes and whatever. So I haven't read the Mixtral paper, but I think that the heads, I mean, this goes back to, if you just look at the neurons in a model, they're polysemantic.
Speaker 1 And so if all they did was just look at the neurons in a given head, it's very plausible that it's also polysemantic because of superposition.
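For readers who haven't seen a mixture-of-experts layer, here's a minimal top-2 routing sketch; it's not Mixtral's code, just the generic pattern the discussion assumes.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    # Minimal top-2 mixture-of-experts layer: a router picks two experts per
    # token and mixes their outputs. Generic pattern only, not Mixtral's code.
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)])
        self.top_k = top_k

    def forward(self, x):                     # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e         # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

print(TinyMoE()(torch.randn(5, 64)).shape)    # torch.Size([5, 64])
```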
Speaker 1 To tug on the thread that Dwarkesh mentioned there, have you seen in the subtrees, when you expand them out, something in a subtree which you really wouldn't guess should be there based on the higher-level abstraction?
Speaker 1 So this is a line of work that we haven't pursued as much as I want to yet.
Speaker 1
But I think we're planning to, I hope that maybe external groups do as well. Like what is the geometry of features? What's the geometry? Exactly.
And how does that change over time?
Speaker 1 It would really suck if, like, the anthrax feature happened to be below the
Speaker 1
coffee can, like, subtree or something like that. Totally, totally.
And that feels like the kind of thing that you could quickly try and find like proof of, which would then
Speaker 1
mean that you need to then solve that problem. Yeah.
Inject more structure into the geometry. Totally.
I mean, it would really surprise me, I guess, especially given how linear the models seem to be.
Speaker 1 I completely agree. That there isn't some component of the anthrax feature, like vector that is similar to and looks like the biology vector and that they're not in a similar part of the space.
Speaker 1 But yes, I mean, ultimately, machine learning is empirical. We need to do this.
Speaker 1 I think it's going to be pretty important for certain aspects of scaling dictionary learning. Yeah, yeah.
Speaker 1 Interesting.
Speaker 1 On the MoE discussion,
Speaker 1 there's an interesting scaling vision transformers paper that Google put out a little while ago where they do ImageNet classification with an MoE.
Speaker 1
And they find really clear class specialization there for experts. There's a clear dog expert.
But
Speaker 1 the Mixtral people are just not doing a good job of identifying this?
Speaker 1 I think it's hard.
Speaker 1 And it's entirely possible that
Speaker 1 in some respects, there's almost no reason that all of the different archive features should go to one expert.
Speaker 1 You could have biology, let's say, I don't know what buckets they had in their paper, but let's say they had archive papers as one of the things.
Speaker 1 You could imagine biology papers going here, math papers going here, and all of a sudden your breakdown is ruined.
Speaker 1 But that vision transformer one, where the class separation is really clear and obvious, gives, I think, some evidence towards the specialization hypothesis.
Speaker 1 So I think images are also in some ways just easier to interpret than text. Yeah, exactly.
Speaker 1 And so Chris Olah's interpretability work on AlexNet and these other models, like in the original AlexNet paper, they actually split the model across two GPUs just because GPUs were so bad back then, relatively speaking, right?
Speaker 1
Like still great at the time. That was one of the big innovations of the paper.
But
Speaker 1 they find branch specialization, and there's a Distill Pub article on this where colors go to one GPU and like
Speaker 1 Gabor filters and line detectors go to the other.
Speaker 1 And then like all of the other.
Speaker 1 Really? Yeah. Yeah, interesting.
Speaker 1 And then all of the other interpretability work that was done,
Speaker 1 like the floppy ear detector, right? That just was a neuron in the model that you could make sense of. You didn't need to disentangle superposition.
Speaker 1 So, just different data set, different modality.
Speaker 1 I think a wonderful research project to do, if someone is out there listening to this, would be to try and disentangle, like take some of the techniques that Trenton's team has worked on and try and disentangle the neurons in the
Speaker 1
Mixtral model, which is open source. I think that's a fantastic thing to do because it feels intuitively like there should be specialization.
They didn't demonstrate any evidence that there is.
Speaker 1 There's also, in general, a lot of evidence that there should be specialization.
Speaker 1 Go and see if you can find it.
Speaker 1 That's because Anthropic has published most of its stuff on, as I understand it, dense models, basically.
Speaker 1 That is a wonderful research project to try. And given Dwarkesh's success with the Vesuvius Challenge, yeah, we should be pitching more projects, because they will be solved if we do
Speaker 1 that. What I was thinking about after the Vesuvius Challenge was like,
Speaker 1 wait, I knew, like, Nat had told me about it before it dropped because we recorded the episode before it dropped. Um, why didn't he, why did I not even try? Like,
Speaker 1 you know what I mean? Like, I don't know. Like, Luke is obviously very smart, and like,
Speaker 1 yeah, he's an
Speaker 1
amazing kid, but like, you showed that a 21-year-old on some 1070 or whatever he was working on could do this. I don't know, like, I feel like I should have.
So, you know what?
Speaker 1
Before this episode drops, I'm going to try to make an interpretability researcher out of you. No, no, no, I'm not even trying to do research, really.
I don't know.
Speaker 1
It's like, I was honestly thinking back on experience. Like, wait, I shouldn't, like, why did that end fuck? Yeah, your hands dirty.
Yeah.
Speaker 1 Dwarkesh's request for research.
Speaker 1
Oh, I want to hark back to the neuron thing. You said, I think a bunch of your papers have said, there are more features than there are neurons.
And this is just like, wait a second.
Speaker 1
I don't know, a neuron is just: weights go in and a number comes out.
You know what I mean? That's so little information.
Speaker 1 Do you mean there are, like, street names and species and whatever, more of those kinds of things than there are "a number comes out" units in the model? That's right, yeah. But a number coming out is so little information. How is that encoding superposition?
Speaker 1 You're just encoding a ton of features in these high-dimensional vectors. In a brain, is there an analog of this, or however you think about it?
Speaker 1 I don't know how you think about how much superposition there is in the human brain. Yeah, so Bruno Olshausen, who I think of as the leading expert on this, thinks that all the brain regions you don't hear about are doing a ton of computation in superposition.
Speaker 1 So everyone talks about V1 as like having Gabor filters and detecting lines of
Speaker 1 various sorts. And no one talks about V2.
Speaker 1 And I think it's because we just haven't been able to make sense of it. What is V2? It's like the next part of the visual processing stream.
Speaker 1 And it's like, yeah, so I think it's very likely. And fundamentally, superposition seems to emerge when you have high-dimensional data that is sparse.
Speaker 1 And to the extent that you think the real world is that, which I would argue it is, we should expect the brain to also be underparameterized in trying to build a model of the world and also use superposition.
Speaker 1 You can get a good intuition for this, and correct me if this example is wrong, in a 2D plane, right?
Speaker 1 Let's say you have two axes, which represents a two-dimensional feature space here, like two neurons, basically.
Speaker 1 And you can imagine them each turning on to various degrees, right? And so that's like your x-coordinate and your y-coordinate. But you can
Speaker 1 now map this onto a plane, and you can actually represent a lot of different things in different parts of the plane. Oh, okay, so crucially then, superposition is not an artifact of a neuron, it is an artifact of the space that is created. A combinatorial code. Yeah, exactly. Okay, cool. Thanks.
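A tiny numerical illustration of that picture: squeeze five sparse features into a two-neuron space by giving each feature its own direction, and the readouts stay usable because the features rarely co-occur. Everything here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons = 5, 2

# Give each of 5 features its own direction in the 2-d "neuron" space.
angles = np.linspace(0, np.pi, n_features, endpoint=False)
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # (5, 2)

# Sparse feature values: each feature is usually off, occasionally on.
feats = rng.random((1000, n_features)) * (rng.random((1000, n_features)) < 0.1)
acts = feats @ directions            # what the two "neurons" actually carry

# Read each feature back out by dotting with its direction.
readout = acts @ directions.T
active = feats > 0
print(f"mean readout when a feature is on : {readout[active].mean():.2f}")
print(f"mean readout when a feature is off: {readout[~active].mean():.2f}")
```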
Speaker 1 We kind of talked about this, but I think it's just kind of wild that, to the best of our knowledge, the way intelligence works in these models, and then presumably also in brains, is that there's a stream of information going through that has quote-unquote features that are infinitely, or at least
Speaker 1 to a large extent, splittable. And
Speaker 1 you can expand out a tree of what each feature is. And what's really happening is a stream where one feature is getting turned into this other feature, or this other feature is added.
I don't know. That's not something I would ever have just thought is what intelligence is.
You know what I mean? It's like a surprising thing.
Speaker 1 It's not what I would have expected necessarily.
Speaker 1 What did you think it was? I don't know, man.
Speaker 1 I mean, yeah,
Speaker 1 actually, so that's a great segue, because all of this feels like GOFAI. Like, you're using distributed representations, but you have features and you're applying these operations to the features.
Speaker 1 I mean, the whole field of vector symbolic architectures, which is this computational neuroscience thing,
Speaker 1 all you do is you put vectors in superposition,
Speaker 1 which is literally a summation of two high-dimensional vectors,
Speaker 1 and you create some interference, but if it's high-dimensional enough, then you can represent them.
Speaker 1 And you have variable binding, where you connect one by another, and if you're dealing with binary vectors, it's just the XOR operation. So you have A, B, you bind them together.
Speaker 1 And then if you query with A or B again, you get out the other one. And this is basically...
Speaker 1 the key-value pairs from attention. And with these two operations, you have a Turing-complete system: if you have enough nested hierarchy, you can represent any data structure you want, et cetera, et cetera.
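A minimal sketch of those two vector-symbolic operations with random binary hypervectors; the dimensionality, the pairwise bundling rule, and the similarity metric are just illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000                                      # dimensionality of the hypervectors

def rand_vec():   return rng.integers(0, 2, D)  # random binary hypervector
def bind(a, b):   return a ^ b                  # XOR binding: bind(bind(a, b), b) == a
def bundle(a, b): return ((a + b) > 1).astype(int)  # crude superposition of a pair
def sim(a, b):    return 1 - np.mean(a ^ b)     # 1.0 = identical, ~0.5 = unrelated

# Two key-value pairs held in superposition, loosely the attention analogy above.
color, shape = rand_vec(), rand_vec()
red, square = rand_vec(), rand_vec()
memory = bundle(bind(color, red), bind(shape, square))

# Querying with a key unbinds a noisy copy of its value.
print(f"{sim(bind(memory, color), red):.2f}")     # ~0.75: recovers 'red'
print(f"{sim(bind(memory, color), square):.2f}")  # ~0.50: unrelated
```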
Speaker 1 Yeah.
Speaker 1 Okay, let's go back to the superintelligence. So, walk me through GPT-7.
Speaker 1 You've got the sort of depth-first search on its features. Okay.
Speaker 1 GPT-7 has been trained. What happens next?
Speaker 1 Your research has succeeded. GPT-7 has been trained.
Speaker 1 What are we doing now?
Speaker 1 We try and get it to do as much interpretability work and other safety work as possible.
Speaker 1 What has happened such that you're like, cool, let's deploy GPT-7? Oh, geez.
Speaker 1 I mean,
Speaker 1 we have our responsible scaling policy, which has been really exciting to see other labs adopt.
Speaker 1 This is at least from the perspective of your research. Like, Trenton, given your research,
we got the thumbs up on GPT-7 from you. Or actually, we should say, Claude, whatever.
Uh, and then, uh, oh, I like that.
Speaker 1 What is the basis on which you're telling the team? Like, hey, let's go ahead.
Speaker 1 I mean, if it's as capable as GPT-7 implies here, I think we need to make a lot more interpretability progress to be able to
Speaker 1
comfortably give the green light to deploy it. Like, I would be like, definitely not.
I'd be crying.
Speaker 1 Maybe my tears would interfere with the GPUs.
Speaker 1 Guys, Gemini is on TPUs.
Speaker 1 But, like, what,
Speaker 1 given the way your research is progressing, like,
Speaker 1 what does it kind of look like to you? Like, what would, if this succeeded, what would it mean for us to okay GPT-7 based on your methodology?
Speaker 1 I mean, ideally, we can find some compelling deception circuit,
Speaker 1 which lights up when the model knows that it's not telling the full truth to you. Why can't you just train a linear probe like Collin Burns did?
Speaker 1 So the CCS work is not looking good in terms of replicating or like actually finding truth directions.
Speaker 1 And in hindsight, it's like, well, why should it have worked so well?
Speaker 1 But with linear probes, you need to know what you're looking for. And it's a high-dimensional space, so it's really easy to pick up on a direction that's just not the one you want.
Speaker 1
Wait, but don't you also need to label the features here? So you saw the...
Well, you need to label them post-hoc, but it's unsupervised.
Speaker 1 You're just like, give me the features that explain your behavior is the fundamental question, right?
Speaker 1 It's like, the actual setup is we take the activations, we project them to this higher-dimensional space, and then we project them back down again.
Speaker 1 So it's like, reconstruct or do the thing that you were originally doing, but do it in a way that's sparse.
Speaker 1 By the way, for the audience, a linear probe is where you just classify the activations.
Speaker 1 From what I vaguely remember about the paper, if the model is telling a lie, then you just train a classifier on, in the end,
was it a lie, or was it just wrong or something? I don't know. It was like true or false.
Yeah, it's a classifier on the activations.
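A minimal linear-probe sketch on synthetic "activations", to make the contrast concrete: you pick the labels up front and learn a single direction, whereas dictionary learning recovers many directions and you label them afterwards.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 256
truth_direction = rng.normal(size=d)            # made-up "truth" direction

# Synthetic stand-ins for cached model activations on true vs. false statements.
acts_true = rng.normal(size=(500, d)) + truth_direction
acts_false = rng.normal(size=(500, d)) - truth_direction
X = np.vstack([acts_true, acts_false])
y = np.array([1] * 500 + [0] * 500)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy:", probe.score(X, y))
# probe.coef_ is the single supervised direction. Dictionary learning instead
# recovers many directions without committing to labels in advance.
```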
Speaker 1 So, yeah, like right now, what we do for GPT-7,
Speaker 1 ideally, we have some deception circuit that we've identified that appears to be really robust.
Speaker 1 And it's like...
Speaker 1 So you've done the projecting out to the million whatever features or something.
Speaker 1 Is a circuit... Because
Speaker 1 maybe we're using feature and circuit interchangeably when they're not. So
Speaker 1 is there like a deception circuit? So I think there are
Speaker 1 features across layers that create a circuit.
Speaker 1 And hopefully the circuit gives you a lot more specificity and sensitivity than an individual feature.
Speaker 1 And it's like, hopefully we can find a circuit that is really specific to you being deceptive, the model deciding to be deceptive
Speaker 1 in cases that are malicious, right? Like, I'm not interested in a case where it's just doing theory of mind to help you write a better email to your professor.
Speaker 1 And I'm not even interested in cases where the model is necessarily just modeling the fact that deception has occurred. But doesn't all this require you to have labels for all those examples?
Speaker 1 And if you have those labels, then whatever faults the linear probe has, like maybe you've labeled the wrong thing or whatever, wouldn't the same thing apply to the labels you've come up with for the unsupervised features?
Speaker 1 So, in an ideal world, we could just train on the whole data distribution
Speaker 1 and then find the directions that matter.
Speaker 1 To the extent that we need to reluctantly narrow down the subset of data that we're looking over, just for the purposes of scalability,
Speaker 1 we would use data that looks like the data you'd use to fit a linear probe. But again, we're not
Speaker 1 like with a linear probe, you're also just finding one direction. Like we're finding a bunch of directions here.
Speaker 1 And I guess the hope is like you found a bunch of things that light up when it's being deceptive.
Speaker 1
And then you can figure out why some of those things are lighting up in this part of the distribution and that this other part and so forth. Totally.
Yeah.
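One way to make that concrete, sketched with hypothetical helper functions on synthetic feature activations: score individual features against a small labelled deceptive-vs-honest set, then require several of them (possibly from different layers) to co-fire as a crude "circuit" and check whether specificity improves.

```python
import numpy as np

def feature_scores(feature_acts, labels, threshold=0.0):
    # Precision/recall of each dictionary feature as a "deception detector",
    # evaluated on a labelled deceptive-vs-honest set. The features themselves
    # were learned without these labels; the labels are only for evaluation.
    fires = feature_acts > threshold                  # (n_examples, n_features)
    is_deceptive = labels.astype(bool)[:, None]
    tp = (fires & is_deceptive).sum(axis=0)
    precision = tp / np.maximum(fires.sum(axis=0), 1)
    recall = tp / is_deceptive.sum()
    return precision, recall

def circuit_fires(feature_acts, feature_ids, threshold=0.0):
    # A crude "circuit": require several chosen features (possibly from
    # different layers) to co-fire, hoping for better specificity than any
    # single feature gives on its own.
    return (feature_acts[:, feature_ids] > threshold).all(axis=1)
```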
Speaker 1 Do you anticipate you'll be able to understand?
Speaker 1 Like, I don't know, like, the current models you've studied are pretty basic, right? Do you think you'll be able to understand why GPT-7 fires in certain domains, but not in other domains?
Speaker 1 I'm optimistic. I mean, I guess one thing is this is a bad time to answer this question, because we are explicitly investing in the longer term of ASL-4 models, which GPT-7 would be.
Speaker 1
So we split the team where a third is focused on scaling up dictionary learning right now. And that's been great.
I mean, we publicly shared some of our eight-layer results.
Speaker 1 We've scaled up quite a lot past that at this point. But the other two groups, one is trying to identify circuits, and then the other is trying to get the same success for attention heads.
Speaker 1 So we're setting ourselves up and building the tools necessary to really find these circuits in a compelling way. But it's going to take another,
Speaker 1 I don't know, six months before that's like really working well. But I can say that I'm optimistic and we're making a lot of progress.
Speaker 1 What is the highest level feature you found so far?
Speaker 1 Like, base64 or whatever. It's maybe just,
Speaker 1 in The Symbolic Species, the book you recommended, there are indexical
Speaker 1 things where, I forgot what all the labels were, but there are things where you're just like,
Speaker 1 you see a tiger and you're like, run, whatever, you know, just a very sort of behaviorist thing.
Speaker 1 And then there's a higher level at which, when I refer to love, it refers to a movie scene or my girlfriend or whatever. You know what I mean? So it's like the top of the tree.
Speaker 1 Yeah, yeah, yeah, yeah, yeah.
Speaker 1 What is the highest level association or whatever you found? I mean, probably one of the ones that we publicly, well, publicly, one of the ones that we shared in our update.
Speaker 1 So I think there were some related to like love and like
Speaker 1 sudden changes in scene, particularly associated with like wars being declared. There are like a few of them in there in that post if you want to link to it.
Speaker 1 Yeah.
Speaker 1 But even like Bruno Olshausen had a paper back in 2018, 19 where they applied a similar technique to a BERT model and found that as you go to deeper layers of the model, things become more abstract.
Speaker 1 So I remember like in the earlier layers, there'd be a feature that would just fire for the word park.
Speaker 1 But later on, there was a feature that fired for park as like a last name, like Lincoln Park, or like it's like a common Korean last name as well.
Speaker 1 And then there was a separate feature that would fire for parks as like grassy areas.
Speaker 1 So there's other work that points in this direction.
Speaker 1 What do you think we'll learn about human psychology from the interpretability stuff? Oh, gosh.
Speaker 1 Okay, I'll give you a specific example. I think like one of the ways one of your updates put it was
Speaker 1 persona lock-in. You know, you remember Sydney Bing or whatever, it locked into,
Speaker 1 I think, what was actually quite an endearing personality.
Speaker 1
I thought it's so funny. Yeah.
I'm glad it's back in Copilot. Oh, really? Oh, yeah, it's been misbehaving recently.
Speaker 1 Actually,
Speaker 1
this is another sort of thread to explore, but there was a funny one where I think it was like to the New York Times reporter. It was nagging him or something.
And it was like,
Speaker 1
you are nothing. Nobody will ever believe you.
You are insignificant and do whatever. It was like,
Speaker 1 it was like the most gaslighting thing. It was trying to convince him to break up his marriage.
Speaker 1
Okay, actually, so this is an interesting example. I don't even know where I was going with this to begin with.
But whatever. Maybe I got another thread.
But like, the other thread I want to go on is,
Speaker 1 that's, yeah, okay, actually, personas, right? So like,
Speaker 1 is Sydney Bing having this personality a feature, versus another personality it could have gotten locked into? And also, is that fundamentally what humans are like too?
Speaker 1 Where, I don't know, in front of different people, I'm like a different sort of personality or whatever.
Speaker 1
Is that the same kind of thing that's happening to Shad GPT when it gets RLH? I don't know. A whole cluster of questions can answer them and whatever.
Yeah.
Speaker 1 I really want to do more work here. I guess the sleeper agents work is in this direction, of what happens to a model when you fine-tune it, when you RLHF it, these sorts of things.
Speaker 1 I mean, maybe it's trite, but you could just say you conclude that people contain multitudes, right, insomuch as they have lots of different features. There's even this stuff related to the Waluigi effect, of, in order to know what's good or bad, you need to understand both of those concepts. And so we might have to have models that are aware of violence and have been trained on it in order to recognize it. Can you post hoc identify those features and ablate them, in a way where maybe your model is slightly naive, but you know that it's not going to be really evil? Totally, that's in our toolkit, which seems great. Oh really? So with GPT-7,
Speaker 1 I don't know, it pulls the same thing, and then you figure out why, like what were the causally relevant pathways or whatever, you modify them, and then to you it looks like you just changed those.
Speaker 1 But you were mentioning earlier, there's a bunch of redundancy in the model. Yeah, so you need to account for all that, but, but we have a much better microscope into this now than we used to.
Speaker 1 Like sharper tools for making edits.
Speaker 1 And it seems like, at least from my perspective, that seems like one of the
Speaker 1 primary way of,
Speaker 1 to some degree,
Speaker 1 confirming the safety or the reliability of the model, where you can say, okay, we found the circuits that are responsible. We've ablated them.
Speaker 1 Under a battery of tests, we haven't been able to now replicate the behavior which we intended to ablate. And that feels like the sort of way of measuring model safety in future,
Speaker 1
as I would understand. Are you worried? That's why I'm incredibly hopeful about their work.
Because to me, it seems like a much more precise tool than something like RLHF.
Speaker 1 RLHF, you're very prey to the black swan thing. You don't know if it's going to do something wrong in a scenario that you haven't measured.
Speaker 1 Whereas here, at least you have somewhat more confidence that you can completely capture the behavior set or
Speaker 1
the feature set of the model and select labels. Although not necessarily that you've accurately labeled.
Not necessarily,
Speaker 1 but with a far higher degree of confidence than any other approach
Speaker 1 that I've seen. How, I mean, like, what are your unknown unknowns for superhuman models?
Speaker 1 In terms of this kind of thing, I don't know, are the labels that are going to be given things on which we can determine, this thing is cool, this thing is a paperclip maximizer, or whatever?
Speaker 1 I mean, we'll see, right? Like,
Speaker 1 I do, like, the superhuman feature question is a very good one. Like, I think we can attack it.
Speaker 1
But we're going to need to be persistent. And the real hope here is, I think, automated interpretability.
Yeah.
Speaker 1 And even having debate, right?
Speaker 1 You could have the debate set up where two different models are debating what the feature does, and then they can actually go in and make edits and see if it fires or not.
Speaker 1
But it is just this wonderful closed environment that we can iterate on really quickly. That makes me optimistic.
Do you worry about alignment succeeding too hard? So if I think about,
Speaker 1 I would not want
Speaker 1 either companies or governments, whoever ends up in charge of these AI systems to have the level of fine-grained control that if your agenda succeeds, we would have over AIs,
Speaker 1 both for the ickiness of having this level of control over an autonomous mind, and second, just like, I don't fucking trust, I don't fucking trust these guys.
Speaker 1 You know, I don't, I, I, I'm just kind of uncomfortable with like the loyalty features turned up and like, you know what I mean?
Speaker 1 And yeah, like, how much worry do you have about
Speaker 1 having too much control over the AIs? And specifically not you, but whoever ends up in charge of the AI systems, just being able to lock in whatever they want.
Speaker 1 Yeah, I mean, I think it depends on what government exactly has control and like what the moral alignment is there.
Speaker 1 But
Speaker 1 that is like that whole value lock-in argument is in my mind, it's like definitely one of the strongest contributing factors for why I am working on capabilities at the moment, for example, which is that I think the current player set
Speaker 1 actually is extremely well-intentioned.
Speaker 1 And
Speaker 1 I mean, for this kind of problem, I think we need to be extremely open about it.
Speaker 1 And I think directions like publishing the constitution that you expect your model to abide by, and then trying to make sure you RLHF towards that and evaluate that, and have the ability for everyone to offer
Speaker 1
feedback and contribution to that is really important. Sure.
Or alternatively,
Speaker 1 don't deploy when you're not sure, which would also be bad because then we just never catch it. Right.
Speaker 1 Yeah, exactly.
Speaker 1 I mean, paper clips,
Speaker 1 but like, yeah.
Speaker 1 Okay, some rapid fire.
Speaker 1 What is the bus factor for Gemini? I think there are
Speaker 1 a number of people who are really, really critical that if you took them out,
Speaker 1 then
Speaker 1 the performance of the program would be dramatically impacted.
Speaker 1 This is both on modeling, like slash making decisions about what to actually do
Speaker 1 and importantly on infrastructure side of things.
Speaker 1 It's just the stack of complexity builds,
Speaker 1 particularly when somewhere like Google has so much vertical integration.
Speaker 1 When you have people who are experts,
Speaker 1 they become quite important. Yeah, although I think it's an interesting note about the field that people like you can get in and in a year or so, you're making important contributions.
Speaker 1 And
Speaker 1 especially with Anthropic, but many different labs have specialized in hiring total outsiders, physicists, or whatever. And you just get them up to speed and they're making important contributions.
Speaker 1 I don't know, I feel like you couldn't do do this in like a bio lab or something. It's like an interesting note on the state of the field.
Speaker 1 I mean, bus factor doesn't define how long it would take to recover from it.
Speaker 1 And deep learning research is an art. And so you kind of learn how to read the loss curves or set the hyperparameters in ways that empirically seem to work well.
Speaker 1 It's also like organizational things, like creating context.
Speaker 1 I think one of the most important and difficult skills to hire for is creating this bubble of context around you that makes other people around you more effective and know what the right problem to work on.
Speaker 1 And that is a really tough to replicate thing. Yes, yeah, totally.
Speaker 1 Who are you paying attention to now in terms of there's a lot of things coming down the pike of multimodality, long context, maybe agents, extra reliability?
Speaker 1 Who is thinking well about
Speaker 1 what that implies?
Speaker 1 It's a tough question.
Speaker 1 I think a lot of people look internally these days for
Speaker 1 their sources of insight or progress.
Speaker 1 And we all have obviously those sort of research programs and directions that are intended over the next couple of years.
Speaker 1 And I suspect that most people, as far as betting on what the future will look like,
Speaker 1 refer to an internal narrative
Speaker 1 that is difficult to share.
Speaker 1 If it works well, it's probably not being published.
Speaker 1 I mean, that was one of the things
Speaker 1 in the "Will scaling work?" post. I was referring to something you said to me, which is, you know, I miss the undergrad habit of just reading a bunch of papers.
Yeah. Because now nothing worth reading is published.
Speaker 1 And
Speaker 1 the community is progressively getting more on track with what I think are the right and important directions. You're watching it like an agent.
Speaker 1 No, but
Speaker 1 I guess it is tough.
Speaker 1 There used to be this signal from big labs about what would work at scale. And it's currently really hard for academic research to find that signal.
Speaker 1 And I think
Speaker 1 getting
Speaker 1 really good problem taste about what actually matters to work on is really tough.
Speaker 1 Unless you have, again, the feedback signal of what will work at scale and what is currently holding us back from scaling further or understanding our models further.
Speaker 1 This is something where I wish more academic research would go into fields like interp, which are legible from the outside.
Speaker 1 Anthropic deliberately publishes all its research here, and it seems underappreciated,
Speaker 1 in the sense that I don't know why there aren't dozens of academic departments trying to follow Anthropic's lead in the interp research, because it seems like an incredibly impactful problem that doesn't require ridiculous resources and
Speaker 1 has all the flavor of deeply understanding the basic science of what is actually going on in these things.
Speaker 1 So I don't know why people
Speaker 1 focus on pushing model improvements as opposed to pushing understanding improvements in the way that I would have
Speaker 1 typically associated with academic science in some ways.
Speaker 1 Yeah, I do think the tide is changing there for whatever reason. And Neel Nanda has had a ton of success promoting interpretability,
Speaker 1 in a way where Chris Olah hasn't been as active recently in pushing things.
Speaker 1 Maybe because Neel's just doing quite a lot of the work, but I don't know, four or five years ago, he was really pushing and talking at all sorts of places and these sorts of things.
Speaker 1 And people weren't anywhere near as receptive.
Speaker 1 Maybe they've just woken up to the fact that deep learning matters and is clearly useful post-ChatGPT, but yeah, it is kind of striking.
Speaker 1 All right, cool. And okay, I'm trying to think what is a good last question.
Speaker 1 I mean, the one I'm going to think of is like, do you think models enjoy next token prediction?
Speaker 1 Models believe in love.
Speaker 1 We have this sense of things that were rewarded in our ancestral environment.
Speaker 1 There's like this deep sense of fulfillment that
Speaker 1 we think we're supposed to get from them, or often people do, right? Of like community or sugar
Speaker 1 or, you know, whatever we wanted on the African savannah.
Speaker 1 Do you think like in the future, models are trained with RL and everything, a lot of post-training on top or whatever, but they'll like, they're like,
Speaker 1 in the way we just really like ice cream, they'll just be like, ah, just to predict the next token again, you know what I mean?
Speaker 1 Like in the good old days. So there's this ongoing discussion of like, are models sentient or not? And, like, do you thank the model when it helps you? Yeah.
Speaker 1 But I think if you want to thank it, you actually shouldn't say thank you. You should just give it a sequence that's very easy to predict.
Speaker 1 And
Speaker 1 the even funnier part of this is there is some work showing that if you just give it the token "A" over and over again, then eventually the model will just start spewing out all sorts of things that it otherwise
Speaker 1 wouldn't ever say.
Speaker 1 And
Speaker 1 so, yeah, I won't say anything more about that. But you can, yeah, you should just give your model something very easy to predict as a nice little treat.
Speaker 1 This is what hedonium ends up being. We just tile the universe.
Speaker 1 But do we like things which are easy to predict?
Speaker 1 Aren't we constantly in search of the... the like the dose of entropy? Yeah, the bits of entropy, exactly, right? Shouldn't you be giving it things which are just slightly too hard to predict?
Speaker 1 Just out of reach. Yeah, but I wonder, at least from the free energy principle perspective, right? Like, you don't like, you don't want to be surprised.
Speaker 1
And so maybe it's this, like, I don't feel surprised. I feel in control of my environment.
And so now I can go and seek things.
Speaker 1 And I've been predisposed to, like, in the long run, it's better to explore new things right now.
Speaker 1 Like, leave the rock that I've been sheltered under, ultimately leading me to, like, build a house or like some better structure. But
Speaker 1 we don't like surprises. I think most of most people are very upset when like expectation does not meet reality.
Speaker 1 And so babies love watching the same show over and over and over again, right? Yeah, interesting. Yeah, I can see that.
Speaker 1 Oh, I guess they're learning to model it and stuff too.
Speaker 1 Yeah. Yeah.
Speaker 1 Okay. Well, hopefully
Speaker 1 this will be the repeat
Speaker 1
that the AIs learn to love. Okay, cool.
I think that's a great place to wrap.
Speaker 1 And I should also mention that the better part of what I know about AI I've learned from just talking with you guys, you know, we've been good friends for about a year now.
Speaker 1 So yeah, I mean, yeah, I appreciate you guys getting me up to speed here. And
Speaker 1 you asked great questions. It's really fun to hang and chat.
Speaker 1 I've really tried to not talk to you. Yeah, you're getting a lot better at Pickleball.
Speaker 1 I think I'm just thinking somebody out of the same.
Speaker 1 Hey, we're trying to progress the tender. It's going on.
Speaker 1
Awesome. Cool, cool.
Awesome. Thanks.
Speaker 1
Hey, everybody. I hope you all enjoyed that episode.
As always, the most helpful thing you can do is to share the podcast.
Speaker 1
Send it to people you think might enjoy it, put it on Twitter, your group chats, et cetera. It just helps spread the word.
I appreciate you listening. I'll see you next time.
Cheers.