Jeff Dean & Noam Shazeer – 25 years at Google: from PageRank to AGI
This week I welcome on the show two of the most important technologists ever, in any field.
Jeff Dean is Google's Chief Scientist, and through 25 years at the company has worked on basically the most transformative systems in modern computing: from MapReduce and BigTable to TensorFlow, AlphaChip, and Gemini.
Noam Shazeer invented or co-invented all the main architectures and techniques that are used for modern LLMs: from the Transformer itself, to Mixture of Experts, to Mesh TensorFlow, to Gemini and many other things.
We talk about their 25 years at Google, going from PageRank to MapReduce to the Transformer to MoEs to AlphaChip – and maybe soon to ASI.
My favorite part was Jeff's vision for Pathways, Google’s grand plan for a mutually-reinforcing loop of hardware and algorithmic design and for going past autoregression. That culminates in us imagining *all* of Google-the-company, going through one huge MoE model.
And Noam just bites every bullet: 100x world GDP soon; let's get a million automated researchers running in the Google datacenter; living to see the year 3000. Watch on YouTube; listen on Apple Podcasts or Spotify.
Sponsors
Scale partners with major AI labs like Meta, Google DeepMind, and OpenAI. Through Scale's Data Foundry, labs get access to high-quality data to fuel post-training, including advanced reasoning capabilities. If you're an AI researcher or engineer, learn about how Scale's Data Foundry and research lab, SEAL, can help you go beyond the current frontier at scale.com/dwarkesh
Curious how Jane Street teaches their new traders? They use Figgie, a rapid-fire card game that simulates the most exciting parts of markets and trading. It’s become so popular that Jane Street hosts an inter-office Figgie championship every year. Download from the app store or play on your desktop at figgie.com
Meter wants to radically improve the digital world we take for granted. They’re developing a foundation model that automates network management end-to-end. To do this, they just announced a long-term partnership with Microsoft for tens of thousands of GPUs, and they’re recruiting a world class AI research team. To learn more, go to meter.com/dwarkesh
To sponsor a future episode, visit dwarkeshpatel.com/p/advertise
Timestamps
00:00:00 - Intro
00:02:44 - Joining Google in 1999
00:05:36 - Future of Moore's Law
00:10:21 - Future TPUs
00:13:13 - Jeff’s undergrad thesis: parallel backprop
00:15:10 - LLMs in 2007
00:23:07 - “Holy s**t” moments
00:29:46 - AI fulfills Google’s original mission
00:34:19 - Doing Search in-context
00:38:32 - The internal coding model
00:39:49 - What will 2027 models do?
00:46:00 - A new architecture every day?
00:49:21 - Automated chip design and intelligence explosion
00:57:31 - Future of inference scaling
01:03:56 - Already doing multi-datacenter runs
01:22:33 - Debugging at scale
01:26:05 - Fast takeoff and superalignment
01:34:40 - A million evil Jeff Deans
01:38:16 - Fun times at Google
01:41:50 - World compute demand in 2030
01:48:21 - Getting back to modularity
01:59:13 - Keeping a giga-MoE in-memory
02:04:09 - All of Google in one model
02:12:43 - What’s missing from distillation
02:18:03 - Open research, pros and cons
02:24:54 - Going the distance
Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe
Press play and read along
Transcript
Speaker 1 Today I have the honor of chatting with Jeff Dean and Noam Shazeer.
Speaker 1 Jeff is Google's chief scientist, and through his 25 years at the company he has worked on basically the most transformative systems in modern computing: from MapReduce, BigTable, TensorFlow, AlphaChip.
Speaker 1 Genuinely, the list doesn't end; Gemini now. And Noam is the single person most responsible for the current AI revolution.
Speaker 1 He has been the inventor or the co-inventor of all the main architectures and techniques that are used for modern LLMs, from the Transformer itself to mixture of experts to Mesh TensorFlow, to many other things.
Speaker 1
And they are two of the three co-leads of Gemini at Google DeepMind. Awesome.
Thanks so much for coming on.
Speaker 2 Thanks for having us. Super excited to be here.
Speaker 1 Okay, first question.
Speaker 1 Both of you have been in Google for 25 or close to 25 years. At some point early on in the company, you probably understood how everything worked.
Speaker 1 When did that stop being the case? Do you feel like there was a clear moment that happened?
Speaker 3
I mean, I know I joined and like at that point, this was like end of 2000. And they had this thing, everybody gets a mentor.
And, you know, so, you know, I knew nothing.
Speaker 3
I would just ask my mentor everything. And my mentor knew everything.
It turned out my mentor was Jeff.
Speaker 3 And it was not the case that everyone at Google knew everything.
Speaker 3 It was just the case that Jeff knew everything because he had basically written everything.
Speaker 2
You're very kind. I mean, I think as companies grow, you kind of go through these phases.
When I joined, you know, we were 25 people, 26 people, something like that.
Speaker 2 And so you eventually learned everyone's name. And even though we were growing, you kept track of all the people who were joining.
Speaker 2 At some point, then you kind of lose track of everyone's name of the company, but you still know everyone working on software engineering things.
Speaker 2 Then you sort of lose track of
Speaker 2 all the names of people in the software engineering group, but, you know, you at least know all the different projects that everyone's working on.
Speaker 2 And then at some point, the company gets big enough that, you know, you get an email that Project Platypus is launching on Friday and you're like, what the heck is Project Platypus? So I think.
Speaker 3 Usually it's a very good surprise.
Speaker 2 Like you're like, wow, Project Platypus.
Speaker 2 I had no idea we were doing that.
Speaker 3 And it turns out that.
Speaker 2 It is good to keep track of what's going on in the company, even at a very high level, even if you don't know every last detail.
Speaker 2 And it's good to know lots of people throughout the company so that you can go ask someone for more details or figure out who to talk to.
Speaker 2 I think like with one level of indirection, you can usually find the right person in the company if you have a good network of people that you've built up over time.
Speaker 1 How did Google recruit you, by the way?
Speaker 2 I kind of reached out to them, actually.
Speaker 1 And Noam, how did you get recruited? What was that like?
Speaker 3 I actually saw Google at a job fair in like 1999, and I assumed that it was already this huge, huge company that there was no point in joining, because everyone I knew used Google.
Speaker 3
I guess that was because I was a grad student at Berkeley at the time. I guess I've dropped out of grad programs a few times.
Speaker 3 But it turns out that actually it wasn't really that large.
Speaker 3 So it turns out I did not apply in 1999, but just kind of sent them a resume on a whim in 2000, because it was my favorite search engine and I figured I should apply to multiple places for a job.
Speaker 3 But then, yeah, it turned out to be really fun. It looked like a bunch of smart people doing good stuff.
And they had this really nice crayon chart on the wall of the daily number of search queries that somebody had just been maintaining.
Speaker 3 And yeah, it looked very exponential.
Speaker 3 These guys are going to be very successful, and it looks like they have a lot of good problems to work on. So I was like, okay, maybe I'll go work there for a little while and then have enough money to just go work on AI for as long as I want after that.
Speaker 2 Yeah, yeah.
Speaker 1 In a way, you did that, right?
Speaker 2 Yeah, yeah.
Speaker 3 Yeah, it totally worked out exactly according to plan.
Speaker 1 Sorry, you were thinking about AI in 1999?
Speaker 3 Yeah, this was like 2000. Yeah, I remember in
Speaker 3 grad school,
Speaker 3 a friend of mine at the time had
Speaker 3 told me that his
Speaker 3 New Year's resolution for 2000 was to live to see the year 3000, and that he was going to achieve this by inventing AI.
Speaker 3 So I was like, oh, that sounds like a good idea.
Speaker 3 But then, I didn't get the idea at the time that, oh, you could go do it at a big company. But I figured, hey, a bunch of people seem to be making a ton of money at startups. Maybe I'll just make some money and then I'll have enough to live on and just work on AI research for a long time.
Speaker 3 But yeah, it actually turned out that Google was a terrific place to work on AI.
Speaker 2 I mean, one of the things I like about Google is our ambition has always been sort of something that would kind of require pretty advanced AI.
Speaker 2 You know, organizing the world's information and making it universally accessible and useful. Like actually,
Speaker 2 there's a really broad mandate in there. So it's not like the company was going to do this one little thing and stay doing that.
Speaker 2 And also, you could see that what we were doing initially was in that direction, but you could do so much more in that direction.
Speaker 1 How has Moore's Law over the last two, three decades changed the kinds of considerations you have to take on board when you design new systems, when you figure out what projects are feasible?
Speaker 1 What has stayed, you know, like what are still the limitations? What are things you can now do that you obviously couldn't do before?
Speaker 2 I mean, I think of it as actually changing quite a bit in the last couple of decades.
Speaker 2 So like the two decades ago to one decade ago, it was awesome because you just like wait and like 18 months later, you get much faster hardware and you don't have to do anything.
Speaker 2 And then more recently,
Speaker 2 you know, I feel like the general-purpose CPU-based machine scaling has not been as good. The fabrication process improvements are now taking three years instead of every two years.
Speaker 2 The architectural improvements in multi-core processors and so on are
Speaker 2 not giving you the same boost that we were getting
Speaker 2 20 to 10 years ago.
Speaker 2 But I think at the same time, we're seeing
Speaker 2 much more specialized computational devices like machine learning accelerators, TPUs,
Speaker 2 very ML-focused GPUs more recently,
Speaker 2 are making it so that we can actually get really high performance and good efficiency out of the more modern kinds of computations we want to run, that are different than
Speaker 2 a twisty pile of C++ code trying to run Microsoft Office or something.
Speaker 3 I mean, it feels like the algorithms are following the hardware. Basically, what's happened is that at this point, arithmetic is very, very cheap and moving data around is comparatively much more expensive.
Speaker 3 So pretty much all of deep learning has taken off roughly because of that, because you can build it out of matrix multiplications, which are n-cubed operations on n-squared bytes of data communication, basically.
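To make that arithmetic-versus-data-movement point concrete, here is a rough back-of-the-envelope sketch in Python; the matrix size and bytes-per-element are illustrative assumptions, not figures from the conversation:

```python
# Arithmetic intensity of an n x n matrix multiply: ~2*n^3 FLOPs against
# ~3*n^2 elements moved (read A and B, write C), so FLOPs-per-byte grows with n.
n = 8192                                   # hypothetical matrix dimension
bytes_per_elem = 2                         # e.g. bf16

flops = 2 * n**3
bytes_moved = 3 * n**2 * bytes_per_elem

print(f"{flops:.2e} FLOPs, {bytes_moved:.2e} bytes moved")
print(f"arithmetic intensity ~ {flops / bytes_moved:.0f} FLOPs per byte")
```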
Speaker 2 Well, I would say that the pivot to hardware oriented around that was an important transition because before that we had CPUs and GPUs that were not especially well suited for deep learning. And then
Speaker 2 we started to build say TPUs at Google
Speaker 2 that were really just reduced precision linear algebra machines.
Speaker 2 And then once you have that, then you want to
Speaker 3 I think the insight is that it's all about kind of identifying opportunity costs. Like, okay,
Speaker 3 this is something like Larry Page, I think, used to always say, like, our second biggest cost is taxes and our biggest cost is opportunity costs.
Speaker 3 And if he didn't say that, then I've been misquoting him for years.
Speaker 3 But basically, it's like, you know, what
Speaker 3 what is the opportunity that you have that you're missing out on?
Speaker 3 And like in this case, I guess it was that, okay, you've got all of this chip area and you're putting a very small number of arithmetic units on it. Like fill the thing up with arithmetic units.
Speaker 3 You could have orders of magnitude more arithmetic getting done. Now what else has to change? Okay, the algorithms and the data flow and everything else.
Speaker 2 And oh, by the way, the arithmetic can be like really low precision, so then you can squeeze even more multiplier units in.
Speaker 1 Now, I want to follow up on what you said, that the algorithms have been following the hardware.
Speaker 1 If you imagine a counterfactual world where the cost of memory had declined more than arithmetic, or just flip the dynamic you saw,
Speaker 3 So that, okay, data flow is extremely cheap and arithmetic is not.
Speaker 1 What would AI look like today?
Speaker 2 You'd have a lot more lookups into very large memories, I think. Yes.
Speaker 3 Yeah. I mean, I think it might look more like AI looked like 20 years ago, but in the opposite direction.
Speaker 3 I'm not sure.
Speaker 3 I guess I joined Google Brain in 2012.
Speaker 3 I'd left Google for a few years, happened to go back for lunch to visit my wife. And
Speaker 3 we happened to sit down next to Jeff and the early Google Brain team. And I thought, wow, that's a smart group of people.
Speaker 2 You should think about Brain and neural nets, because we're making some pretty good progress there.
Speaker 3 That sounds fun.
Speaker 3 So, okay, so I jumped back in.
Speaker 3 I rejoined, to join Jeff. That was like 2012.
Speaker 3 I seem to join Google every 12 years.
Speaker 3 I rejoined Google in 2012 and 2024.
Speaker 1 What's going to happen in 2036?
Speaker 2 I don't know. I guess
Speaker 3 we shall see.
Speaker 1 What are the trade-offs that you're considering for future versions of the TPU, given how you're thinking about where the algorithms are heading?
Speaker 2 I mean, I think one thing, one general trend is we're getting better at quantizing or having much more reduced precision models
Speaker 2 You know, we started with TPUv1, and we weren't even quite sure we could quantize a model for serving with 8-bit integers. But we had some early evidence that seemed like it might be possible, so we were like, great, let's build the whole chip around that.
Speaker 2 And then over time, I think you've seen people able to use much lower precision for training as well. But the inference precision has also gone down; people are now using INT4 or FP4, which, if you'd said "we're going to use FP4" to a supercomputing floating-point person, they'd be like, what?
Speaker 2 That's crazy. We like 64 bits in our floats.
Speaker 2 Or even below that, some people are quantizing models to two bits or one bit. And I think...
Speaker 2 that's a trend to definitely pay attention to.
Speaker 2 Yeah, just a 0 or a 1.
Speaker 2 And then you have like a sign bit for a group of bits or something.
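As a minimal illustration of the serving-time quantization Jeff is describing, here is a sketch of symmetric per-tensor int8 quantization; this is a generic textbook scheme, not the TPU's actual format:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: map the max magnitude to 127."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs rounding error:", np.max(np.abs(w - dequantize(q, scale))))
```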
Speaker 3 It really has to be a co-design thing because, you know, if
Speaker 3 the algorithm designer doesn't realize that you can get
Speaker 3 greatly improved performance, you know, throughput with the lower precision, of course, the algorithm designer is going to say, of course, I don't want low precision.
Speaker 3
That introduces risk. And then, you know, it adds irritation.
And then
Speaker 3 if you ask the chip designer, okay, what do you want to build? Then they'll ask the person who's writing the algorithms today, who's going to say, no, I don't like quantization. It's irritating.
Speaker 3 So you actually need to basically see the whole picture and figure out, oh, wait a minute, we can increase our throughput-to-cost ratio by a lot by quantizing.
Speaker 2 Then you're like, yes, quantization is irritating, but your model is going to be three times faster. So you're going to have to deal.
Speaker 1 Through your careers, at various times you've worked on things that have an uncanny resemblance to what we're actually using now for generative AI. In 1990, Jeff, your senior thesis was about parallelizing backpropagation. And in 2007, and this is the thing I didn't realize until I was preparing for this episode, you guys trained a two-trillion-token n-gram model for language modeling.
Speaker 1 Just walk me through when you were developing that model,
Speaker 1 was this kind of thing in your head? What did you think you guys were doing at the time?
Speaker 2 Yeah, so
Speaker 2 I mean, let me start with the undergrad thesis. So I kind of got introduced to neural nets in one section of one class on parallel computing that I was taking in my senior year.
Speaker 2
And I needed to do a thesis to graduate, like an honors thesis. And so I approached the professor and I said, oh, it'd be really fun to like do something around neural nets.
So he and I decided
Speaker 2 I would sort of implement a couple of different ways of parallelizing back propagation training for neural nets in 1990.
Speaker 2 And I called them something funny in my thesis, like "pattern partitioning" or something. But really, I implemented model parallelism and data parallelism on a 32-processor hypercube machine.
Speaker 2 In one, you split all the examples into different batches, and
Speaker 2 every CPU has a copy of the model. And in the other one, you kind of pipeline a bunch of examples along to processors that have different
Speaker 2 parts of the model. And
Speaker 2
I compared and contrasted them. And it was interesting.
I was really excited about the abstraction because it felt like neural nets were the right abstraction.
Speaker 2 They could solve tiny toy problems that no other approach could solve at the time.
Speaker 2 And naive me, I thought, oh, with 32 processors we'll be able to train really awesome neural nets.
Speaker 2 But it turned out, you know, we needed about a million times more compute before it really started to work for real problems.
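As a rough sketch of the two parallelization strategies Jeff compared (replicating the model and splitting the data, versus splitting the model itself), here is a toy example; the sizes, the split factors, and the simplified "gradient" are illustrative assumptions, not details from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))               # one toy weight matrix
X = rng.standard_normal((16, 8))              # a batch of 16 examples

# Data parallelism: every worker holds a full copy of W and a slice of the
# batch; per-worker gradients are averaged (the all-reduce step).
def data_parallel_grads(X, workers=4):
    grads = []
    for shard in np.array_split(X, workers):
        h = shard @ W.T                        # forward pass on this shard
        grads.append(h.T @ shard)              # toy gradient, same shape as W
    return sum(grads) / workers

# Model parallelism: W itself is split across workers; each worker computes
# its slice of the output for the whole batch, and the slices are concatenated.
def model_parallel_forward(X, workers=2):
    outputs = [X @ w.T for w in np.array_split(W, workers, axis=0)]
    return np.concatenate(outputs, axis=1)

print(data_parallel_grads(X).shape)            # (4, 8): one averaged gradient
print(model_parallel_forward(X).shape)         # (16, 4): full output assembled
```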
Speaker 2 But then starting, you know, in the, you know, late 2008, 2009, 2010 timeframe, we started to have enough compute
Speaker 2
thanks to Moore's Law to actually make neural nets work for real things. And that was kind of when I sort of re-entered looking at neural nets.
But prior to that, in 2007.
Speaker 2 You can ask this question.
Speaker 1 First of all,
Speaker 1 unlike other artifacts of academia, it's actually really approachable. It's like four pages and you can just read it. And
Speaker 2 there's four pages and then like 30 pages of C code.
Speaker 1 But it's just like a well-produced sort of artifact.
Speaker 1 And then yeah, tell me about how the 2007 paper came together. Oh, yeah.
Speaker 2 So that,
Speaker 2 we had a machine translation research team at Google, led by Franz Och, who had joined Google maybe a year before,
Speaker 2 and a bunch of other people. And every year they competed in a,
Speaker 2 I guess it's a DARPA contest on translating a couple of different languages to English: Chinese to English and Arabic to English, I think.
Speaker 2 And
Speaker 2 the Google team had submitted an entry. And the way this works is you get like, I don't know, 500 sentences on Monday and you have to submit the answer on Friday.
Speaker 2 And so I saw the results of this and we'd won the contest
Speaker 2 and
Speaker 2 by a pretty substantial margin measured in BLEU score, which is a measure of translation quality. And so I reached out to Franz,
Speaker 2
the head of this winning team. I'm like, this is great.
When are we going to launch it? And he's like, oh, well, we can't launch this.
Speaker 2 It's not really very practical because it takes 12 hours to translate a sentence.
Speaker 2 I'm like, well,
Speaker 2 that seems like a long time.
Speaker 2 How could we fix that? So it turned out, you know, they'd not really designed it for high throughput, obviously.
Speaker 2 And so it was doing like 100,000 disk seeks
Speaker 2 in a large language model that they
Speaker 2 sort of computed statistics over. I wouldn't say train, really.
Speaker 2 And,
Speaker 2
you know, for each word that it wanted to translate. So obviously doing 100,000 disk seeks is not going to be super speedy.
But I said, okay, well, let's dive into this.
Speaker 2 And so I spent about two or three months with them
Speaker 2 designing an in-memory, compressed representation of n-gram data. An n-gram is basically statistics for how often every n-word sequence occurs in a large corpus. And in this case, we had like 2 trillion words.
Speaker 2 And most n-gram models of the day were like using 2 grams or maybe 3 grams. But we decided we would use 5 grams.
Speaker 2 So how often every five-word sequence occurs in basically as much of the web as we could process that in that day.
Speaker 2 And then you have a data structure that says, okay,
Speaker 2 I really like this restaurant occurs, you know, 17 times in the web or something.
Speaker 2 And so I built like a data structure that would let you store all those in memory on 200 machines and then have sort of a batched API where you could say, here are the 100,000
Speaker 2 things I need to look up in this round for this word, and it would give you them all back in parallel.
Speaker 2 And that enabled us to go from taking a night to translate a sentence to basically doing something in 100 milliseconds or something.
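A minimal sketch of the kind of sharded, batched n-gram lookup Jeff describes, with Python dictionaries standing in for the 200 serving machines; the shard count, hashing scheme, and API shape are hypothetical, not the actual system:

```python
from collections import Counter

NUM_SHARDS = 4                                  # stand-in for serving machines

def shard_for(ngram):
    return hash(ngram) % NUM_SHARDS

def build_shards(tokens, n=5):
    shards = [Counter() for _ in range(NUM_SHARDS)]
    for i in range(len(tokens) - n + 1):
        ng = tuple(tokens[i:i + n])
        shards[shard_for(ng)][ng] += 1
    return shards

def batched_lookup(shards, queries):
    # In the real system, queries would be grouped per shard and sent as one
    # batched request per machine; here each lookup is just a local dict access.
    return {ng: shards[shard_for(ng)][ng] for ng in queries}

tokens = "i really like this restaurant and i really like this menu".split()
shards = build_shards(tokens)
print(batched_lookup(shards, [tuple("i really like this restaurant".split())]))
```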
Speaker 1 There's this list of Jeff Dean facts, like Chuck Norris facts.
Speaker 1 Like, for example, that for Jeff Dean, NP means "no problemo."
Speaker 1 And one of them, it's funny because now that I hear you say it, it's like, actually, it's kind of true.
Speaker 1 One of them is the speed of light was 35 miles an hour until Jeff Dean decided to optimize it over a weekend.
Speaker 1 Just going from 12 hours to 100 milliseconds or whatever is like, I got to do the orders of magnitude there.
Speaker 2 All of these are very flattering.
Speaker 2 They're pretty funny. They're like an April Fool's joke gone awry by my smelling machine.
Speaker 3 Okay, so
Speaker 1 obviously in retrospect, this idea that you can develop a latent representation of the entire internet through just considering relationships between words is like, yeah, this is large language models.
Speaker 1 This is Gemini. At the time, was it just a translation idea, or did you see that as being the beginning of a different kind of paradigm?
Speaker 2 I think once we built that for translation, the serving of large language models started to be used for other things like completion of, you know, you start to type and it suggests like what completions make sense.
Speaker 2 So it was definitely the start of
Speaker 2
a lot of uses of language models in Google. And, you know, GNOME has worked on a number of other things at Google, like spelling correction systems that use language models.
Yeah.
Speaker 2 Yeah, I think, yeah, but that was like
Speaker 3 2000, 2001.
Speaker 3 And there, I think it was just all in memory
Speaker 3 on one machine. Yeah, I think it was one machine.
Speaker 2 But his spelling correction system he built in 2001 was amazing. Like he sent out this demo link to the whole company.
Speaker 2 And like, I just tried every butchered spelling of every few word query I could get. I like scrumbled Uggs Bundict.
Speaker 2
Oh, I remember that one. Yeah, yeah.
Instead of scrambled eggs benedict. And like it just nailed it every time.
Yeah.
Speaker 3 And I guess that was language modeling.
Speaker 1 Yeah. But at the time when you were developing the systems,
Speaker 1 did you have this sense of,
Speaker 1 look, you make these things more and more sophisticated. You don't consider five words, but if you consider 100 words, a thousand words, then the latent representation is intelligence.
Speaker 1 Or was that, like, basically, when did that insight hit?
Speaker 2 Not really.
Speaker 3 I mean, I don't think I ever felt like, okay, n-gram models are going to sweep the world.
Speaker 3 On the artificial intelligence side, I think at the time a lot of people were excited about Bayesian networks; that seemed exciting. Then definitely seeing those early neural language models, there was both the magic of, okay, this is doing something extremely cool, and also
Speaker 3 it just struck me as the best problem in the world. For one, it is very, very simple to state: give me a probability distribution over the next word.
Speaker 3 Also,
Speaker 3 there's roughly infinite training data out there. There's like the text of the web, you have like trillions of training examples, like, um, you know, of unsupervised data.
Speaker 2 And then self-supervised.
Speaker 2 Yeah, it's nice because you then have the right answer, and then you can train on like all but the current word and try to predict the current word.
Speaker 2 And it's this kind of amazing, you know, ability to just learn from observations of the world.
Speaker 3 And then it's AI complete. If you can do a great job of that, then you can pretty much
Speaker 3 do anything.
Speaker 1 I'm excited to introduce our new sponsor, Meter. They're a networking company that is behind a growing fraction of the world's internet infrastructure.
Speaker 1 Fun fact, about three to four years ago, in the very early days of the podcast, I ran this podcast from a donation from Meter CEO Anil, and I continue to benefit enormously from his advice to this day.
Speaker 1 The modern world runs on networks.
Speaker 1 Progress in fields as diverse as self-driving cars to giant LLM training runs to even broadcasting a podcast like this around the world is bottlenecked on designing and debugging large complex networks.
Speaker 1 Meter wants to give network engineers a 100x multiplier by training a large end-to-end foundation model using time series packet data and support tickets and networking textbooks and all the other proprietary data they have as a result of themselves building every layer of the networking stack in-house.
Speaker 1 Meter just announced a long-term compute partnership with Microsoft for access to tens of thousands of GPUs. They're currently recruiting a world-class AI research team.
Speaker 1
Their goal is to build autonomous networks that radically improve the digital world that we take for granted. To learn more, go to meter.com/dwarkesh.
All right, back to Jeff and Noam.
Speaker 1 There's this interesting discussion in the history of science about whether ideas are just in the air and there's a sort of inevitability to big ideas or whether it's sort of plucked out of some tangential direction.
Speaker 1 In this case, this way in which you're laying it out very logically,
Speaker 1 does that imply, like basically,
Speaker 1 how inevitable does this seem?
Speaker 3 It does feel like it's in the air. There were definitely some,
Speaker 3 there was this Neural Turing Machine. So, yeah, a bunch of ideas around attention, around having these key-value stores that could be useful in neural networks to kind of focus on things.
Speaker 3 So, yeah, I think
Speaker 3 in some sense in the air. And in some sense, you know, you need some group to go do it.
Speaker 2 I mean, I like to think of a lot of ideas as they're kind of partially in the air where there's like a few different maybe separate research ideas that one is kind of squinting at when you're trying to solve a new problem.
Speaker 2 And you kind of draw on those for some inspiration. And then there's like some aspect that is not solved, and you sort of need to figure out how to solve that.
Speaker 2 And then the combination of like some morphing of the things that already exist and some new things lead to some new breakthrough or a new research result that didn't exist before.
Speaker 1 Are there key moments that stand out to you, where you're looking at a research area and you come up with this idea and you have this feeling of, holy shit, I can't believe that worked?
Speaker 2 One thing I remember was, you know, we'd been
Speaker 2 in the early days of the brain team, we were focused on, let's see if we can build some infrastructure that lets us train really, really big neural nets.
Speaker 2
And at that time, we didn't have GPUs in our data centers. We just had CPUs, but we know how to make lots of CPUs work together.
So we built a system that enabled us to train
Speaker 2 pretty large neural nets through both model and data parallelism. So we had a system for unsupervised learning on
Speaker 2 actually 10 million randomly selected YouTube frames.
Speaker 2 And it was kind of a
Speaker 2 spatially local representation. So it would build up unsupervised representations based on trying to reconstruct the thing from the high-level representations.
Speaker 2 And so we got that working and training on 2,000 computers using 16,000 cores.
Speaker 2 And
Speaker 2 after a little while, that model was actually able to build a representation at the highest level where one neuron would get excited by
Speaker 2 images of cats that it had never been told what a cat was, but it sort of had seen enough examples of them in the training data of head-on facial views of cats that that neuron would turn on for that and not for much else.
Speaker 2 And similarly, you'd have other ones for human faces and, you know, backs of pedestrians and this kind of thing.
Speaker 2 And so that was kind of cool because it's sort of from unsupervised learning principles building up these really high-level representations.
Speaker 2 And then we were able to get, you know, very good results on the supervised ImageNet 20,000 category challenge that like advanced the state of the art by like 60% relative improvement, which was quite good at the time.
Speaker 2 So that to me, and that neural net was probably 50x bigger than one that had been trained previously.
Speaker 2
And it got good results. So that sort of said to me, hey, actually scaling up neural nets, well, I thought it would be a good idea, and it seems to be.
So we should keep pushing on that.
Speaker 1 So
Speaker 1 these examples illustrate how these AI systems
Speaker 1 fit into what you were just mentioning, that Google is sort of a company that organizes information fundamentally.
Speaker 1 And then you can, basically, what AI is doing in this context is finding relationships between information, between concepts to help get ideas to you faster, information you want to you faster.
Speaker 1 Now with current AI models, obviously you can use BERT in Google Search and you can ask these questions, and they're obviously still good at information retrieval. But more fundamentally, they can write your entire code base for you and do actual work, which goes beyond just information retrieval.
Speaker 1 So how are you thinking about that? Is Google still an information-retrieval company if you're building an AGI? AGI can do information retrieval, but it can do many other things as well, right?
Speaker 2 I think we're an "organize the world's information" company, and that's broader than information retrieval, right?
Speaker 2 That's maybe organizing and creating new information from, you know, some guidance you give it. Can you help me write a letter to my veterinarian about my dog?
Speaker 2 It's got these symptoms and it'll draft that. Or can you feed in this video? And, you know, can you produce a summary of like what's happening in the video every few minutes?
Speaker 2 And, you know, I think our sort of multimodal capabilities are showing that it's more than just text.
Speaker 2 It's about, you know, understanding the world and all the different kinds of modalities that information exists in, both kind of human ones, but also
Speaker 2 kind of non-human oriented ones like weird LiDAR sensors on autonomous vehicles or, you know, genomic information or health information.
Speaker 2 And then helping, how do you extract and transform that into useful insights for people and make use of that in helping them do all kinds of things they want to do.
Speaker 2 And that's, you know, sometimes it's, I want to be entertained by chatting with a chatbot. Sometimes it's, I want answers to this really complicated question.
Speaker 2 There is no single source to retrieve from.
Speaker 2 You need to pull information from like 100 web pages and figure out what's going on and make an organized, synthesized version of that data, and then deal with multimodal things or coding-related problems.
Speaker 2
I think it's super exciting what these models are capable of and they're improving fast. So I'm excited to see where we go.
I don't know about you, Noam.
Speaker 3 I am also excited to see where we go. And yeah, I think definitely the
Speaker 3 organizing information
Speaker 3 is clearly like a
Speaker 3 trillion-dollar opportunity, but a trillion dollars is not cool anymore. What's cool is a quadrillion dollars.
Speaker 3 I mean, and obviously
Speaker 3 the idea is not to just pile up some giant pile of money, but it's to just create value in the world, you know, and so much more value can be created when
Speaker 3 these
Speaker 3 systems can actually go and do something for you, write your code, or figure out problems that
Speaker 3 you wouldn't have been able to figure out yourself
Speaker 3 and to do that at scale. So, I mean,
Speaker 3 we're going to have to be very, very flexible and dynamic
Speaker 3 as we improve the capabilities of these models.
Speaker 2 Yeah, I guess I'm pretty excited about kind of a lot of fundamental research questions that sort of come about because you see something that we're doing could be substantially improved if we tried this approach or things in this rough direction.
Speaker 2 And maybe that'll work, maybe it won't.
Speaker 2 But I also think there's value in seeing what we could achieve for end users and then how we can work backwards from that to actually build systems that are able to do that. So as one example,
Speaker 2 organizing information, that should mean any information of the world should be usable by anyone, regardless of what language they speak.
Speaker 2 And that I think, you know, we've done some amount of, but it's not nearly the full vision of, you know, no matter what language you speak out of thousands of languages, we can make any piece of content available to you and
Speaker 2 make it usable by you.
Speaker 2 And any video could be watched in any language. I think that would be pretty awesome.
Speaker 2 And we're not quite there yet, but that's definitely things I see on the horizon that should be possible.
Speaker 1 Speaking of different architectures you might try, I know one thing you're working on right now is longer context.
Speaker 1 If you think of Google search as like it's got the entire index of the internet in its context, but it's like sort of very like shallow search.
Speaker 1 And then obviously language models have like limited context right now, but they can like really think, it's like dark magic, like in context learning, right?
Speaker 1 It just like can really think about what it's seeing.
Speaker 1 How do you think about what it would be like to merge something like Google search and something like in context learning?
Speaker 2 Yeah, maybe I'll take a first stab at it. I mean, because I've thought about this for a bit.
Speaker 2 I mean, I think one of the things you see with these models is they're quite good, but they do hallucinate and have factuality issues sometimes.
Speaker 2 And part of that is you've trained on, say, tens of trillions of tokens and you've stirred all that together in your tens or hundreds of billions of parameters.
Speaker 2 But it's all a bit squishy because you've like
Speaker 2 churned all these tokens together. And so the model has like a reasonably clear view of that data, but it sometimes like gets confused and will give the wrong date for something.
Speaker 2 Whereas information in the context window, in the input of the model, is really sharp and clear, because we have this really nice attention mechanism in transformers. The model can pay attention to things, and it knows the exact text or the exact frames of the video or audio or whatever that it's processing.
Speaker 2 And so right now we have models that can deal with kind of millions of tokens of context, which is quite a lot. It's like hundreds of pages of a PDF or
Speaker 2 50 research papers or hours of video or tens of hours of audio or some combination of those things, which is pretty cool.
Speaker 2 But it would be really nice if the model could attend to trillions of tokens, right? Could it attend to the entire internet and find the right stuff for you?
Speaker 2 Could it attend to all your personal information for you?
Speaker 2 I would love a model that has access to all my emails and all my documents and all my photos.
Speaker 2 And when I ask it to do something, it can sort of make use of that with my permission to sort of help solve what it is I'm wanting it to do.
Speaker 2 But that's going to be a big computational challenge because the naive attention algorithm is quadratic.
Speaker 2 And you can kind of barely make it work on a fair bit of hardware for millions of tokens, but there's no hope of making that just naively go to trillions of tokens.
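A quick back-of-the-envelope on why naive attention cannot simply be stretched that far: the pairwise score matrix grows with the square of the context length.

```python
# Naive self-attention compares every token with every other token, so the
# score matrix alone has n^2 entries per head per layer.
for n in (1_000_000, 1_000_000_000_000):        # 1M vs 1T tokens of context
    print(f"{n:>16,} tokens -> {float(n) ** 2:.1e} attention pairs")
# Going from a million to a trillion tokens multiplies that work by 10^12,
# which is why approximations (retrieval, sparsity, hierarchy) are needed.
```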
Speaker 2 So we need a whole bunch of interesting algorithmic approximations to what you would really want: a way for the model to attend, conceptually, to lots and lots more tokens, to trillions of tokens. Maybe we can put all of the Google code base in context for every Google developer, all the world's source code in context for any open-source developer. That would be amazing. It would be incredible.
Speaker 3 Yeah. I mean, right, the beautiful thing about model parameters is they are
Speaker 3 quite memory-efficient at sort of memorizing facts. You can probably memorize on the order of one fact or so per model parameter, whereas
Speaker 3 if you have some token in context, there are like lots of keys and values at every layer. It
Speaker 3 could be a kilobyte, a megabyte of
Speaker 3 memory per token.
Speaker 2 You take a word and you blow it up to 10 kilobytes or something. Yes, yes.
Speaker 3 Yeah. So, I mean, so there are some,
Speaker 3 there's actually a lot of innovation going on around, okay, A, how do you minimize that? And B, okay,
Speaker 3 what words do you need to have there? Are there better ways of accessing
Speaker 3 bits of that information? And, you know, Jeff seems like the right person to figure this out, like, okay,
Speaker 3 what does our memory hierarchy look like, you know, from SRAM all the way up to the data-center and worldwide level?
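A rough sketch of where Noam's kilobytes-to-megabytes-per-token figure comes from; the layer count, head count, head dimension, and precision below are hypothetical round numbers, not Gemini's actual configuration:

```python
# Each token in context stores a key and a value vector per KV head per layer.
num_layers = 64            # hypothetical
num_kv_heads = 16          # hypothetical
head_dim = 128             # hypothetical
bytes_per_value = 2        # bf16

kv_bytes_per_token = num_layers * num_kv_heads * head_dim * 2 * bytes_per_value
print(f"{kv_bytes_per_token / 1024:.0f} KiB of KV cache per token")   # 512 KiB

# At a million tokens of context, the KV cache alone is roughly:
print(f"{kv_bytes_per_token * 1_000_000 / 2**30:.0f} GiB")            # ~488 GiB
```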
Speaker 1 I want to talk more about the thing you mentioned about, look, you know, Google is a company with like lots of code and lots of examples, right?
Speaker 1 If you just think about that one use case and what that implies, so you've got like the Google monorepo,
Speaker 1 and if you figure out the long-context thing, maybe you could put the whole thing in context, or you fine-tune on it.
Speaker 2 Yeah, basically, like,
Speaker 1 why hasn't this already been done? And, you know, because you can imagine,
Speaker 1 the amount of code that Google has proprietary access to,
Speaker 1 even if you're just using it internally for it to make your developers more efficient and productive.
Speaker 2 Oh, to be clear, we have actually already done further training on a Gemini model on our internal code base for our internal developers.
Speaker 2 But that's different than attending to all of it
Speaker 2 because it sort of stirs together the code base into a bunch of parameters.
Speaker 2 And I think having it in context
Speaker 2 makes things clearer. But even the sort of further trained model internally is incredibly useful.
Speaker 2 Like Sundar, I think, has said that 25% of the characters that we're checking into our code base these days are generated by our AI-based coding models, with human review.
Speaker 1 So in the next year or two, based on the capabilities you see around the horizon and your own personal work, what will it be like to be a researcher at Google? You have a new idea or something,
Speaker 1 and you're interacting with these models in your work. What does that look like?
Speaker 3 Well, I mean, I assume
Speaker 3 we will have these models a lot better and
Speaker 3 hopefully be able to be much, much more productive.
Speaker 2 Yeah, I mean, I think one, one of the,
Speaker 2 in addition to kind of researchy context, like anytime you're seeing these models used, I think they're able to make software developers more productive because they can kind of take sort of a high-level
Speaker 2 spec or sentence-level description of what you want done, and give a pretty reasonable first cut at that.
Speaker 2 And so from a research perspective, maybe you can say, I'd really like you to explore, you know, this kind of idea, like similar to the one in this paper, but maybe like, let's try making it convolutional or something.
Speaker 2 Like that, if you could do that and have the system automatically sort of generate a bunch of experimental code and maybe you look at it and you're like, yeah, that looks good, run that.
Speaker 2 Like that seems like a nice dream direction to go in, and it seems plausible in the next year or two that you might make a lot of progress on that.
Speaker 1 And it seems underhyped, because you could have literally millions of extra employees, and you can immediately check their output. The employees can check each other's output, and they immediately stream tokens.
Speaker 2 I didn't mean to underhype it. I think it's super exciting.
Speaker 2 I just don't like to hype things that aren't done yet.
Speaker 1 Yeah, so I do want to play with this idea more, because it seems like a big deal if you have something like an autonomous software engineer, especially from the perspective of a researcher who's like, I want to spec out and build this system.
Speaker 1 Again, okay, so let's just play with this idea. Like
Speaker 1 As somebody who has worked on developing transformative systems through your careers, the idea that instead of having to code something like whatever today's equivalent of MapReduce or TensorFlow is, you just say, here's how I want the distributed AI library to look, write it up for me.
Speaker 1 Do you imagine you could be like 10x more productive, 100x more productive?
Speaker 2 I was pretty impressed. I think it was on Reddit that I saw, like, we have a new experimental coding model that's much better at coding and math and so on.
Speaker 2 And someone external tried it, and they basically prompted it and said,
Speaker 2 I'd like you to implement a SQL processing database system with
Speaker 2 no external dependencies. And please do that in C.
Speaker 2 And from what the person said, it actually did a quite good job. Like it generated a SQL parser and a tokenizer and
Speaker 2 a query planning system and some storage format for the data on disk and actually was able to handle simple queries. So
Speaker 2 from that prompt, which is like a paragraph of text or something, to get
Speaker 2 even
Speaker 2 an initial cut at that seems like a big boost in productivity for software developers.
Speaker 2 And I think you might end up with
Speaker 2 other kinds of systems that maybe don't try to do that in a single, semi-interactive, respond-in-40-seconds kind of way, but might go off for 10 minutes and might interrupt you after five minutes, saying,
Speaker 2 I've done a lot of this, but now I need to
Speaker 2 get some input. You know, do you care about handling video or just images or something?
Speaker 2 And that seems like you'll need ways of managing the workflow if you have a lot of these kind of
Speaker 2 background activities happening. Yeah.
Speaker 1 Actually, can you talk more about that? So what interface do you imagine we might need if we have,
Speaker 1 if you could literally have like millions of employees you could spin up, hundreds of thousands of employees you could spin up on command who are able to type incredibly fast and who
Speaker 1 So it's almost like you go from a 1930s-style trading floor with paper tickets to a modern, you know, Jane Street setup or something. You need some interface to keep track of all this that's going on, for the AIs to integrate into this big monorepo and leverage their own strengths, and for humans to keep track of what's happening. Basically, what is it like to be Jeff or Noam in three years, working day to day?
Speaker 3 It might be kind of similar to what we have now, because we already have parallelization as a major issue. You know, we have lots and lots of really, really brilliant machine learning researchers, and we want them to all work together and
Speaker 3 build AI.
Speaker 3 So actually the parallelization among people might be similar to parallelization
Speaker 3 among machines.
Speaker 3 But I think it should definitely be good for things that require a lot of exploration, you know, like coming up with the next breakthrough.
Speaker 3 Because if you have a brilliant idea that you're just certain is going to work, in the ML domain it has a 2% chance of working, if you're brilliant. And
Speaker 3 mostly these things fail, but if you try 100 things or 1,000 things or a million things, then you might hit on something
Speaker 3 amazing.
Speaker 3 And we have plenty of compute. A modern top lab these days probably has a million times as much compute as it took to train the original Transformer.
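Noam's 2% point compounds quickly when a more automated search can run many independent tries; a quick illustration (the 2% figure is his, the trial counts are arbitrary):

```python
# If each independent idea has a 2% chance of working, the probability that
# at least one of N tries succeeds is 1 - 0.98**N.
p = 0.02
for n in (1, 100, 1_000):
    print(f"{n:>5} tries -> P(at least one success) = {1 - (1 - p) ** n:.3f}")
# 1 try: 0.020, 100 tries: ~0.867, 1,000 tries: ~1.000
```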
Speaker 1 So, uh, yeah, actually, so that's a really interesting idea.
Speaker 1 If you have, um, like, suppose in the world today, there's like on the order of 10,000 AI researchers, and this community coming up with a breakthrough every year.
Speaker 2 Probably more than that. There were 15,000 at NeurIPS.
Speaker 2 100,000? I don't know.
Speaker 1 Yeah, maybe.
Speaker 2 Sorry. No, no, no.
Speaker 2 It's good to have the correct order of magnitude.
Speaker 1 And the odds that this community every year comes up with a breakthrough on the scale of a transformer is, let's say, 10%.
Speaker 1 Now, suppose this community is a thousand times bigger, and it is, in some sense, like this sort of parallel search of better architectures, better techniques.
Speaker 1 Do we just get like transformer size breakthroughs every year or every day?
Speaker 3 Maybe.
Speaker 3 Sounds potentially good.
Speaker 1 But does that feel like what ML research is like? It's just if you have
Speaker 1 if you are able to try all these experiments.
Speaker 3 It's a good question, because, you know, folks haven't been doing that as much.
Speaker 3 I mean, we definitely have lots of great ideas coming along. Everyone seems to want to run their experiment at maximum scale, but I think that's, you know, that's a human problem.
Speaker 2 Yeah.
Speaker 2
Yeah. It's very helpful to have a one-one-thousandth-scale problem and then vet like a hundred thousand ideas on that, and then scale up the ones that seem promising.
Yeah.
Speaker 1 A quick word from our sponsor, Scale AI. Publicly available data is running out.
Speaker 1 So major labs like Meta and Google DeepMind and OpenAI all partner with Scale to push the boundaries of what's possible.
Speaker 1 Through Scale's Data Foundry, major labs get access to high quality data to fuel post-training, including advanced reasoning capabilities.
Speaker 1 As AI races forward, we must also strengthen human sovereignty.
Speaker 1 Scale's research team, SEAL, provides practical AI safety frameworks, evaluates frontier AI system safety via public leaderboards, and creates foundations for integrating advanced AI into society.
Speaker 1 Most recently, in collaboration with the Center for AI Safety, Scale published Humanity's Last Exam, a groundbreaking new AI benchmark for evaluating AI systems' expert-level knowledge and reasoning across a wide range of fields.
Speaker 1 If you're an AI researcher or engineer and you want to learn more about how Scale's Data Foundry and research team can help you go beyond the current frontier of capabilities, go to scale.com/dwarkesh.
Speaker 1 All right, back to Jeff and Noam. So I think one thing the world might not be taking seriously,
Speaker 1 people are aware that it's exponentially harder to scale, that making a model that's 100x bigger takes 100x more compute, right?
Speaker 1 So people are aware that it's an exponentially harder problem to go from Gemini 2 to 3 and so forth. But maybe people aren't aware of this
Speaker 1 other trend where Gemini 3 is coming up with all these different architectural ideas and trying them out, and you see what works, and you're constantly coming up with these algorithmic progress that makes training the next one easier and easier.
Speaker 1 How far could you take that feedback loop?
Speaker 2 I mean, I think one thing people should be aware of is the improvements from generation to generation of these models often are partially driven by hardware and larger scale, but equally and perhaps even more so driven by major algorithmic improvements and major changes in the model architecture and the training data mix and so on that really make the model better per
Speaker 2 flop that is applied to the model. So I think that's a good realization.
Speaker 2 And then I think if we have automated exploration of ideas, we'll be able to vet a lot more ideas and bring them into kind of the actual
Speaker 2 production training for next generations of these models.
Speaker 2 And that's going to be really helpful because that's sort of what we're currently doing with a lot of machine learning research, brilliant machine learning researchers is looking at lots of ideas, winnowing ones that seem to work well at small scale, seeing if they work well at medium scale, bringing them into larger scale experiments, and then like settling on like adding a whole bunch of new and interesting things to the final model recipe.
Speaker 2 And then I think if we can do that, you know, 100 times faster through
Speaker 2 those machine learning researchers just gently steering a more automated search process rather than sort of hand babysitting lots of experiments themselves, that's going to be really, really good.
Speaker 3 Yeah. The one thing it doesn't speed up is experiments at the largest scale, because you still end up doing these n-equals-one experiments, and there you really just try to put a bunch of really brilliant people in a room and have them stare at the thing and figure out why this is working and why this is not working.
Speaker 2 More hardware is a good solution, and better hardware.
Speaker 3 Yes, we're counting on you.
Speaker 1 So, okay, naively, I would say there's this software, this algorithmic-side improvement that future AIs can make. There's also
Speaker 1 the stuff you're working on with AlphaChip, and I'll let you describe it, but if you get into a situation where, just from a software level, you can be making better and better chips in a matter of weeks and months,
Speaker 1 and better AIs can presumably do that better.
Speaker 1 Basically, I'm wondering, how does this feedback loop not just end up in like
Speaker 1 Gemini 3 takes two years, then Gemini 4, or the equivalent-level jump, takes six months, then the next level takes three months, then one month, and you get to superhuman intelligence much more rapidly than you might naively think, because of these improvements,
Speaker 1 both on the hardware side and on the algorithmic side.
Speaker 2 Yeah, I mean, I've been pretty excited lately about how could we dramatically speed up the chip design process.
Speaker 2 Because as we were talking earlier,
Speaker 2 the current way in which you design a chip takes you roughly 18 months to go from we should build a chip to something that you then hand over to TSMC and then TSMC takes
Speaker 2 four months to fab it and then you get it back and you put it in your data centers. So that's a pretty lengthy cycle and the fab time in there is a pretty
Speaker 2 you know, a small portion of it today. But if you could make that the dominant portion, so that instead of taking 12 to 18 months and 150 people to design the chip, you could shrink that to a few people with a much more automated search process, exploring the whole design space of chips and getting feedback from all aspects of the chip design process for the kind of choices the system is trying to explore at the high level.
Speaker 2 Then I think you could get
Speaker 2 perhaps much more exploration and more rapid design of something that you actually want to give to a fab. And that would be great because you can shrink that time.
Speaker 2 You can shrink the deployment time by kind of designing the hardware in the right way so that you just get the chips back and you just plug them in to some
Speaker 2 system.
Speaker 2 And that will then, I think, enable a lot more specialization.
Speaker 2 It will enable a shorter timeframe for the hardware design so that you don't have to look out quite as far into what kind of ML algorithms would be interesting.
Speaker 2 Instead, it's like you're looking at six to nine months from now,
Speaker 2
what should it be rather than two, two and a half years. And that would be pretty cool.
Speaker 1 I do think that fabrication time, if that's in your inner loop of improvement, is going to limit you. How long is it?
Speaker 2 The leading-edge nodes, unfortunately, are taking longer and longer because they have more metal layers than previous, older nodes.
Speaker 2 So that tends to make it take anywhere from three to five months. Okay.
Speaker 1 But that's how long training runs take anyway, right? So you could potentially do both at the same time.
Speaker 2 Yeah, potentially.
Speaker 1 Okay, so I guess you can't get sooner than three to five months. But also, you're rapidly developing new algorithmic ideas, right?
Speaker 2 Those can move fast, can run on existing chips, and explore lots of cool ideas.
Speaker 1 Yeah. So isn't that a situation in which, I think people sort of expect, ah, there's going to be a sigmoid. Again, this is not a sure thing, but is this a possibility? The idea that you have sort of an explosion of capabilities very rapidly toward the tail end of human intelligence, that gets smarter and smarter at a more and more rapid rate.
Speaker 2
Quite possibly. Yeah.
I mean, I like to think of it like this, right? Like right now, we have models that can take a pretty complicated problem and can break it down
Speaker 2 internally in the model into a bunch of steps, can sort of puzzle together the solutions for those steps and can often give you a solution to the entire problem that you're asking it. But it
Speaker 2 isn't super reliable and it's good at breaking things down into
Speaker 2 five to 10 steps, not 100 to 1,000 steps. So if you could go from, yeah, 80% of the time it can give you a perfect answer to something that's 10 steps long to something that
Speaker 2 90% of the time it can give you a perfect answer to something that's 100 to 1,000 sub-problem steps long, that would be an amazing improvement in the capability of these models.
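A quick sanity check on how demanding that jump is, assuming (purely for illustration) that the steps succeed independently: the required per-step reliability rises sharply with chain length.

```python
# A k-step chain with per-step success probability p succeeds with p**k, so
# the per-step reliability needed for a target end-to-end rate is target**(1/k).
for k, target in ((10, 0.80), (1_000, 0.90)):
    print(f"{k:>5} steps at {target:.0%} end-to-end -> p per step ~ {target ** (1 / k):.5f}")
# 10 steps at 80% needs ~0.978 per step; 1,000 steps at 90% needs ~0.99989.
```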
Speaker 2 And, you know, we're not there yet, but I think that's what we're aspirationally trying to get to: is yeah, we don't need new hardware for that. But I mean,
Speaker 2 we'll take it. Yeah,
Speaker 2 yeah, exactly.
Speaker 2 Never look new hardware in the mouth.
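A rough back-of-the-envelope version of the reliability math being described here, assuming each sub-step succeeds independently with a fixed probability; the per-step numbers below are derived from the figures in the conversation, not stated in it:

```python
# If each sub-step succeeds independently with probability p, the chance of a
# perfect end-to-end answer over n steps is roughly p**n. Illustrative only.

def end_to_end_success(p_step: float, n_steps: int) -> float:
    return p_step ** n_steps

def required_step_reliability(p_target: float, n_steps: int) -> float:
    # Per-step reliability needed to hit p_target over n_steps.
    return p_target ** (1.0 / n_steps)

# "80% on a 10-step problem" implies roughly this per-step reliability:
print(required_step_reliability(0.80, 10))    # ~0.978

# "90% on a 1,000-step problem" would demand far more per step:
print(required_step_reliability(0.90, 1000))  # ~0.9999
```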
Speaker 3 One of the big areas of improvement, I think, in the near future is this inference time compute, like applying more compute
Speaker 3 at inference time. And I guess the way I've liked to describe it is that
Speaker 3 even some giant language model,
Speaker 3 even if you're doing, say, a trillion operations per token, which is
Speaker 3 more than most people are doing these days,
Speaker 3 operations cost something like 10 to the negative 18 dollars.
Speaker 3 And so you're getting like a million tokens to the dollar, right? Compare that to a relatively cheap pastime: you go out and you buy a paperback book and read it.
Speaker 3 You're paying like 10,000 tokens to the dollar. So talking to a language model is like 100 times cheaper than reading a paperback.
Speaker 3 So there is a huge amount of headroom there to say, okay, can we make this thing more expensive
Speaker 3 but smarter? Because we're like 100x cheaper than reading a paperback. We're like 10,000 times cheaper than talking to a customer support agent.
Speaker 3 We're like a million times or more cheaper than
Speaker 3 hiring a software engineer or talking to your doctor or lawyer.
Speaker 3 Can we add, you know,
Speaker 3 add computation and
Speaker 3 make it smarter? So I think
Speaker 3 a lot of the takeoff that
Speaker 3 we're going to see in the very near future is of this form. Like we've we've been exploiting and improving pre-training a lot in the past and post-training.
Speaker 3 And those things will continue to improve, but like taking advantage of, you know, think harder
Speaker 3 at inference time is going to just be an explosion.
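The cost comparison above works out roughly like this; the book price and token count are illustrative assumptions, not figures from the episode:

```python
# Reproducing the rough cost comparison in the transcript.

ops_per_token = 1e12          # "a trillion operations per token"
dollars_per_op = 1e-18        # "10 to the negative 18 dollars"

dollars_per_token = ops_per_token * dollars_per_op   # 1e-6
tokens_per_dollar_llm = 1 / dollars_per_token        # ~1,000,000

# A paperback: assume ~100,000 tokens of text for ~$10 (my assumption).
tokens_per_dollar_book = 100_000 / 10                # ~10,000

print(tokens_per_dollar_llm / tokens_per_dollar_book)  # ~100x cheaper
```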
Speaker 2 Yeah, and an aspect of inference time is I think you want the system to be actively exploring a bunch of different potential solutions.
Speaker 2 You know, maybe it does some searches on its own and gets some information back and like consumes that information and figures out, oh, now I would really like to know more about this thing.
Speaker 2 So now it kind of iteratively kind of explores how to best solve the high-level problem you pose to this system.
Speaker 2 And I think having a dial where you can make the model give you better answers with more inference time compute. seems like we have a bunch of techniques now that seem like they can kind of do that.
Speaker 2 And the more you crank up the dial, the more it costs you in terms of compute, but the better the answers get. That seems like a nice trade-off to have, because sometimes you want to think really hard because it's a super important problem, and sometimes you probably don't want to spend enormous amounts of compute to compute the answer to one plus one.
Speaker 2 Maybe, rather than coming up with new axioms of set theory, the system decides to use a calculator tool or something instead of a very large language model.
Speaker 1 Are there any impediments to inference time scaling, having some way in which you can just linearly scale up inference time compute? Or is this basically a problem that's sort of solved, and we know how to throw 100x compute, or 1,000x compute, at it and get correspondingly better results?
Speaker 3 Well, we're working out the algorithms as we speak. So
Speaker 3 I believe
Speaker 3 we'll see better and better solutions to this as these many more than 10,000 researchers
Speaker 3 are hacking at it, many of them at Google.
Speaker 2 I mean, I think we do see some examples in our own experimental work of things where, if you apply more inference time compute, the answers are better: if you apply 10x, you can get better answers than with x amount of compute at inference time. And that seems useful and important.
Speaker 2 But I think what we would like is when you apply 10x to get, you know, even a bigger improvement in the quality of the answers than we're getting today.
Speaker 2 And so that's about designing new algorithms, trying new approaches,
Speaker 2 figuring out how best to spend that 10x instead of x to improve things.
Speaker 1 Does it look more like search or does it look more like just keep it going in the linear direction for a longer time?
Speaker 2 I mean, I think search is,
Speaker 2 I really like Rich Sutton's paper that he wrote about the bitter lesson. And the bitter lesson effectively is this nice one-page paper, but the essence of it is
Speaker 2 you can try lots of approaches, but the two techniques that are incredibly effective are learning and search.
Speaker 2 And you can apply and scale those, algorithmically and computationally, and you often will then get better results than any other kind of approach, across a pretty broad variety of problems.
Speaker 2 And so I think search has got to be part of the solution to spending more inference time is you want to maybe explore a few different ways of solving this problem.
Speaker 2 And like, oh, that one didn't work, but this one worked better. So now I'm going to explore that a bit more.
Speaker 1 How does this change your plans for future data center
Speaker 1 planning and so forth where if you know
Speaker 1 can this kind of search be done asynchronously? Does it have to be online, offline? How does that change how big of a campus you need and those kinds of considerations?
Speaker 2 I mean, I think one general trend is it's clear that inference time compute, you know, you have a model that's pretty much already trained and you want to do inference on it, is going to be a growing and important class of computation that maybe you want to specialize hardware more around that.
Speaker 2 You know, actually, the first TPU was specialized for inference and wasn't really designed for training. And then subsequent TPUs were really designed more around training and also for inference.
Speaker 2 But it may be that, when you have something where you really want to crank up the amount of compute you use at inference time, even more specialized solutions will make a lot of sense.
Speaker 1 Does that mean you can accommodate more asynchronous training?
Speaker 2 Training or inference?
Speaker 1 Or just you can have
Speaker 1 the different data centers don't need to talk to each other. You can just like have them do a bunch of
Speaker 2 yeah I mean I think
Speaker 2 I like to think of it as is the
Speaker 2 inference that you're trying to do latency sensitive, like the user's actively waiting for it, or is it kind of a background thing? And maybe that's
Speaker 2 I have some inference tasks that I'm trying to run over a whole batch of data, but it's not for a particular user. It's just I want to run inference on it and extract some information.
Speaker 2 And then there's probably a bunch of things that we don't really have very much of right now, but you're seeing inklings of it in our deep research tool that we just released.
Speaker 2 I forget exactly when, like a week ago,
Speaker 2 where you can give it a pretty complicated high-level task like, hey, can you go off and research the history of renewable energy and all the trends and costs for wind and solar and other kinds of techniques, put it in a table, and give me a full eight-page report.
Speaker 2 And it will come back with an eight-page report with like 50 entries in the bibliography. It's pretty remarkable, but you're not actively waiting for that for one second.
Speaker 2 It takes like, you know, a minute or two to go do that.
Speaker 2 And I think there's going to be a fair bit of that kind of compute. And that's the kind of thing where you have
Speaker 2 some UI questions around, okay, if you're going to have a user with 20 of these kinds of asynchronous tasks happening in the background, and maybe each one of them needs to get more information from the user, like, I found your flights to Berlin, but there are no non-stop ones. Are you okay with one that has a stop?
Speaker 2 How does that flow work when you kind of
Speaker 2 need a bit more information and then you want to put it back in the background for it to continue doing, you know, finding the hotels in Berlin or whatever?
Speaker 2 I think it's going to be pretty interesting and inference will be useful.
Speaker 3 Inference will be useful. I mean, there's also a compute efficiency thing in inference that you don't have in training, and that
Speaker 3 in general, transformers can use the sequence length as a batch during training, but they can't really in inference, because you're generating one token at a time. So
Speaker 3 there may be different hardware and inference algorithms that we design for the purposes of being efficient at inference.
Speaker 2 Yeah, like as a good example
Speaker 2 of an algorithmic improvement is the use of drafter models. So you have a really small language model that decodes one token at a time and predicts, say, the next four tokens.
Speaker 2 And then you give that to the big model and you say, okay, here's the four tokens the little model came up with. Check which ones you agree with.
Speaker 2 And if you agree with the first three, then you just advance. And then you've basically been able to do a four
Speaker 2 token width parallel computation instead of a one token width thing in the big model.
Speaker 2 And so those are the kinds of things that people are looking at to improve inference efficiency.
Speaker 2 So you're not, don't have this single token decode bottleneck.
Speaker 3 Basically, the big model is being used as a verifier.
Speaker 3 Yeah. The generator and verification you can then.
Speaker 2
Right. Hello.
How are you? That sounds great to me. I'm going to advance past that.
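A toy sketch of the drafter-model idea described above, often called speculative decoding. The draft and target models here are stand-in callables, and a real system would score all draft positions with a single parallel pass of the big model and compare token distributions rather than exact matches; this only shows the accept-the-agreeing-prefix control flow:

```python
from typing import Callable, List

# Each "model" is a stand-in: a function mapping a token prefix to its
# next-token choice.
NextToken = Callable[[List[int]], int]

def speculative_step(prefix: List[int], draft: NextToken, target: NextToken,
                     k: int = 4) -> List[int]:
    # 1. The small draft model proposes k tokens, one at a time.
    proposal = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2. The big model checks each proposed position; keep the prefix it
    #    agrees with, then substitute its own token at the first mismatch.
    accepted = []
    ctx = list(prefix)
    for t in proposal:
        t_big = target(ctx)
        if t_big == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(t_big)   # the big model's correction
            break
    return prefix + accepted
```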
Speaker 1 So a big discussion has been about
Speaker 1 we're already tapping out like nuclear power plants in terms of delivering power into one single campus. And so do we have to have just
Speaker 1 two gigawatts in one place, five gigawatts in one place? Or can it be more distributed and still be able to train a model?
Speaker 1 Does this new regime of inference scaling make different considerations there plausible? Or how are you thinking about multi-data center training now?
Speaker 2 I mean, we're already doing it. So
Speaker 2 we're pro multi-data center training.
Speaker 2 I think in the Gemini 1.5 tech report, we said we used multiple metro areas and trained with some of the compute in each place and then a pretty
Speaker 2 long latency, but high bandwidth connection between those data centers. And that works fine.
Speaker 2 It's great. Actually, training is kind of interesting, because each step in the training process for a large model is usually at least a few seconds, so
Speaker 2 the latency of the other data center being 50 milliseconds away doesn't matter that much.
Speaker 3 Just the bandwidth, you know? Yeah, just bandwidth. As long as you can
Speaker 3 sync all of the parameters of the model across the different data centers and accumulate all the gradients within the time it takes to do one step, you're pretty good.
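A rough feasibility check for this "bandwidth, not latency" point: can the gradient exchange fit within one training step? All the numbers below are illustrative assumptions, not Google's actual model sizes, link speeds, or step times:

```python
# Illustrative numbers only (assumed).
params = 500e9                 # a 500B-parameter model
bytes_per_grad = 2             # bf16 gradients
link_bandwidth = 10e12 / 8     # ~10 Tb/s of aggregate cross-metro bandwidth, in bytes/sec
step_time_s = 5.0              # a few seconds per training step

transfer_s = params * bytes_per_grad / link_bandwidth
print(f"gradient exchange: {transfer_s:.1f}s vs step time {step_time_s}s")
# Here ~0.8s of exchange fits comfortably inside a ~5s step, and the 50ms of
# one-way latency is negligible. With thinner links you would need gradient
# compression or more overlap of communication with compute.
```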
Speaker 2 And then we have a bunch of work from even early Brain days, when we were using CPU machines and they were really slow.
Speaker 2 So we needed to do asynchronous training to help scale,
Speaker 2 where each copy of the model would kind of do some local computation and then send gradient updates to a centralized system and then apply them asynchronously, and another copy of the model would be doing the same thing.
Speaker 2 You know, it makes your model parameters kind of wiggle around a bit and it makes people uncomfortable with the theoretical guarantees, but it actually seems to work in practice.
Speaker 2 In practice, it works.
Speaker 3 It was so pleasant to go from async to sync, because your experiments are now
Speaker 2 replicable, rather than your results depending on whether there was a web crawler running on the same machine as one of your compute servers.
Speaker 2 So I am so much happier running on TPU pods. I love it. Async just wants to steal two iPhones and an Xbox or whatever.
Speaker 2 Yeah, what if we could give you asynchronous but reproducible results? So one way to do that is you effectively record the sequence of operations, like which gradient update happened when, and on which batch of data. You don't necessarily record the actual gradient update in a log, but you could replay that log of operations
Speaker 2 so that you get repeatability.
Speaker 2 Then I think you'd be happy
Speaker 2 then. Possibly.
Speaker 2 At least you could debug what happened. Yeah.
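A minimal sketch of the "record the sequence of operations and replay it" idea. The record fields and file format here are invented for illustration; the point is only that logging the order, batch, and parameter version of each asynchronous update is enough to replay a run deterministically:

```python
import json
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class UpdateRecord:
    step: int           # global order in which the update was applied
    replica_id: int     # which model copy computed the gradient
    batch_id: int       # which data batch it used
    param_version: int  # parameter version the gradient was computed against

def log_update(log: List[UpdateRecord], rec: UpdateRecord, path: str) -> None:
    # Append the record to an in-memory list and a line-delimited JSON log.
    log.append(rec)
    with open(path, "a") as f:
        f.write(json.dumps(asdict(rec)) + "\n")

def replay(path: str, apply_fn) -> None:
    # Re-apply updates in the recorded global order: same batches, same
    # staleness, so the replayed run matches the original async run.
    with open(path) as f:
        for line in f:
            apply_fn(UpdateRecord(**json.loads(line)))
```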
Speaker 3 But you wouldn't necessarily be able to compare two training runs, because, okay, I made one change in the hyperparameters, but also I had a
Speaker 2 crawler on the machine.
Speaker 3 And there were a lot of people streaming the Super Bowl at the same time.
Speaker 2 I mean,
Speaker 2 the thing that let us go from asynchronous training on CPUs to fully synchronous training is the fact that we have these super fast TPU hardware chips, and then pods, which have incredible amounts of bandwidth between the chips in a pod.
Speaker 2 And then scaling beyond that, we have like really good data center networks and even cross-metro area networks that enable us to scale to many, many pods in multiple metro areas for our largest training runs.
Speaker 2 And we can do that fully synchronously, as Noam said, as long as the gradient accumulation and communication of the parameters across metro areas happens fast enough relative to the step time, you're golden.
Speaker 2 You don't really care.
Speaker 2 But I think as you scale up, there may be a
Speaker 2 push to have a bit more asynchrony in our systems than we have now because like we can make it work.
Speaker 2 Our ML researchers have been really happy with how far we've been able to push synchronous training, because it's an easier mental model to understand.
Speaker 2 You just have your algorithm sort of fighting you, rather than the asynchrony and the algorithm both battling you.
Speaker 3 As you scale up, there are more things fighting you, you know, like there's
Speaker 3
right. That's the problem with the, you know, with scaling that you don't actually always know what it is that's fighting you.
Is it
Speaker 3 the fact that you've pushed like quantization a little too far in some place or another? Or is it your data?
Speaker 2 Or maybe it's your adversarial machine, MUQQ17, that is
Speaker 2 setting the seventh bit of the exponent in all your gradients or something. Right.
Speaker 3 And all of these things just...
Speaker 3 make the model slightly worse so you don't even know that the thing is going on.
Speaker 2 So that's actually a bit of a problem with neural nets is they're so tolerant of noise.
Speaker 2 You can have things set up kind of wrong in a lot of ways, and they just kind of figure out ways to work around that or learn. And despite that, you could have bugs in your code.
Speaker 3
Most of the time, that does nothing. Some of the time it makes your model worse.
Some of the time it makes your model better.
Speaker 3 And then you discover something new because you never tried this bug at scale before because you
Speaker 3 didn't have the budget for it.
Speaker 1 What does it practically look like to debug or decode what's going on?
Speaker 1 You've got these things, some of which are making the models better, some of which are making them worse.
Speaker 1 When you go into work tomorrow, you're like, all right, what's going on here? How do you figure out what the most salient inputs are?
Speaker 3 Right. I mean, well, at small scale, you do lots of experiments.
Speaker 3 So, I mean, there's, I think, one part of the research that involves, okay, I want to like invent these improvements or breakthroughs kind of in isolation, in which case you want a nice simple code base that you can fork and hack and like some baselines.
Speaker 3 And
Speaker 3 my dream is I wake up in the morning,
Speaker 3 come up with an idea, hack it up in a day, run some experiments, get some initial results in a day, like, okay, this looks promising. These things worked, these things worked and didn't work.
Speaker 3 And I think that
Speaker 3 is very achievable because
Speaker 3 at small scale, as long as you keep your, you know, keep a nice experimental code base.
Speaker 2 And maybe an experiment takes an hour to run or two hours or something,
Speaker 2 not two weeks. It's great.
Speaker 3
It's great. So there's that part of the research.
And then there's some amount of scaling up.
Speaker 3 And then you have the part which is like integrating, where you want to stack all the improvements on top of each other and see if they work at large scale and see if they work all in conjunction with each other.
Speaker 2 Right. You think maybe they're independent, but actually maybe there's some funny interaction between
Speaker 2 improving the way in which we handle video data input and the way in which we
Speaker 2 update the model parameters or something. And
Speaker 2 that interacts more for video data than some other thing.
Speaker 2 There's all kinds of interactions that can happen that you maybe
Speaker 2 don't anticipate.
Speaker 2 And so you want to run these experiments where you're then putting a bunch of things together and then periodically making sure that all the things you think are good are good together.
Speaker 2 And if not, understanding why they're not playing nicely.
Speaker 1 Two questions.
Speaker 1 One, how often does it end up being the case that things don't stack up well together? Is it like a rare thing or does it happen all the time?
Speaker 3 It happens.
Speaker 2 Happens all the time.
Speaker 2 Yeah, I mean, I think most things you don't even try to stack because they,
Speaker 2 you know, the initial experiment didn't work that well or it showed results that aren't that promising relative to the baseline.
Speaker 2 And then you sort of take those things and you try to scale them up individually. And then you're like, oh, yeah, these ones seem really promising.
Speaker 2 So I'm going to now include them in something that I'm going to now bundle together and try to advance
Speaker 2 and combine with other things that seem promising.
Speaker 2 And then you run the experiments and then you're like, oh, well, they didn't really work that well. Let's try to debug why.
Speaker 3 And then there are trade-offs, because you want to keep your integrated system
Speaker 3 as clean as you can, because
Speaker 3 complexity, in the code base and in the algorithms, hurts: it makes things slower and introduces more risk.
Speaker 3 And then, you know, at the same time, you want to, you want it to be as good as possible. And of course,
Speaker 3 every individual researcher
Speaker 3 wants his inventions to go into it.
Speaker 2 So
Speaker 3 there are definitely
Speaker 3 challenges there, but we've been working together quite well.
Speaker 1 My sponsors, Jane Street, invented a card game called Figgie in order to teach their new traders the basics of markets and trading.
Speaker 1 I'm a poker fan, and I'd say that Figgie is like poker in the sense that there's hidden information, but it's much more intense and social.
Speaker 1 In poker, you're usually just sitting around waiting for your turn, whereas in Figgie, you spend the whole time just shouting bids and asks at the other players.
Speaker 1 The game is set up such that there's a winner in the end, of course, but during each turn, you are incentivized to find mutually beneficial trades with the other players.
Speaker 1 And in fact, that's the main skill that the game rewards. Figgie simulates the most exciting parts of trading.
Speaker 1 Jane Streeters enjoy it so much that they hold an inter-office Figgie championship every single year.
Speaker 1 You can play it yourself by downloading it on the App Store, or you can find it on desktop at f-i-g-g-i-e.com. All right, back to Jeff and Noam.
Speaker 1 Okay, so then going back to the whole dynamic of
Speaker 1 you find better and better algorithmic improvements, uh, and the models get better and better over time, even if you take the hardware part out of it. Should the world be thinking more about,
Speaker 1 and should you guys be thinking more about this?
Speaker 1 There's one world where you just like AI is a thing that takes like two decades to slowly get better over time, and you can sort of like refine things over, you know, if like you've kind of messed something up, you fix it, uh, and it's like not that big a deal, right?
Speaker 1 It's like not that much better than the previous version you released. There's another world where you have this big feedback loop, which means that
Speaker 1 the two years between Gemini 4 and Gemini 5 are the most important years in human history because you go from
Speaker 1 a pretty good ML researcher to superhuman intelligence because of this feedback loop.
Speaker 1 To the extent that you think that second world is plausible, how does that change how you sort of approach these greater and greater levels of intelligence?
Speaker 3 I've stopped cleaning my garage because I'm waiting for the robots.
Speaker 3 So probably I'm more in the second camp of what we're going to see a lot of acceleration.
Speaker 2 Yeah, I mean, I think it's super important to understand what's going on and what the trends are.
Speaker 2 And I think right now the trends are the models are getting substantially better generation over generation.
Speaker 2 And I don't see that slowing down in the next few generations, probably. So that means the models...
Speaker 2 say two to three generations from now are going to be capable of, you know, let's go back to the example of breaking down a simple task into 10 subpieces and doing it 80% of the time to something that can break down a task, a very high-level task into 100 or 1,000 pieces and get that right 90% of the time.
Speaker 2 That's a major, major step up in what the models are capable of. So I think it's important for people to understand, you know, what's what is happening in the progress in the field.
Speaker 2 And then those models are going to be applied in a bunch of different domains.
Speaker 2 And I think it's really good to make sure that we, we, as society, get the maximal benefits from what these models can do to improve things in, you know, I'm super excited about areas like education and healthcare, you know, making information accessible to all people.
Speaker 2 But we also realize that they could be used for misinformation, they could be used for, you know, automated hacking of computer systems.
Speaker 2 And we want to sort of put as many safeguards and mitigations and understand the capabilities of the models in place as we can. And that's kind of, you know, I think Google as a whole has a really,
Speaker 2 you know, good view to how we should approach this.
Speaker 2 You know, our responsible AI principles actually are a pretty nice framework for how to think about trade-offs of making, you know, better and better AI systems available in different contexts and settings while also sort of making sure that we're doing the right thing in terms of, you know, making sure they're safe and
Speaker 1 You know, not saying toxic things and things like that. I guess the thing that stands out to me, if you were zooming out and looking at this period of human history: if we're in the world where, look, maybe if you do post-training on Gemini 3 badly, it can do some misinformation, but then you fix the post-training and it's going to stop doing that, that's a bad mistake but a fixable mistake, right? Whereas if you have this feedback loop dynamic, which is a possibility,
Speaker 1 then the sort of like mistake of like the thing that catapults this intelligence explosion is like
Speaker 1 misaligned, is
Speaker 1 like not trying to write the code you think it's trying to write and optimizing for some other objective.
Speaker 1 And on the other end of this very rapid process that lasts a couple of years, maybe less, you have things that are approaching Jeff Dean level or beyond, or Noam Shazeer level or beyond.
Speaker 1 And then you have like millions of copies of Jeff Dean level programmers. And
Speaker 1 anyway, that seems like a harder mistake to recover from, and a much more salient one.
Speaker 1 You really have to make sure of that before we go into the intelligence explosion.
Speaker 3 As these systems do get more powerful,
Speaker 3 you've got to be
Speaker 3 more and more careful.
Speaker 2 I mean, one thing I would say is there's like the extreme views on either end. There's like, oh my goodness, these systems are going to be so much better than humans at all things.
Speaker 2
And we're going to be kind of overwhelmed. And then there's the like, these systems are going to be amazing and we don't have to worry about them at all.
I think I'm somewhere in the middle.
Speaker 2 And I've been a, I'm a co-author on a paper called Shaping AI, which is, you know, those two extreme views often kind of view our role as kind of laissez-faire, like we're just going to have the AI develop in the path that it takes.
Speaker 2 And I think there's actually a really good argument to be made that what we're going to do is try to shape and steer the way in which AI is deployed in the world, so that it is maximally beneficial in the areas we want to benefit from, like education and some of the other areas I mentioned, healthcare,
Speaker 2 and steer it as much as we can, maybe with policy-related things, maybe with technical measures and safeguards, away from the computer taking over and having unlimited control of what it can do. So I think that's an engineering problem: how do you engineer safe systems? I think it's the modern equivalent of what we've done in older-style software development. If you look at airplane software development, that has a pretty good record of how you rigorously develop safe and secure systems for doing a pretty risky task.
Speaker 1 The difficulty there is that there's not some feedback loop for the 737.
Speaker 1 You put it in a box with a bunch of compute for a couple of years and it comes out with
Speaker 1 the version 1000.
Speaker 3 I think
Speaker 3 the good news is that
Speaker 3 analyzing text seems to be easier than generating text.
Speaker 3 So I believe that the ability of language models to actually analyze language model output and figure out what is problematic or dangerous will actually be the solution to a lot of these control issues.
Speaker 3 We are definitely working on this stuff. We've got a bunch of brilliant folks at Google working on this now, and I think it's just going to be more and more important, both from a do-something-good-for-people standpoint and from a business standpoint: a lot of the time, you're limited in what you can deploy based on keeping things safe. So it becomes very, very important to be really, really good at that.
Speaker 1 Yeah,
Speaker 1 obviously I know you guys take
Speaker 1 the potential benefits and costs here seriously. And you guys get credit for it, but not enough, I think, for sure.
Speaker 1 It's like there's so many different applications that you have put out for using these models to make the different areas you talked about better.
Speaker 1 But I do think that there,
Speaker 1 again, if you have a situation where plausibly there's some feedback loop process, on the other end you have a model that is as good as Noam Shazeer or as good as Jeff Dean.
Speaker 1 If like, if there's an evil version of you running around, and suppose there's like a million of them,
Speaker 2 I think that's like really, really bad.
Speaker 1 Yeah, that's that could be like much, much worse than any other risk, maybe short of like nuclear war or something.
Speaker 1 Just think about it, like a million evil Jeff Deans or something.
Speaker 2 Where do we get the training data?
Speaker 2 Yeah.
Speaker 1 But to the extent that you think that's like a plausible output of some quick feedback loop process, What is your plan of like, okay, we've got Gemini 3 or Gemini 4, and we think it's like helping us do a better job of training future versions.
Speaker 1 It's writing a bunch of the training code for us from this point forward. We just kind of like look over it, verify it.
Speaker 1 Even the verifiers you talked about of looking at the output of these models will eventually be trained by, or, you know, a lot of the code will be written by the AIs you make.
Speaker 1 You know, like, what do you want to know for sure before we have the Gemini 4 help us with AI research? We really want to make sure
Speaker 1 we want to run this test on it before we let it write our AI code for us.
Speaker 2 I mean, I think having the system explore algorithmic research ideas seems like something where there's still a human in charge: it's exploring the space, it's going to get a bunch of results, and we're going to make a decision, like, are we going to incorporate this particular learning algorithm or change to the system into the core code base.
Speaker 2 And so I think you can put in safeguards like that, which enable us to get the benefits of a system that can sort of self-improve with human oversight, without letting the system go full-on self-improving without any
Speaker 2 notion of a person looking at what it's doing. That's the kind of engineering safeguard I'm talking about, where you want to be looking at the characteristics of the systems you're deploying, not deploy ones that are harmful by some measures, and have an understanding of what their capabilities are and what they're likely to do in certain scenarios.
Speaker 2 So,
Speaker 2 you know, I think it's not an easy problem by any means, but I do think it is possible to make these systems safe.
Speaker 3 Yeah, I mean, I think we are also going to use these systems a lot to check themselves, check other systems. You know, it's
Speaker 3 I mean, even as a human, it is easier to recognize something than to generate it. So,
Speaker 3 you know.
Speaker 2 One thing I would say is if you expose the model's capabilities through an API or through a user interface that people interact with, you know, I think then you have a level of control to understand how is it being used and sort of put some boundaries on what it can do.
Speaker 2 And that, I think, is one of the tools in the arsenal of like,
Speaker 2 how do you make sure that what it's going to do is acceptable by some set of standards you've set out in your mind.
Speaker 3 Yeah, I mean, I think our goal is to empower people. For the most part, we should be mostly letting people do things with these systems that make sense, and closing off as few parts of the space as we can. But yeah, if you let somebody take your thing and create a million evil software engineers, then that doesn't empower people, because
Speaker 3 they're going to hurt others with a million evil software engineers. So I'm against that.
Speaker 1 All right, let's talk about a few more fun topics.
Speaker 2 Yeah.
Speaker 1 Over the last 25 years, what was the most fun time? What period of time do you have the most nostalgia over?
Speaker 2 I mean, I think the early four or five years at Google, when I was one of a handful of people working on search, crawling, and indexing systems, and our traffic was growing tremendously fast, and we were trying to expand our index size and make it so we updated it every minute instead of every month or two if something went wrong.
Speaker 2 And seeing kind of the growth and usage of our systems was really just personally satisfying. You know, building something that is used by, you know, today, 2 billion people a day, I think is
Speaker 2 pretty incredible.
Speaker 2 But I would also say equally exciting is sort of working with people in the Gemini team today.
Speaker 2 I think the progress we've been making in what these models can do over the last year and a half or whatever is really fun. People are really dedicated, really excited about what we're doing.
Speaker 2 I think the models are getting better and better at
Speaker 2
pretty complex tasks. Like if you showed someone using a computer 20 years ago what these models are capable of, they wouldn't believe it.
Right. And even five years ago, they might not believe it.
Speaker 2 And that's pretty satisfying. And I think we'll see a similar growth and usage of these models and impact in the world.
Speaker 3 Yeah, I'm with you.
Speaker 2 I'm with you.
Speaker 3 Early days were super fun.
Speaker 3 I mean, part of that is just like knowing everybody and
Speaker 3 the social aspect and the fact that you're just building something
Speaker 3 that millions and millions of people are using.
Speaker 3 Same thing today. We got that
Speaker 3 whole nice micro kitchen area where you get like lots of people hanging out.
Speaker 3
I love being in person. It's working with a bunch of great people and building something that's helping millions to billions of people.
Like, yeah, what could, what could be better?
Speaker 1 Uh, what is this micro kitchen?
Speaker 2 Oh, we have a micro kitchen area in the building we both sit in. It's the newly named Gradient Canopy.
Speaker 2 It used to be named Charleston East, and we decided we needed a more exciting name because there's a lot of machine learning and AI research happening in there.
Speaker 2 And there's a micro kitchen area that we've set up. Normally it's just an espresso machine and a bunch of snacks, but this particular one has a bunch of space in it, so we've set up maybe 50 desks in there, and people are just hanging out in there. It's a little noisy because people are always grinding beans and making
Speaker 2 espresso, but you also get a lot of face-to-face connections, like, oh, I've tried that, did you think about trying this in your idea? Or, oh, we're going to launch this thing next week, how's the load test looking? There's just lots of feedback that happens.
Speaker 2 And then we have our Gemini chat rooms for people who are not in that micro kitchen. You know, we have a team all over the world.
Speaker 2 And, you know, there's probably 120 chat rooms I'm in related to Gemini things. And, you know, this particular very focused topic, we have like seven people working on this.
Speaker 2 And there's like exciting results being shared by the London colleagues.
Speaker 2 And when you wake up, you see what's happening in there. Or there's a big group of people focused on data, and there are all kinds of issues happening in there. It's just fun.
Speaker 1 What I find remarkable about some of the calls you guys have made is that you're anticipating a level of demand for compute which at the time wasn't obvious or evident, TPUs being a famous example of this, or the first TPU being an example of this,
Speaker 1 that thinking you had in, I guess, 2013 or earlier. If you do that kind of thinking today, and you do an estimate of, look, we're going to have these models that are going to be the backbone of our services, and we're going to be constantly doing inference for them.
Speaker 1 We're going to be training future versions.
Speaker 1 And you think about the amount of compute we'll need by 2030 to accommodate all these use cases.
Speaker 1 Where does the Fermi estimate get you?
Speaker 2 Yeah, I mean, I think
Speaker 2 you're going to want a lot of inference compute is the rough highest level view of these capable models.
Speaker 2 Because if one of the techniques for improving their quality is scaling up the amount of inference compute you use, then all of a sudden what's currently like
Speaker 2 one request to generate some tokens now becomes 50 or 100 or 1,000 times as computationally intensive, even though it's producing the same amount of output.
Speaker 2 And you're also going to see
Speaker 2 tremendous scaling up of the uses of these services, because not everyone in the world has discovered these chat-based conversational interfaces where you can get them to do all kinds of amazing things.
Speaker 2 Probably 10% or 20% of the computer users in the world have discovered that today.
Speaker 2 As that pushes towards 100% and people make heavier use of it, that's going to be another order of magnitude or two of scaling.
Speaker 2 And so you're now going to have
Speaker 2 two orders of magnitude from that, and two orders of magnitude from the inference time compute. The models are probably going to be bigger.
Speaker 2 You'll get another order of magnitude or two from that.
Speaker 2 And there's a lot of inference compute you want. So you want extremely efficient hardware for inference for models you care about.
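Stacking the orders of magnitude just listed gives a number like this; each factor is a rough assumption pulled from the conversation, not a forecast:

```python
# Multiplying out the rough growth factors mentioned above (assumptions only).

inference_scaling = 100   # requests become "50 or 100 or 1,000 times" as intensive
adoption_growth   = 100   # ~10-20% of users today pushing toward ~100%, used more heavily
model_size_growth = 10    # models "another order of magnitude or two" bigger

total_growth = inference_scaling * adoption_growth * model_size_growth
print(f"~{total_growth:,}x more inference compute than today")  # ~100,000x
```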
Speaker 1 In FLOPs, total global inference in 2030?
Speaker 3 I think just more is always going to be better. Like,
Speaker 3 if you just kind of think about, okay, like
Speaker 3 what
Speaker 3 fraction of
Speaker 3 world GDP will be, you know,
Speaker 3 will people decide to spend on
Speaker 3 AI
Speaker 3 at that point? And then, like, okay, what do the AI systems look like?
Speaker 3 Well, maybe it's some sort of personal assistant-like thing that is in your glasses and can see everything around you and has access to all your digital information and the world's digital information.
Speaker 3 And, like, maybe it's like you're Joe Biden and you have the earpiece and the cabinet that can advise you about anything in real time and solve problems for you and give you helpful pointers, or you could talk to it, and it wants to analyze anything it sees around you for any potential useful impact it has on you.
Speaker 3 So, I mean,
Speaker 3 I can imagine, okay, and then say it's like your, okay, your personal assistant or your personal cabinet or something, and that every time you spend 2x as much money on compute, the thing gets like 5, 10 IQ points smarter or something like that.
Speaker 3 And okay,
Speaker 3 would you rather spend like $10 a day and have an assistant or $20 a day and have a smarter assistant?
Speaker 3 And not only is it an assistant in life, but an assistant in getting your job done better because now it makes you from a 10x engineer to a 100x or 10 million X engineer.
Speaker 3 Okay, so let's see.
Speaker 2 From first principles, right.
Speaker 3 So people are going to want to spend
Speaker 3 some fraction of world GDP on this thing.
Speaker 3 The world GDP is almost certainly going to go way, way up to like orders of magnitude higher than it is today due to the fact that we have all of these artificial engineers like working on improving things.
Speaker 3 Probably we will have solved unlimited energy and
Speaker 3
carbon issues by that point. So we should be able to have lots of energy.
We should be able to have millions to billions of robots like building us data centers.
Speaker 3 Like, let's see, what's like the sun is, what, 10 to the 26 watts or something like that.
Speaker 3 You know, I mean, I'm guessing that the amount of compute at the, you know, being used for AI to help each person will be astronomical.
Speaker 2 I mean, I would add on to that. I'm not sure I agree completely, but it's a pretty interesting thought experiment to go in that direction.
Speaker 2 And even if you get partway there, it's definitely going to be a lot of compute. And this is why it's super important to have as cheap
Speaker 2 a hardware platform for using these models and applying them to problems that Noam described.
Speaker 2 so that you can then make it accessible to everyone in some form and have as low a cost for access to these capabilities as you possibly can.
Speaker 2 And I think that's achievable by focusing on hardware and model co-design kinds of things. And we should be able to make these things much, much more efficient than they are today.
Speaker 1 Is Google's data center build-out plan over the next few years aggressive enough given this increase in demand you're expecting?
Speaker 2 I'm not going to comment on our future capital spending because
Speaker 2 our CEO and CFO would probably prefer I don't.
Speaker 2 But I will say, you know, you can look at our past capital expenditures over the last few years and see that we're definitely investing in this area because we think it's important
Speaker 2 and that we're, you know, we're continuing to build new and interesting innovative hardware that we think really helps us have an edge in deploying these systems to more and more people, both training them and also how do we make them usable by people for inference.
Speaker 1 One thing I've heard you talk a lot about is continual learning, the idea that you could just have a model which improves over time rather than having to start from scratch.
Speaker 1 Is there any fundamental impediment to that? Because theoretically, you should just be able to keep fine-tuning a model. Or yeah, what does that future look like to you?
Speaker 2 Yeah,
Speaker 2 I've been thinking about this more and more. And I've been a big fan of models that are sparse because I think you want different parts of the model to be good at different things.
Speaker 2 And our Gemini 1.5 Pro model and other models are mixture-of-experts-style models, where you have parts of the model that are activated for some token and parts that are not activated at all, because you've decided this is a math-oriented thing, and this part's good at math and this part's good at understanding cat images.
Speaker 2 So
Speaker 2 that gives you this ability to have a much more capable model that's still quite efficient at inference time because it has very large capacity, but you activate a small part of it.
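For readers less familiar with the sparse models being described, here is a minimal mixture-of-experts layer in the standard top-k-routing style; it is an illustrative sketch, not Gemini's implementation:

```python
import numpy as np

def moe_layer(x, router_w, expert_ws, k=2):
    # x: [tokens, d_model]; router_w: [d_model, n_experts]
    # expert_ws: list of n_experts weight matrices, each [d_model, d_model]
    logits = x @ router_w                          # [tokens, n_experts]
    topk = np.argsort(-logits, axis=-1)[:, :k]     # top-k expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, topk[t]]
        gates = np.exp(chosen - chosen.max())
        gates /= gates.sum()                       # softmax over the k chosen experts
        for gate, e in zip(gates, topk[t]):
            out[t] += gate * (x[t] @ expert_ws[e]) # only k experts are activated per token
    return out
```

The total parameter count grows with the number of experts, but each token only touches k of them, which is the "very large capacity, small activated part" property described above.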
Speaker 2 But I think the current problem, well,
Speaker 2 one limitation of what we're doing today is it's still a very regular structure where each of the experts is kind of the same size. You know,
Speaker 2 the paths kind of merge back together very fast. They don't sort of go off and sort of have lots of different branches for mathy things that don't merge back together with the
Speaker 2 kind of cat image thing.
Speaker 2 And I think we should probably have a more organic structure in these things. I also would like it if the pieces of those model, of the model could be developed a little bit independently.
Speaker 2 Like right now, I think we have this issue where we're going to train a model.
Speaker 2 So we do a bunch of preparation work on deciding the most awesome algorithms we can come up with and the most awesome data mix we can come up with. But there's always trade-offs there.
Speaker 2 Like, we'd love to include more multilingual data, but that might come at the expense of including less coding data.
Speaker 2 And so the model is less good at coding, but better at multilingual, or vice versa. And I think it would be really great if we could have
Speaker 2 a small set of people who care about a particular subset of languages go off and create really good training data, train
Speaker 2 a modular piece of a model that we can then hook up to a larger model to improve its capability in, say, Southeast Asian languages, or in
Speaker 2 reasoning about Haskell code or something. And then you also get a nice software engineering benefit, where you've decomposed the problem a bit, compared to what we do today, which is this monolithic process of pre-training a model with a whole bunch of people working on it. If we could do that, you could have a hundred teams around Google, people all around the world, working to improve the languages they care about or the particular problems they care about, all collectively working on improving the model.
Speaker 2 And that's a kind of a form of continual learning.
Speaker 3 That would be so nice. You could just glue models together, rip out pieces of models, and shove them into others, like Dr...
Speaker 2 Frankenstein.
Speaker 3 ...kind of thing. Or you just attach a fire hose and suck all the information out of this model and shove it into another model.
Speaker 3 There is, I mean, the countervailing
Speaker 3 interest there is sort of science in terms of like, okay, we're still in the period of rapid progress.
Speaker 3 So if you want to do sort of controlled experiments and, okay, you know, I want to compare this thing to that thing because that is helping us figure out, okay, what do you want to build?
Speaker 3 So in that interest, it's often best to just start from scratch,
Speaker 3 so you can compare one complete training run to another at a practical level, because it helps us figure out what to build in the future.
Speaker 3 And it's less exciting, but
Speaker 3 does lead to rapid progress.
Speaker 2 Yeah, I think there may be ways to get a lot of the benefits of that with kind of a version system of modularity.
Speaker 2 Like I have a frozen version of my model, and then I include a different variant of some particular module, and I want to compare its performance or train it a bit more.
Speaker 2 And then I compare it to the baseline of this thing with version N-prime of this particular module that does Haskell interpretation.
Speaker 3 Actually, that could lead to faster research progress, right? You've got some system and you do something to improve it.
Speaker 3 And if that thing you're doing to improve it is relatively cheap compared to training the system from scratch, then it could actually make
Speaker 3 research much, much cheaper and faster. Yeah.
Speaker 2 And also more parallelizable, I think, because you can parallelize across people.
Speaker 3 Okay, let's figure that out and do it next. Yeah.
Speaker 1 So this idea, which is sort of casually laid out there, would actually be a big regime shift, if this is the way things are headed.
Speaker 1 It's a very interesting prediction: you just have this blob where things are getting pipelined back and forth, and if you want to make something better, you can do a sort of surgical incision, almost.
Speaker 2 Right. Or grow the model, add another little bit of it here.
Speaker 2 Yeah, I've been sketching out this vision for a while under the Pathways name.
Speaker 1 Yeah, you've been building the infrastructure for it.
Speaker 2 So a lot of what Pathways, the system, can support is this kind of twisty, weird model with asynchronous updates to different pieces.
Yeah, we should go back and forth.
Speaker 2 And we're using Pathways to train our Gemini models, but we're not making use of some of its capabilities yet.
Speaker 2 But
Speaker 2 maybe we should.
Speaker 2 Maybe.
Speaker 3 There have been times, like, you know, like
Speaker 3
the way the TPU pods were set up. I don't know who did that, but they did a pretty brilliant job.
You know, the low-level software stack and the hardware stack, but okay, you've got your,
Speaker 3 you know, you've got your nice regular high-performance hardware, you've got these great torus-shaped interconnects, and then you've got the right low-level collectives, you know, the all-reduces,
Speaker 3 et cetera, which I guess came from supercomputing, but it turned out to be kind of just the right thing to
Speaker 3 build distributed deep learning on top of.
Speaker 1 Okay, so a couple of questions. One, suppose you do figure, suppose NOAA makes another breakthrough and now we've got a better architecture.
Speaker 1 Would you just take each compartment and distill it into this better architecture? And that's how it keeps improving over time.
Speaker 2 Yeah, I mean I do think distillation is a really useful tool because it enables you to kind of transform a model in its current model architecture form into a different form.
Speaker 2 You know often you use it to take a really capable but kind of large and unwieldy model and distill it into a smaller one that maybe you want to serve with really good fast latency inference characteristics.
Speaker 2 But I think you can also view this as something that's happening at the modularity, at the module level.
Speaker 2 Like, maybe there'd be a continual process where each module has a few different representations of itself: it has a really big one, and it's got a much smaller one that the big one is continually distilling into. And then once that's finished, you delete the big one, add a bunch more parameter capacity, and start to learn all the things that the distilled small one doesn't know by training it on more data.
Speaker 2 And then you kind of repeat that process.
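A sketch of the distillation step in that loop: the small module learns to match the big module's output distribution, after which the big one can be dropped. This is the standard temperature-scaled distillation loss, shown only to make the mechanics concrete:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-softened softmax over the last axis.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on softened distributions, scaled by T^2 as is
    # conventional so gradients stay comparable across temperatures.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-9) - np.log(p_s + 1e-9)), axis=-1)
    return float(np.mean(kl) * T * T)
```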
Speaker 1 And if you have that kind of thing running in a thousand different places in your modular model in the background, that seems like it would work reasonably well. This could also be how you do inference scaling: the router decides how much compute you want to spend. You can have multiple versions, like, oh, this is an easy math problem, so I'm going to route it to the really tiny math-distilled thing, and oh, this one's really hard. At least from public research, it seems like it's often hard to decode what each expert is doing in mixture-of-experts-type models. If you have something like this, how would you enforce the kind of modularity that would be visible and understandable to us?
Speaker 2 Actually,
Speaker 3 in the past, I found experts to be relatively easy to understand. I mean, in the first mixture of experts paper, you could just look at them.
Speaker 2 The inventor of mixture of experts.
Speaker 3 Yeah, you could just see, okay, we did a thousand, two thousand experts, and this expert was getting all of the words referring to cylindrical objects.
Speaker 2
You know, like this one's super good at dates. Yeah.
Yeah. Talking about time.
Speaker 3 It was actually
Speaker 3 pretty
Speaker 3 easy to do.
Speaker 3 But I mean, like, not that you would need that human understanding to like figure out how to like work the thing at runtime because you you just have like some sort of learned router that's looking at the example.
Speaker 2 And I mean, one thing I would say is like, there is a bunch of work on interpretability of models and what are they doing inside.
Speaker 2 And sort of expert level interpretability is a sub-problem of that broader area.
Speaker 2 I really like some of the work that my former intern, Chris Olah, and others did at Anthropic, where they
Speaker 2 trained a very sparse autoencoder and were able to deduce what characteristics some particular neuron in a large language model responds to.
Speaker 2 So they found like a Golden Gate Bridge neuron that's activated when you're talking about the Golden Gate Bridge.
Speaker 2 And I think, you know, you could do that at the expert level. You could do that at a variety of different levels and get pretty interpretable
Speaker 2
results. And it's a little unclear if you necessarily need that.
If the model is just really good at stuff, you know,
Speaker 2 we don't necessarily care what every neuron in the Gemini model is doing as long as the collective output and characteristics of the overall system are good.
Speaker 2 You know, that's one of the beauties of deep learning is you don't need to understand or hand engineer every last feature.
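For context on the sparse autoencoder work mentioned above: the rough idea is to learn an overcomplete, sparsely activating set of features over a model's internal activations, so that individual features (like a Golden Gate Bridge feature) become human-inspectable. A minimal sketch, not Anthropic's actual setup:

```python
import numpy as np

def sae_forward(acts, W_enc, b_enc, W_dec, l1_coef=1e-3):
    # acts:  [batch, d_model]  internal activations from some layer
    # W_enc: [d_model, d_features] with d_features >> d_model (overcomplete)
    # W_dec: [d_features, d_model]
    features = np.maximum(acts @ W_enc + b_enc, 0.0)   # ReLU gives sparse codes
    recon = features @ W_dec                           # reconstruct the activations
    recon_loss = np.mean((recon - acts) ** 2)
    sparsity = l1_coef * np.mean(np.abs(features))     # L1 pushes most features to zero
    return features, recon_loss + sparsity
```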
Speaker 1 Man, there are so many interesting implications of this that I could just keep asking you about.
Speaker 1 One implication is currently if you have a model that has some tens or hundreds of billions of parameters, you can
Speaker 1 serve it on like a handful of
Speaker 1 GPUs. In this system, where
Speaker 1 any one query might only make its way through a small fraction of the total parameters, but you need the whole thing sort of loaded into memory.
Speaker 1 The specific kind of infrastructure that Google has invested in with these TPUs that exist in pods of hundreds or thousands would be like immensely valuable, right?
Speaker 3
I mean, for any sort of even existing mixtures of experts, you want the whole thing in memory. Yeah.
I mean, basically, if you are, I guess there's kind of this misconception
Speaker 3 running around with like mixture of experts that, okay, the benefit is that
Speaker 3 you don't even have to go through those weights in the model,
Speaker 3 if some expert is unused. It doesn't mean that you don't have to retrieve that memory because really, in order to be efficient, you're serving at very large batch sizes.
Speaker 2 Of independent requests.
Speaker 2 Right.
Speaker 3 Of an independent request. So
Speaker 3 it's not really the case that, okay,
Speaker 3 at this step, you're either looking at this expert or you're not looking at this expert.
Speaker 3 Because if that were the case, then when you did look at the expert, you would be running it at batch size one, which is massively inefficient. With modern hardware, the operational intensities are in the hundreds or so. So that's not what's happening: you are looking at all the experts, but you only have to send a small fraction of the batch through each one. You still have a smaller batch at each expert that goes through.
Speaker 2 And in order to get kind of reasonable balance,
Speaker 2 like one of the things that the current models typically do is they have all the experts be roughly the same compute cost.
Speaker 2 And then you run roughly the same size batches through them in order to sort of propagate the very large batch you're doing at inference time in and have good efficiency.
Speaker 2 But I think you know, you often in the future might want experts that vary in computational costs by factors of 100 or 1,000.
Speaker 2 Or maybe paths that go for many layers on one case and
Speaker 2 a single layer or even a skip connection in the other case.
Speaker 2 And there, I think you're going to want very large batches still, but you're going to want to kind of push things through the model a little bit asynchronously at inference time, which is a little easier than training time.
Speaker 2 And that's part of kind of one of the things that Pathways was designed to support is
Speaker 2 you have these components and the components can be variable cost. And you kind of can say, for this particular example, I want to go through this subset of the model.
Speaker 2 And for this example, I want to go through this subset of the model and have them kind of
Speaker 2 the system kind of orchestrate that.
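What "send a small fraction of the batch through each one" looks like in code, as a sketch: every expert's weights stay resident in memory, and what varies per expert is only the size of its sub-batch, which is why badly balanced experts hurt efficiency:

```python
import numpy as np

def dispatch_and_combine(x, expert_ids, expert_ws):
    # x: [tokens, d_model]; expert_ids: [tokens] (top-1 routing for simplicity)
    # expert_ws: list of per-expert weight matrices, each [d_model, d_model]
    out = np.zeros_like(x)
    for e, W in enumerate(expert_ws):
        idx = np.where(expert_ids == e)[0]   # this expert's sub-batch
        if idx.size == 0:
            continue                         # weights are still resident in memory
        out[idx] = x[idx] @ W                # one batched matmul per expert
    return out
```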
Speaker 1 It also would mean that
Speaker 1 it would take companies of a certain size and sophistication to do this. Right now, anybody can train a sufficiently small model, but if it ends up being the case that this is the best way to train future models, then you would need a company that can basically have a data center serving a single, quote-unquote, blob or model. So it would be an interesting change in paradigms in that way as well.
Speaker 3 You definitely want to have at least enough HBM to fit your whole model, so depending on the size of your model, most likely that's how much HBM you'd want to have at a minimum.
Speaker 2 But it also means, I think, you don't necessarily need to grow your entire model footprint to be the size of a data center. You might want it to be
Speaker 2 a bit below that
Speaker 2 and then have
Speaker 2 potentially many replicated copies of one particular expert that is being used a lot so that you get better load balance. Right.
Speaker 2 So like this one's being used a lot because we get a lot of math questions. And this one on, you know, maybe it's an expert on Tahitian dance and it is called on really rarely.
Speaker 2 That one, maybe you even page out to DRAM rather than keeping it in HBM.
Speaker 2 But you want the system to kind of figure all this stuff out based on load characteristics.
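A hedged sketch of the placement policy Jeff is describing: replicate heavily used experts for load balance and keep them in HBM, while paging rarely used ones out to host DRAM. The thresholds, tier names, and expert names are invented for illustration.

```python
# Toy load-based expert placement plan (illustrative policy, not a real system).
from collections import Counter

def plan_placement(request_counts: Counter, hot_share: float = 0.05,
                   cold_requests: int = 10, max_replicas: int = 4):
    total = sum(request_counts.values())
    placement = {}
    for expert, count in request_counts.most_common():
        share = count / total
        if share >= hot_share:
            # Hot expert (say, the math expert): several replicas, all kept in HBM.
            placement[expert] = {"tier": "HBM",
                                 "replicas": min(max_replicas, int(share / hot_share))}
        elif count <= cold_requests:
            # Cold expert (say, Tahitian dance): one copy, paged out to DRAM.
            placement[expert] = {"tier": "DRAM", "replicas": 1}
        else:
            placement[expert] = {"tier": "HBM", "replicas": 1}
    return placement

counts = Counter({"math": 5000, "code": 3000, "poetry": 400, "tahitian_dance": 3})
for expert, plan in plan_placement(counts).items():
    print(expert, plan)
```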
Speaker 1 Right now with language models, obviously, you put in language and you get language out, and obviously it's multimodal. But the Pathways blog post talks about so many different use cases, ones that aren't obviously of this autoregressive nature, going through the same model.
Speaker 1 So could you imagine basically Google as a company, the product, going through this? Google Search goes through this, Google Images goes through this, Gmail goes through this. The entire server is just this one huge, specialized mixture of experts.
Speaker 2 I mean, you're starting to see some of this by having a lot of uses of Gemini models across Google that are not necessarily
Speaker 2 fine-tuned. They're just sort of
Speaker 2 given instructions for this particular use case and this feature in this product setting.
Speaker 2 So I definitely see a lot more sharing of what the underlying models are capable of across more and more services.
Speaker 2 I do think that's a pretty interesting direction to go for sure.
Speaker 1 I feel like people listening might not register how interesting a prediction this is about where you guys are headed. It's sort of like getting Noam on a podcast in 2018 and him saying, yeah, I think language models will be a thing, this is where things are going. That's incredibly interesting.
Speaker 2 Yeah, and I think you might see that there might be a big base model, and then you might want customized versions of that model with different modules added onto it for different settings that maybe have access restrictions. Maybe we have an internal one for Google employees, where we've trained some modules on internal data and we don't allow anyone else to use those modules, but we can make use of them.
Speaker 2 And maybe other companies, you add on other modules that are useful for that company setting and serve it in our cloud APIs.
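Purely as a hypothetical sketch of the "restricted modules on a shared base model" idea: a small registry that lists which add-on modules a given caller may attach. The module names, caller names, and access-control scheme are all invented for illustration.

```python
# Hypothetical module registry with per-caller access restrictions.
MODULES = {
    "internal-google-data": {"allowed": {"google-internal"}},
    "acme-corp-finetune":   {"allowed": {"acme-corp"}},
    "public-coding":        {"allowed": {"*"}},   # anyone may attach this one
}

def modules_for(caller: str) -> list[str]:
    """Module names this caller is allowed to attach on top of the shared base model."""
    return sorted(name for name, m in MODULES.items()
                  if "*" in m["allowed"] or caller in m["allowed"])

print(modules_for("google-internal"))  # ['internal-google-data', 'public-coding']
print(modules_for("acme-corp"))        # ['acme-corp-finetune', 'public-coding']
```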
Speaker 1 What is the bottleneck to making this sort of system viable?
Speaker 1 Is it like systems engineering? Is it ML? Is it
Speaker 2 I mean, it's a pretty different
Speaker 2 way of operating than our current Gemini development. So I think, you know, we will
Speaker 2 explore these kinds of areas and I think make some progress on them, but we need to sort of really
Speaker 2 see evidence that it's the right way,
Speaker 2 you know, that it has a lot of benefits. Some of those benefits may be improved quality, some may be
Speaker 2 sort of less concretely measurable, like this ability to have lots of parallel development of different modules.
Speaker 2 And that's still a pretty exciting improvement, because I think that would enable us to
Speaker 2 make faster progress on improving the model's capabilities for lots of different distinct areas.
Speaker 3 I mean, even the data control modularity stuff seems like really cool because then you could have like the piece of the model that's just trained for me.
Speaker 2 It knows all my private data. Like a personal module for you would be useful.
Speaker 2 Another thing might be you can use certain data in some settings, but not in other settings.
Speaker 2 And maybe we have some YouTube data that's only usable in a YouTube product surface, but not in other settings. So we can have a module that is trained on that data for that particular purpose.
Speaker 3 We're going to need like a million automated researchers to invent all of this stuff.
Speaker 2 Yeah.
Speaker 2 It's got to be great.
Speaker 1 Well,
Speaker 1 the thing itself, you know, it's like you build the blob and it like tells you how to make the blob better.
Speaker 2 Blob 2.0.
Speaker 2
Or maybe they're not even versions. It's just an incrementally growing blob.
Yeah.
Speaker 1 Okay, Jeff.
Speaker 1 Motivate for me, big picture.
Speaker 1 Why is this a good idea? Why is this the next direction?
Speaker 2 Yeah, I mean, I guess this kind of like notion of an organic, like kind of
Speaker 2 not quite so carefully, mathematically constructed machine learning model is one that's been with me for a little while.
Speaker 2 And I feel like in the development of neural nets, like the biological analog, the artificial neurons, you know, inspiration from biological neurons is a good one and has served us well in the deep learning field.
Speaker 2 And we've been able to make a lot of progress with that. But I feel like we're not necessarily looking at other things that real brains do as much as we perhaps could.
Speaker 2 And that's not to say we should exactly mimic that because silicon and you know
Speaker 2 wetware have very different characteristics and strengths. But I do think one thing we could draw inspiration, more inspiration from is
Speaker 2 this notion of having different specialized portions, different areas of a model, of a brain, that are good at different things. We have a little bit of that in mixture-of-experts models, but it's still very structured.
Speaker 2 And I feel like this kind of more organic growth of expertise, and when you want more expertise of that, you kind of add some more capacity to the model there and let it learn a bit more on that kind of thing.
Speaker 2 And also, this notion of like adapting the connectivity of the model to the connectivity of the hardware is a good one.
Speaker 2 So I think you want incredibly dense connections between artificial neurons on the same chip and in the same HBM, because that doesn't cost you that much.
Speaker 2 But then you want a smaller number of connections to
Speaker 2 nearby neurons. So like a chip away, you should have some amount of connections.
Speaker 2 And then many, many chips away, you should have a smaller number of connections, where you send over a very limited, bottleneck-y channel the most important things that this part of the model is learning
Speaker 2 for other parts of the model to make use of. And even across multiple TPU pods, you'd like to send even less information, but the most salient kind of representations.
Speaker 2 And then across metro areas, you'd like to send even less. Yeah.
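Not a real Gemini or Pathways mechanism, just a toy way to picture matching model connectivity to hardware topology: dense links within a chip and its HBM, and progressively narrower links across chips, pods, and metro areas, so only the most salient representations cross the slowest links. The bandwidth ratios are made up.

```python
# Illustrative connectivity budget scaled by (assumed) relative bandwidth per level.
RELATIVE_BANDWIDTH = {
    "same_chip":     1.0,
    "neighbor_chip": 0.1,
    "same_pod":      0.01,
    "cross_pod":     0.001,
    "cross_metro":   0.0001,
}

def exchange_width(d_model: int, level: str) -> int:
    """How many dimensions of a d_model-wide representation two regions of the
    model exchange at a given topology distance (never fewer than one)."""
    return max(1, int(d_model * RELATIVE_BANDWIDTH[level]))

d_model = 8192
for level in RELATIVE_BANDWIDTH:
    print(f"{level:>13}: exchange {exchange_width(d_model, level):>4} of {d_model} dims")
```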
Speaker 1 And then that emerges organically.
Speaker 2 Yeah, I'd like that to emerge organically. You could hand-specify these characteristics, but I don't think you know exactly what the right proportions of these kinds of connections are.
Speaker 2
And so you should just let the hardware dictate things a little bit. Like if you're communicating over here and this data always shows up really early, you should add some more connections.
Yeah.
Speaker 2 Then it'll make it take longer and show up at just the right time.
Speaker 1 Well, here's another interesting implication potentially. Right now we think about
Speaker 1 the growth in AI use as a sort of horizontal.
Speaker 1 Suppose you ask how many AI engineers Google will have working for it. You'd think of how many instances of Gemini 3 will be working at one time. But if you have this blob, whatever you want to call it, and it can organically decide how much of itself to activate, then it's more like: if you want 10 engineers' worth of output, it just activates a different, or larger, pattern.
Speaker 1 If you want 100 engineers' worth of output, it's not calling more agents or more instances. It's just calling different subsets of itself.
Speaker 2 I think there's a notion of like how much compute do you want to spend on this particular inference.
Speaker 2 And that should vary by like factors of 10,000 for really easy things and really hard things, maybe even a million.
Speaker 2 And it might be iterative, like right, you might make a pass through the model, get some stuff, and then decide you now need to call on some other parts of the model as another aspect of it.
Speaker 2 The other thing I would say is, like, this sounds super complicated to deploy because it's like this, this weird, you know, constantly evolving thing with maybe not super optimized ways of communicating
Speaker 2 between pieces, but you can always distill from that, right? Like, so if you say, this is the kind of task I really care about,
Speaker 2 let me distill from this giant kind of like organicky thing into something that I know can be served really efficiently.
Speaker 2 And you could do that distillation process, you know, whenever you want, once a day, once an hour.
Speaker 2 And that seems like it'd be kind of good.
Speaker 3
Yeah, we need better distillation. Yeah.
If anyone out there invents amazing distillation techniques that instantly distill from a giant blob onto your phone, that would be wonderful.
Speaker 1 How would you characterize what's missing from current distillation techniques?
Speaker 3 Well, I just want it to work faster.
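For readers who want the basic mechanism being referred to: below is a generic knowledge-distillation sketch with soft targets, in the spirit of periodically distilling a big, organically grown teacher into a small student that serves efficiently. The linear "teacher", the shapes, and the hyperparameters are toy placeholders, not any real model or production recipe.

```python
# Generic soft-target distillation loop with a toy linear teacher and student.
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
d, vocab, batch = 64, 100, 32
teacher_w = rng.normal(size=(d, vocab))     # stand-in for the giant blob
student_w = np.zeros((d, vocab))            # small model we actually want to serve
lr, T = 0.5, 2.0

losses = []
for step in range(2000):
    x = rng.normal(size=(batch, d))
    p_t = softmax(x @ teacher_w, T)         # teacher's softened distribution
    p_s = softmax(x @ student_w, T)         # student's softened distribution
    # Cross-entropy against soft targets; its gradient w.r.t. logits is p_s - p_t.
    losses.append(-np.mean(np.sum(p_t * np.log(p_s + 1e-9), axis=-1)))
    student_w -= lr * x.T @ (p_s - p_t) / batch
print(losses[0], losses[-1])                # the distillation loss should drop
```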
Speaker 2 Yeah. A related thing is I feel like we need interesting learning techniques during pre-training.
Speaker 2 Like I'm not sure we're extracting the maximal value from every token we look at with the current training objective.
Speaker 2 Maybe we should think a lot harder about some tokens. You know, when you get to "the answer is," maybe the model should do a lot more work at training time than when it gets to "the."
Speaker 2 Right, right.
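One simple way to picture "thinking harder about some tokens" is to weight the per-token training loss so content-bearing tokens count for more than filler. This is a hypothetical illustration only; the filler list and the 0.1 weight are invented, not anything from Gemini training.

```python
# Toy per-token loss weighting: filler tokens contribute less to the loss.
import numpy as np

FILLER = {"the", "a", "an", "of", "and", "to"}

def token_weights(tokens):
    return np.array([0.1 if t.lower() in FILLER else 1.0 for t in tokens])

def weighted_loss(per_token_nll, tokens):
    w = token_weights(tokens)
    return float((w * per_token_nll).sum() / w.sum())

tokens = ["the", "answer", "is", "42"]
per_token_nll = np.array([0.5, 2.0, 1.1, 3.2])   # stand-in losses from some model
print(weighted_loss(per_token_nll, tokens))       # the filler token barely contributes
```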
Speaker 3 Yeah,
Speaker 3 Right. There's got to be some way to get more from the same data, make it learn it forwards and backwards and every which way.
Speaker 2 Like hide some stuff this way, hide some stuff that way, make it infer from partial information, you know, these kinds of things.
Speaker 2 I think people have been doing this in vision models for a while.
Speaker 2 Like you distort the image or you hide parts of it, and try to make it guess that it's a bird from just the upper corner of the image or the lower-left corner of the image.
Speaker 2 And that makes the task harder.
Speaker 2 And I feel like there's an analog for kind of more textual or coding related data where you want to force the model to work harder and you'll get more interesting observations from it.
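A small sketch of that kind of harder objective for text or code: hide a span and train the model to reconstruct it from partial context, in the spirit of masked or denoising objectives (BERT- or T5-style span corruption). The masking policy below is a toy, not any production recipe.

```python
# Toy span-corruption objective: mask a contiguous span and predict it.
import random

def corrupt(tokens, span_len=3, mask_token="<MASK>", seed=0):
    """Replace one contiguous span with a mask token; return (inputs, span to predict)."""
    rng = random.Random(seed)
    start = rng.randrange(0, max(1, len(tokens) - span_len))
    inputs = tokens[:start] + [mask_token] + tokens[start + span_len:]
    targets = tokens[start:start + span_len]
    return inputs, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
inputs, targets = corrupt(tokens)
print(inputs)    # context with a hidden span the model must infer
print(targets)   # the span it is trained to reconstruct
```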
Speaker 3 Yeah, the image people didn't have enough labeled data, so they had to invent all this stuff.
Speaker 2 And I mean, dropout was invented on images, but we're mostly not using it for text.
Speaker 2 That's one way you could get a lot more learning into a larger-scale model without overfitting: just make like 100 epochs over the world's text data and use dropout.
Speaker 2 That's pretty computationally expensive, but it does mean we won't run out.
Speaker 2 Like, even though people are saying, oh no, we're almost out of like textual data, I don't really believe that because I think we can get a lot more
Speaker 2 capable models out of the text data that does exist.
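For reference, here is plain textbook inverted dropout, the regularizer being discussed: randomly zero activations during training and rescale the rest, so a model can take many more passes over the same text with less overfitting. It is shown standalone here rather than inside any real model.

```python
# Standard inverted dropout applied to a block of activations.
import numpy as np

def dropout(x, rate, rng, training=True):
    if not training or rate == 0.0:
        return x
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep
    return x * mask / keep   # rescale so expected activations are unchanged

rng = np.random.default_rng(3)
h = rng.normal(size=(4, 8))              # pretend hidden activations of a text model
print(dropout(h, rate=0.3, rng=rng).round(2))
```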
Speaker 3 I mean, like, a person has seen like a billion tokens.
Speaker 2 Yeah, and they're pretty good at a lot of stuff.
Speaker 2 Yes.
Speaker 1 Obviously, human data efficiency sets a lower bound on how, or I guess upper bound.
Speaker 2 One of them on how.
Speaker 2 It's an interesting data point. Yes.
Speaker 1 So there's a sort of modus ponens, modus tollens thing here. One way to look at it is: look, LLMs have so much further to go, therefore we should project orders of magnitude improvement in sample efficiency just from matching humans.
Speaker 1 Another is: maybe they're doing something clearly different, given the orders-of-magnitude difference.
Speaker 1 What's your intuition of what it would take to make these models as sample efficient as humans are?
Speaker 2 Yeah, I mean, I think we
Speaker 2
should consider changing the training objective a little bit. Like just predicting the next token from the previous ones you've seen seems like not how people learn.
Right.
Speaker 2 It's a little bit related to how people learn, I think, but not entirely.
Speaker 2 Like a person might read a whole chapter of a book and then try to answer questions at the back, and that's a kind of different kind of thing.
Speaker 2 I also think we're not learning from visual data very much.
Speaker 2 You know, we're training a little bit on video data, but we're definitely not anywhere close to thinking about training on all the visual inputs you could get.
Speaker 2 You know, so you have visual data that we haven't really begun to train on. And then I think we could extract a lot more information from every
Speaker 2 bit of data we do see. You know, I think one of the ways people are so sample efficient is they explore the world and take actions in the world and observe what happens.
Speaker 2
Right. Like you see it with very small infants, like picking things up and dropping them.
They learn about gravity from that.
Speaker 2 And that's a much much harder thing to learn when you're not initiating the action.
Speaker 2 And I think having a model that can take actions as part of its learning process would be just a lot better than just sort of passively observing a giant data set.
Speaker 1 Is Gato the future, then?
Speaker 2 Something where the model can observe and take actions and observe the corresponding results seems pretty useful.
Speaker 2 I mean,
Speaker 3 People can learn a lot from thought experiments that don't even involve extra input.
Speaker 3 Einstein learned a lot of stuff from thought experiments, or Newton went into quarantine and got an apple dropped on his head or something and invented gravity.
Speaker 3 And mathematicians, you know, math didn't have any extra input.
Or chess: you have the thing play chess against itself and it gets good at chess.
Speaker 3 That was DeepMind, but all it needs is the rules of chess. So there's actually probably
Speaker 3 somehow a lot of learning that you can do
Speaker 3
even without external data. Yeah.
And then you can make it in exactly the
Speaker 3 fields that you care about.
Speaker 3 You know, of course, there's learning that will require external data, but probably maybe we can just have this thing talk to itself and make itself smarter.
Speaker 1 So here's a question I have.
Speaker 2 Yeah.
Speaker 1 What you've just laid out over the last hour is potentially just
Speaker 1 the big next
Speaker 1 paradigm shift in AI that's like a tremendously valuable insight potentially.
Speaker 1 So, you know, in 2017 you released the Transformer paper, on which tens if not hundreds of billions of dollars of market value at other companies is based, not to mention all the other research Google has released over time, which you've been relatively generous with.
Speaker 1 In retrospect, when you think about divulging this information that has been helpful to your competitors, is it like, yeah, we'd still do it? Or would you be like, oh, we didn't realize how big a deal it was, we should have kept it in-house?
Speaker 1 How do you think about that?
Speaker 3 It's a good question. Because I think we probably did need to see the size of the opportunity, often reflected in what other companies are doing. And also, it's not a fixed pie. The current state of the world is pretty much as far from a fixed pie as you can get. I think we're going to see orders-of-magnitude improvements in GDP, health, well-being, more than anything else you can think of.
Speaker 3 So, you know, I think it's definitely been nice that the Transformer has gotten around and spread.
Speaker 3 Thank God Google's doing well as well. So these days we do publish a little less of what we're doing. But
Speaker 2 yeah, I mean, I think there's always this trade-off
Speaker 2 of, you know, should we publish exactly what we're doing right away? Should we put it in
Speaker 2 the next stages of research and then roll it out into like production Gemini models and not publish it at all? Or is there some intermediate point?
Speaker 2 And for example, in our computational photography work in Pixel cameras, we've often made the decision to develop interesting new techniques, like the ability to do super good Night Sight imaging for low-light situations, put that into the product, and then publish a real research paper about the system that does it after the product is released.
Speaker 2 And I think, you know, different techniques and
Speaker 2 developments
Speaker 2 have different treatments, right? Like, so some things we think are super critical, we might not publish.
Speaker 2 Some things we think are really interesting, but important for improving our products, we'll get them out into our products and then make a decision.
Speaker 2 You know, do we publish this or do we give kind of a lightweight discussion of it, but maybe not every last detail?
Speaker 2 And then other things I think we publish openly and try to advance the field and the community, because that's how we all benefit from participating.
Speaker 2 You know, I think it's great to go to conferences like NeurIPS last week, with like 15,000 people all sharing lots and lots of great ideas. And we published a lot of papers there
Speaker 2 as we have in the past. And you know, see the field advance is super exciting.
Speaker 1 How would you account for it?
Speaker 1 So obviously, Google had all these insights internally
Speaker 1 rather early on,
Speaker 1 including the top researchers. And
Speaker 1 now, as of 2024,
Speaker 1 Gemini 2 is out. We didn't get much of a chance to talk about it,
Speaker 1 but people will know it's a really great model. That's what I used to prepare for this.
Speaker 2 As we say around the micro kitchen, such a good model. Such a good model.
Speaker 1
So it's top of the LMSYS Chatbot Arena. And so now Google's on top.
But how would you account for it? Google basically came up with all the great insights,
Speaker 1 and yet for a couple of years, other competitors had models that were better, despite that.
Speaker 2 Would you take us out? Sure. I mean, I think,
Speaker 2 yeah, we've been working on language models for a long time. You know,
Speaker 2 Noam's early work on spelling correction in 2001,
Speaker 2 the work on translation and very large-scale language models in 2007, and seq2seq and Word2Vec, and more recently Transformers, and then BERT, and
Speaker 2 things like the internal Meena system, which was actually a chatbot system designed to engage people in interesting conversations.
Speaker 2 We actually had an internal chatbot system that Googlers could play with even before ChatGPT came out.
Speaker 2 And during the pandemic, when everyone was locked down at home, a lot of Googlers would enjoy spending time chatting with Meena during lunch, because it was a nice, engaging conversation partner.
Speaker 2 And I think one of the things was that our view from a search perspective was: these models hallucinate a lot, and they don't get things right a lot of the time.
Speaker 2 And that means they aren't as useful as they could be, so we'd like to make that better.
Speaker 2 From a search perspective, you want to get the right answer ideally 100% of the time; you want to be very high on factuality. And these models were nowhere near that bar.
Speaker 2 But I think what we were a little unsure about was how incredibly useful they were.
Speaker 2 Oh, and they also had all kinds of safety issues, like they might say offensive things, and you had to work on that aspect and get that to a point where we felt comfortable releasing the model.
Speaker 2 But I think what we kind of didn't quite appreciate was how useful they could be for things you wouldn't ask a search engine, right?
Speaker 2 Like, help me write a note to my veterinarian, or, you know, can you take this text and give me a quick summary of it, or whatever.
Speaker 2 And I think that's the kind of thing we've seen people really, you know, flock to in terms of using chatbots as amazing new capabilities rather than as a pure search engine.
Speaker 2 And so I think, you know, we took our time and got to the point where we actually released quite capable chatbots, and we've been improving them through Gemini models
Speaker 2 quite a bit. And I think that's actually not a bad path to have taken.
Speaker 2
Would we like to have released a chatbot earlier, maybe? But I think we have a pretty awesome chatbot with awesome Gemini models that are getting better all the time. And that's pretty cool.
Yeah.
Speaker 1 So we've discussed some of the things you guys have worked on over the last 25 years. And
Speaker 1 there are so many different fields, right? You start off with search and indexing, to distributed systems, to hardware, to AI algorithms. And genuinely, there are like a thousand more.
Speaker 1 Just go on either of their Google Scholar pages or something.
Speaker 1 What is the trick to having this level of not only career longevity, where you have many decades of making breakthroughs, but also this breadth across different fields?
Speaker 1 Both of you can answer, in either order.
Speaker 1 What's the trick to career longevity and breadth?
Speaker 2 Yeah, I mean, I think one thing that I like to do is to find out about new and interesting areas.
Speaker 2 And one of the best ways to do that is to pay attention to what's going on, talk to colleagues, like pay attention to research papers that are being published, look at the kind of research landscape as it's evolving,
Speaker 2 be willing to say, oh,
Speaker 2 you know, chip design. I wonder if we could use reinforcement learning for some aspect of that and be able to dive into
Speaker 2 a new area, work with people who know a lot about a different domain
Speaker 2 Or health: AI for healthcare is something I've done a bit of already, you know, working with clinicians on what the real problems are and how AI could help.
Speaker 2 You know, it wouldn't be that useful for this thing, but it would be super useful for this, getting those insights.
Speaker 2 And often working with like a set of five or six colleagues who have different expertise than you do
Speaker 2 enables you to collectively do something that none of you could do individually. And then some of their expertise rubs off on you and some of your expertise rubs off on them.
Speaker 2 And now you have like this bigger set of tools in your tool belt as an engineer and researcher to go tackle the next thing.
Speaker 2 And I think that's one of the beauties of continuing to learn on the job. It's something I treasure, and I really enjoy diving into new things and seeing what we can do.
Speaker 2 I'd say probably a big thing is humility. Like, I'd say I'm the most humble person ever.
Speaker 2 But seriously, it's being able to say, hey, what I just did is nothing compared to what I can do or what can be done. And being able to drop an idea as soon as you see something better, like you hear somebody with some better idea, and you see how maybe what you're thinking about, what they're thinking about, or something totally different could conceivably work better.
Because I think
Speaker 2 there is a drive in some sense to say, hey, the thing I just invented is awesome, like give me more chips,
Speaker 2 particularly if there's a lot of top-down resource assignment. But
Speaker 2 I think we also need to incentivize people to say, hey, this thing I am doing is not working at all,
Speaker 2 let me just drop it completely and try something else. Which I think Google Brain did quite well with: we had a very bottoms-up, UBI kind of chip allocation, where
Speaker 2 basically everyone had one credit and you could pool them.
Speaker 2 And then Gemini, I mean, has been mostly top-down, which has been very good in some sense, because it has led to a lot more collaboration and people working together.
Speaker 2 You less often have five groups of people all building the same thing or building interchangeable things.
Speaker 2 But on the other hand, it does create some incentive to say, hey, what I'm doing is working great. And then, as a lead, you hear from hundreds of groups that everything is going amazingly,
Speaker 2 so you should give them more chips, and there's less of an incentive to say, hey, what I'm doing is not actually working that well, let me try something different. So I think going forward we're going to have some amount of top-down and some amount of bottom-up, so as to incentivize both of these behaviors, collaboration and flexibility, because I think both of those things lead to a lot of innovation. Yeah, I think it's also good to articulate interesting directions you think we should go.
Speaker 2 And, you know,
Speaker 2 I have an internal slide deck called Go Jeff Wacky Ideas
Speaker 2 that I think is like, those are a little bit more product-y oriented things: hey, I think now that we have these capabilities, we could do these, you know, 17 things.
Speaker 2 And I think that's a good thing, because sometimes people get excited about that and want to start working with you on
Speaker 2 one or more of them. And I think that's a good way to kind of bootstrap, you know,
Speaker 2
where we should go without necessarily ordering people we must go here. Yeah.
Speaker 1 Hey, this is great. Thank you. I appreciate you taking the time. It was a great, great chat.