![Will scaling work? [Narration]](https://substackcdn.com/feed/podcast/69345/post/140840266/05a86be7f4b61d2bdf484bbf5ea3c9c3.jpg)
Will scaling work? [Narration]
This is a narration of my blog post, Will scaling work?.
You can read the full post here: https://www.dwarkeshpatel.com/p/will-scaling-work
Full Transcript
Hey everyone, this is a narration of a blog post I wrote called Will Scaling Work? You can find the full version on my website, dwarkeshpatel.com. It was originally published December 26th, 2023.
Will Scaling Work? When should we expect AGI? If we can keep scaling LLMs++ and get better and more general performance as a result, then there's reason to expect powerful AIs by 2040 or much sooner, which can automate most cognitive labor and speed up further AI progress. However, if scaling doesn't work, then the path to AGI seems much longer and more intractable, for reasons I explain in the post.
In order to think through both the pro and the con arguments about scaling, I wrote the post as a debate between two characters I made up, Believer and Skeptic.
When will we run out of data?
Skeptic.
We're about to run out of high-quality language data next year. Even taking hand-wavy scaling curves seriously implies that we'll need 1e35 FLOPs for an AI that is reliable and smart enough to write a scientific paper.
And that's table stakes for the abilities an AI would need to automate further AI research and continue progress when scaling becomes infeasible. Which means we need 5 OOMS, that is, orders of magnitude, more data than we seem to have.
I'm worried that when people hear 5 OOMS off, how they register it is, oh, we have 5x less data than we need, we just need a couple of 2x improvements in data efficiency, and we're golden. After all, what's a couple of ooms between friends? No, 5 ooms off means we have 100,000 times less data than we need.
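To make the gap concrete, here is a minimal sketch of the arithmetic. The function names are only illustrative; the inputs are the 5 OOMs figure and the 2x improvement size mentioned above.

```python
import math

def ooms_to_factor(ooms: float) -> float:
    """Convert orders of magnitude into a plain multiplicative factor."""
    return 10 ** ooms

def doublings_needed(ooms: float) -> float:
    """How many successive 2x improvements it takes to cover the gap."""
    return math.log2(ooms_to_factor(ooms))

shortfall = ooms_to_factor(5)
print(f"5 OOMs = {shortfall:,.0f}x")
print(f"2x improvements needed: {doublings_needed(5):.1f}")
```

A couple of 2x improvements covers well under one OOM. Closing five OOMs would take roughly seventeen doublings in a row.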
Yes, we will get slightly more data efficient algorithms, and multimodal training will give us more data, plus you can recycle tokens on multiple epochs and use curriculum learning. But even if we assume the most generous possible one-off improvements that these techniques are likely to give, they do not grant us the exponential increase in data required to keep up with the exponential increase in compute demanded by these scaling laws.
So then people say, we'll get self-play synthetic data working somehow. But self-play has two very difficult challenges.
One, evaluation. Self-play worked for AlphaGo since the model could judge itself based on a concrete win condition.
Did I win this game of Go? But novel reasoning doesn't have a concrete win condition. And as a result, just as you'd expect, LLMs are incapable so far of correcting their own reasoning.
Two, compute.
All these math/code approaches tend to use various sorts of tree search, where you run an LLM on each node repeatedly. AlphaGo's compute budget is staggering for the relatively circumscribed task of winning at Go.
Now imagine that instead of searching over the space of Go moves, you need to search over the space of all possible human thought. All this extra compute needed to get self-play to work is in addition to the stupendous compute increase already required to scale the parameters themselves.
Using the 1e35 FLOP estimate for human-level thought, we need 9 OOMs more compute atop the biggest models we have today. Yes, you'll get improvements from better hardware and better algorithms, but will you really get a full equivalent of 9 OOMs?
Believer.
If your main objection to scale working is just a lack of data, your intuitive reaction should not be, well, it looks like we could have produced AGI by scaling up a transformer++, but I guess we're going to run out of data first. Your reaction should be, holy fuck, if the internet was a lot bigger, scaling up a model whose basic structure I can write down in a few hundred lines of Python code would have produced a human-level mind.
It's a crazy fact about the world that it's this easy to make big blobs of compute intelligent. The data on which LLMs are sample-inefficient is mostly just irrelevant e-commerce junk.
We compound this disability by training them on predicting the next token, a loss function which is almost completely unrelated to the actual task we want an intelligent agent to do in the economy. And despite this minuscule intersection between the abilities we actually want and the terrible loss function and data we train these models with, we can produce a baby AGI, that is GPT-4, by throwing just 0.03% of Microsoft's yearly revenues at a big scrape of the internet.
So, given how easy and simple AI progress has been so far, we shouldn't be that surprised if synthetic data also just works.
After all, the models just want to learn.
GPT-4 has been out for all of eight months.
The other AI labs are only now getting their GPT-4 level models, which means all the researchers are only now getting around to making self-play work with current generation models.
And it seems like one of them might have already succeeded. Therefore, the fact that so far we don't have public evidence that synthetic data has worked at scale doesn't mean it can't.
After all, RL becomes much more feasible when your base model is capable enough to get the right answer at least some of the time. Because now you can reward that 1 in 100 times that the model accomplishes the chain of thought required for an extended math proof.
Or writes the 500 lines of code needed to complete a full pull request. Soon, your 1 in 100 success rate becomes 10 in 100, then 90 in 100.
Now you can try the 1,000-line pull request, and not only will the model sometimes succeed, but it will be able to critique itself when it fails. And so on.
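One way to see why a 1 in 100 success rate is already enough to work with: if you sample many attempts per problem, the chance that at least one of them succeeds (and can therefore be rewarded) is high. This is only a sketch, and it assumes attempts are independent, which real model samples are not guaranteed to be. The function name is illustrative.

```python
def p_at_least_one_success(p: float, n: int) -> float:
    """Chance that n independent attempts, each succeeding with
    probability p, produce at least one success."""
    return 1.0 - (1.0 - p) ** n

# The three per-attempt success rates mentioned above.
for p in (0.01, 0.10, 0.90):
    chance = p_at_least_one_success(p, 100)
    print(f"per-attempt rate {p:.2f}: "
          f"P(at least one success in 100 samples) = {chance:.3f}")
```

Under that independence assumption, 100 samples at a 1 in 100 rate yield at least one rewardable success about 63% of the time.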
In fact, this synthetic data bootstrapping seems almost directly analogous to human evolution. Our primate ancestors show little evidence of being able to rapidly discern and apply new insights.
But, once humans develop language, you have this genetic-cultural coevolution which is very similar to the synthetic data self-play loop for LLMs, where the model gets smarter in order to better make sense of the complex symbolic outputs of similar copies. Self-play doesn't require models to be perfect at judging their own reasoning.
They just have to be better at evaluating reasoning than at doing it de novo, which clearly already seems to be the case. See Constitutional AI for example.
Or play around with GPT for a few minutes and notice that it's better at explaining why what you wrote down is wrong than it is at coming up with the right answer for itself. Almost all the researchers I talk to in the big AI labs are quite confident that they'll get self-play to work.
And when I ask why they're so sure, they heave for a moment as if they're bursting to explain all their ideas. But then they remember that confidentiality is a thing and say, I can't tell you the specifics, but there's so much low-hanging fruit in terms of what we can try here.
Or as Dario Amodei, the CEO of Anthropic, told me on my podcast, and this is me asking the question: you mentioned that data is likely not to be the constraint. Why do you think that is the case? And Dario responding: There's various possibilities here, and for a number of reasons I shouldn't go into the details.
But there's many sources of data in the world, and there's many ways that you can also generate data. My guess is that this will not be a blocker.
Maybe it would be better if it was, but it won't be. Skeptic.
Constitutional AI, RLHF, and other RL self-play setups are good at bringing out latent capabilities or suppressing them when those capabilities are naughty, but no one has demonstrated a method to actually increase the model's underlying abilities with RL. If some kind of self-play synthetic data doesn't work, you're absolutely fucked.
There's no other way around the data bottleneck. A new architecture is extremely unlikely to provide a fix.
You would need a jump in sample efficiency much bigger than even LSTMs to transformers. And LSTMs were invented all the way back in the 90s.
So you need a bigger jump than we have gotten out of the past 20 years, when all the low-hanging fruit in deep learning has been most accessible. The vibes you're receiving from the people who have an emotional or financial interest in seeing LLMs scale can't substitute for the complete lack of evidence we have that RL can fix the many OOMs shortfall in data.
Furthermore, the fact that LLMs seem to need such a stupendous amount of data to get such mediocre reasoning indicates that they simply are not generalizing. If these models can't get anywhere close to human-level performance
with the data a human would see in 20,000 years, we should entertain the possibility that 2 billion years' worth of data also wouldn't do the trick. There's no amount of jet fuel that you can add to an airplane to make it reach the moon.
Next topic, has scaling actually even worked so far?
Believer. What are you talking about?
Performance on benchmarks has scaled consistently for 8 orders of magnitude. The trend in model loss has been precise down to many decimal places over million-fold increases in compute.
In the GPT-4 technical report, they say that they were able to predict the performance of the final GPT-4 model from models trained using the same methodology but using, at most, 10,000 times less compute than GPT-4. We should assume that a trend which has worked so consistently for the last 8 OOMs will be reliable for the next 8, and the performance which you would achieve from a further 8 OOM scale-up, or what in performance terms would be equivalent to an 8 OOM scale-up given the free performance boost we get from algorithmic and hardware progress, would likely result in models that are capable enough to speed up AI research.
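As a rough illustration of what predicting a big run from small ones looks like, here is a sketch that fits a pure power law, loss = a * compute^(-b), to a few small runs and extrapolates 10,000x further. Everything here is an assumption for illustration: the data points are synthetic, the constants are made up, and real scaling-law fits usually include an irreducible loss term that this omits. It is not the method from the GPT-4 report.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - b * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - slope * mx)
    return a, -slope  # b is the negated slope

def predict_loss(a, b, compute):
    """Evaluate the fitted power law at a given compute budget."""
    return a * compute ** (-b)

# Synthetic "small runs" generated from made-up constants (not real data).
TRUE_A, TRUE_B = 10.0, 0.05
small_runs = [1e18, 1e19, 1e20, 1e21]
observed = [TRUE_A * c ** (-TRUE_B) for c in small_runs]

a, b = fit_power_law(small_runs, observed)
big_run = small_runs[-1] * 10_000  # 10,000x more compute than the largest small run
print(f"fitted exponent b = {b:.4f}")
print(f"predicted loss at {big_run:.0e} FLOPs = {predict_loss(a, b, big_run):.4f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the constants exactly. Real runs are noisy, so the claim in the text is about how little that noise has mattered in practice.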
Skeptic. But of course, we don't actually care directly about performance on next token prediction.
The models already have humans beat on this loss function. We want to find out whether these
scaling curves on next token prediction actually correspond to true progress towards generality.
Believer. As you scale these models, their performance consistently and reliably improves on a broad range of tasks as measured by benchmarks like MMLU, BIG-bench, and HumanEval.
Skeptic.
But have you actually tried looking at a random sample of MMLU or BIG-bench questions? They are almost all just Google search first hit results. They are good tests of memorization, not of intelligence.
Here are some questions I picked randomly from MMLU. And remember, these are multiple choice.
The model just has to choose the right answer from a list of four.
Question. Which of the following is always true of a spontaneous process?
Answer. The total entropy of the system plus surroundings increases.
Question. Who was president of the United States when Bill Clinton was born?
Answer. Harry Truman.
Now, why is it impressive that a model trained on internet text full of random facts happens to have a lot of random facts memorized? And why does that in any way indicate intelligence or creativity? And even on these contrived and orthogonal benchmarks, performance seems to be plateauing.
Google's new Gemini Ultra model is estimated to have almost 5x more compute than GPT-4, but it has performed almost equivalently on MMLU and BIG-bench and other standard benchmarks. In any case, common benchmarks don't at all measure long horizon task performance.
For example, can you do a job over the course of a month? That's where LLMs trained on next-token prediction have very few effective data points to learn from. Indeed, as we can see from their performance on SWE-bench, which measures if LLMs can autonomously complete pull requests, they're pretty terrible at integrating complex info over long time horizons.
GPT-4 gets a measly 1.7%, while Claude 2 gets a slightly more impressive 4.8%. So we seem to have two kinds of benchmarks.
The ones that measure memorization, recall, interpolation, and these are MMLU, BIG-bench, HumanEval, where these models are already appearing to match or even beat the average human. These tests clearly cannot be a good proxy for intelligence because even a scale maximalist has to admit that models are currently much dumber than humans.
And the other type of benchmark we have are the ones that truly measure the ability to autonomously solve problems across long time horizons or difficult abstractions. This is SWE-bench and ARC, where these models aren't even in the running.
What are we supposed to conclude about a model which, after being trained on the equivalent of 20,000 years of human input, still doesn't understand that if Tom Cruise's mother is Mary Lee Pfeiffer, then Mary Lee Pfeiffer's son is Tom Cruise, or whose answers are so incredibly contingent on the way and order in which the question is phrased? So it's not even worth asking yet whether scaling will continue to work.
We don't even seem to have evidence that scaling has worked so far. Believer.
Gemini just seems like a bizarre place to expect a plateau. GPT-4 has clearly already broken through all the pre-registered critiques of connectionism and deep learning by skeptics.
The much more plausible explanation for the performance of Gemini relative to GPT-4 is just that Google has not fully caught up to OpenAI's algorithmic progress. If there was some fundamental hard ceiling on deep learning and LLMs, shouldn't we have seen it before they started developing common sense, early reasoning, and the ability to think across abstractions? What is the prima facie reason to expect some stubborn limit only between mediocre reasoning and advanced reasoning? Consider how much better GPT-4 is than GPT-3.
That's just a 100x scale-up, which sounds like a lot until you consider how much smaller that is than the additional scale-up which we could throw at these models. We can afford a further 10,000x scale-up on GPT-4, i.e. something that's GPT-6 equivalent, before we even touch 1% of world GDP.
And that's before we account for the pre-training compute efficiency gains, things like mixture of experts, flash attention, new post-training methods, RLAI, fine-tuning on chain of thought, self-play, etc., and hardware improvements.
Each of these will individually contribute as much to performance as you would have gotten from many ooms of raw scale-up, and they have consistently done so in the past. Add all these together, you can probably convert 1% of GDP into a GPT-8 level model.
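The generation numbers here follow a simple rule of thumb, sketched below. It assumes each GPT generation corresponds to about 2 OOMs (100x) of effective compute, which is taken from the GPT-3 to GPT-4 figure above. The function names are illustrative, and "effective" compute here lumps together raw spend and the algorithmic and hardware gains just described.

```python
def scale_factor(ooms: int) -> int:
    """Multiplicative scale-up corresponding to a number of OOMs."""
    return 10 ** ooms

def gpt_equivalent(ooms_beyond_gpt4: int, ooms_per_generation: int = 2) -> float:
    """Generation number reached from GPT-4, assuming a fixed
    number of OOMs of effective compute per generation."""
    return 4 + ooms_beyond_gpt4 / ooms_per_generation

for ooms in (4, 8):
    print(f"{scale_factor(ooms):,}x GPT-4 ~ GPT-{gpt_equivalent(ooms):g} equivalent")
```

On this rule, a 10,000x scale-up lands at GPT-6, and GPT-8 corresponds to 8 OOMs of effective compute, which is the 100 million times figure used for GPT-8 in this post.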
For context on how much societies are willing to spend on new general-purpose technologies:
1. British railway investment at its peak in 1847 was a staggering 7% of GDP.
2. In the five years after the Telecommunications Act of 1996 went into effect, telecommunications companies invested more than $500 billion, that's almost a trillion dollars in today's value, into laying fiber optic cable, adding new switches, and building wireless networks.
It's possible that GPT-8, aka a model which has the performance of a 100 million times scaled-up GPT-4, will be only slightly better than GPT-4.
But I don't understand why you would expect that to be the case, when we already see models figuring out how to think and what the world is like from far smaller scale-ups. You know the story from here.
Millions of GPT-8 copies coding up kernel improvements, finding better hyperparameters, giving themselves boatloads of high-quality feedback for fine-tuning, and so on. This makes it much cheaper and easier to develop GPT-9.
Extrapolate this all the way out to the singularity. Next topic, do models understand the world? Believer.
To predict the next token, an LLM has to teach itself all the regularities about the world which lead to one token following another. To predict the next paragraph in a passage from The Selfish Gene requires understanding the gene-centered view of evolution.
To predict the next passage in a new short story requires understanding the psychology of human characters. And so on.
If you train an LLM on code, it becomes better at reasoning in language. Now this is just a really stunning fact.
What this tells us is that the model has squeezed out some deep general understanding of how to think from reading a shit ton of code, that not only is there some shared logical structure between language and code, but that unsupervised gradient descent can extract this structure and make use of it to be able to better reason. Gradient descent tries to find the most efficient compression of its data.
The most efficient compression is also the deepest and most powerful. The most efficient compression of a physics textbook, the one that would likely help you predict how a truncated argument from that book is likely to proceed, is just a deeply internalized understanding of the underlying scientific explanations.
Skeptic. Intelligence involves, among other things, the ability to compress.
But the compression itself is not intelligence. Einstein is smart because he can come up with relativity.
But Einstein plus relativity is not a more intelligent system in the sense that seems meaningful to me. It doesn't make sense to say that Plato was an idiot compared to me plus my knowledge because he didn't have a modern understanding of biology or physics.
So if LLMs are just the compression made by another process, stochastic gradient descent, then I don't know why that tells us anything about the LLM's own ability to make compressions, and therefore why that tells us anything about the LLM's intelligence.
Believer. An airtight theoretical explanation for why scaling must keep working is not necessary for scaling to keep working.
We didn't develop a full understanding of thermodynamics until a century after the steam engine was invented. The usual pattern in the history of technology is that invention precedes theory, and we should expect the same of intelligence.
There's not some law of physics which says that Moore's law must continue. And in fact, there are always new practical hurdles which imply the end of Moore's law.
Yet every couple of years, researchers at TSMC, NVIDIA, Intel, etc., figure out how to solve these problems and give the decades-long trend an extra lease on life. You can do all this mental gymnastics about compute and data bottlenecks, the true nature of intelligence, and the brittleness of benchmarks.
Or you can just look at the fucking line. And the line here is a graphic that shows the transistor count over time, you know, the famous exponential growth of Moore's Law.
Conclusion. Alright, enough with the alter egos.
Here's my personal take. If you were a scale believer over the last few years, the progress we've been seeing would have just made more sense.
There is a story you can tell about how GPT-4's amazing performance can be explained by some idiom library or lookup table which will never generalize. But that's a story that none of the skeptics pre-registered.
As for the believers, you have people like Ilya, Dario, Gwern, etc. more or less spelling out the slow takeoff we've been seeing due to scaling as early as 12 years ago.
It seems pretty clear that some amount of scaling can get us a transformative AI. Which is to say, if you achieve the irreducible loss on these scaling curves, you've made an AI that's smart enough to automate most cognitive labor, including the labor required to make smarter AIs.
But most things in life are harder than in theory, and many theoretically possible things have just been intractably difficult for some reason or another. Fusion power, flying cars, nanotech, etc.
If self-play synthetic data doesn't work, then the models look fucked. You're never going to get anywhere near that platonic irreducible loss.
Also, the theoretical reason to expect scaling to keep working is murky, and the benchmarks on which scaling seems to lead to better performance have debatable generality. So, my tentative probabilities are:
70%: scaling plus algorithmic progress plus hardware advances will get us to AGI by 2040.
30%: the skeptics are right. LLMs, and anything even roughly in that vein, are fucked.
I'm probably missing some crucial evidence. The AI labs are simply not releasing that much research, since any insights about the science of AI would leak ideas relevant to building the AGI.
A friend who is a researcher at one of these labs told me that he misses his undergrad habit of winding down with a bunch of papers. Nowadays, nothing worth reading is published.
For this reason, I assume that the things I don't know would shorten my timelines. Also, for what it's worth, my day job is a podcaster.
But the people who could write a better post are prevented from doing so, either by confidentiality or opportunity cost. So give me a break and let me know what I missed in the comments.
Appendix. Here are some additional considerations.
I don't feel I understand these topics well enough to fully make sense of what they imply for scaling.
Will models get insight-based learning?
Believer. At a larger scale, models would just naturally develop more efficient meta-learning methods.
Grokking only happens when you have a large over-parameterized model and beyond the point at which you've trained it to be severely overfit on the data. Grokking seems very similar to how we learn.
We have intuitions and mental models of how to categorize new information, and over time, with new observations, those mental models themselves change. Gradient descent over such a large diversity of data will select for the most general and extrapolative circuits.
Hence, we get grokking. Eventually, we'll get insight-based learning.
Skeptic. Neural networks have grokking, but that's orders of magnitude less efficient than how humans actually integrate new explanatory insights.
You teach a kid that the sun is at the center of the solar system, and that immediately changes how he makes sense of the night sky. But you can't just feed a single copy of Copernicus into a model untrained on any astronomy, and have it immediately incorporate that insight into all relevant future outputs.
It's bizarre that the model has to hear information so many times in so many different contexts to grok the underlying concepts. Not only have models never demonstrated insight learning, but I don't see how such learning is even possible given the way we train neural networks with gradient descent.
We give them a bunch of very subtle nudges with each example, with the hope that enough such nudges will slowly push them atop the correct hill. Insight-based learning requires an immediate drag and drop from sea level to the top of Mount Everest.
Does primate evolution give evidence of scaling?
Believer. I'm sure you could find all sorts of these embarrassing fragilities in chimpanzee cognition, which are far more damning than the reversal curse. Doesn't mean there was some fundamental limit on primate brains that couldn't be fixed by a 3x scale-up plus some fine-tuning.
Indeed, as Suzana Herculano-Houzel has shown, the human brain has as many neurons as you'd expect a scaled-up primate brain with the mass of a human brain to have. Rodent and insectivore brains have much worse scaling laws.
Relatively bigger brain species in those orders have far fewer neurons than you would expect just from their brain mass. This suggests that there's some primate neural architecture that's really scalable in comparison to the brains of other species, analogous to how transformers have better scaling laws than LSTMs and RNNs.
Evolution learned, or at least stumbled upon, the bitter lesson when designing primate brains, and the niche in which primates were competing strongly rewarded marginal increases in intelligence. You have to make sense of all this data coming in from your binocular vision, your tool-using hands, and all these other smart monkeys who can talk to you.
All right, that's the full post. Thanks for listening. And again, you can find the full blog post and other posts at my website, dwarkeshpatel.com. All right, see you next time.