Francois Chollet, Mike Knoop - LLMs won’t lead to AGI - $1,000,000 Prize to find true solution

June 11, 2024 1h 33m


Here is my conversation with Francois Chollet and Mike Knoop on the $1 million ARC-AGI Prize they're launching today.

I did a bunch of socratic grilling throughout, but Francois’s arguments about why LLMs won’t lead to AGI are very interesting and worth thinking through.

It was really fun discussing/debating the cruxes. Enjoy!

Watch on YouTube. Listen on Apple Podcasts, Spotify, or any other podcast platform. Read the full transcript here.

Timestamps

(00:00:00) – The ARC benchmark

(00:11:10) – Why LLMs struggle with ARC

(00:19:00) – Skill vs intelligence

(00:27:55) – Do we need “AGI” to automate most jobs?

(00:48:28) – Future of AI progress: deep learning + program synthesis

(01:00:40) – How Mike Knoop got nerd-sniped by ARC

(01:08:37) – Million $ ARC Prize

(01:10:33) – Resisting benchmark saturation

(01:18:08) – ARC scores on frontier vs open source models

(01:26:19) – Possible solutions to ARC Prize

Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe

Transcript

Okay, today I have the pleasure to speak with Francois Chollet, who is an AI researcher at Google and creator of Keras. And he's launching a prize in collaboration with Mike Knoop, the co-founder of Zapier, who we'll also be talking to in a second.
A million dollar prize to solve the ARC benchmark that he created. So first question, what is the ARC benchmark and why do you even need this prize? Why won't the biggest LLM we have in a year be able to just saturate it?
Sure. So ARC is intended as a kind of IQ test for machine intelligence. And what makes it different from most LLM benchmarks out there is that it's designed to be resistant to memorization.
So if you look at the way LLMs work, they're basically this big interpolative memory. And the way you scale up their capabilities is by trying to cram as much knowledge and patterns as possible into them.
And by contrast, ARC does not require a lot of knowledge at all. It's designed to only require what's known as core knowledge, which is basic knowledge about things like elementary physics, objectness, counting, that sort of thing.
The sort of knowledge that any four-year-old or five-year-old possesses. But what's interesting is that each puzzle in ARC is novel.
It's something that you've probably not encountered before even if you've memorized the entire internet. And that's what makes ARC challenging for LLMs.
And so far, LLMs have not been doing very well on it. In fact, the approaches that are working well are more towards discrete program search, program synthesis.
So, first of all, I'll make a comment that I'm glad that, as a skeptic of LLMs, you have put out a benchmark yourself. Is it accurate to say that if the biggest model we have in a year is able to get 80% on this, then your view would be that we are on track to AGI with LLMs? How would you think about that?
Right. I'm pretty skeptical that we're going to see an LLM do 80% in a year. That said, if we do see it, you would also have to look at how this was achieved.
If you just train the model on millions or billions of puzzles similar to ARC, so that you're relying on the ability to have some overlap between the tasks that you train on and the tasks that you're going to see at test time, then you're still using memorization, right? And maybe it can work. You know, hopefully, ARC is going to be good enough that it's going to be resistant to this sort of attempt at brute forcing.
But, you know, you never know. Maybe it could happen.
I'm not saying it's not going to happen. ARC is not a perfect benchmark.
Maybe it has flaws. Maybe it could be hacked in that way.
So I guess I'm curious about what would GPT-5 have to do that you're very confident that, you know, it's on the path to AGI?
What would make me change my mind about LLMs is basically, if I start seeing a critical mass of cases where you show the model something it has not seen before, a task that's actually novel from the perspective of its training data, something that's not in the training data, and if it can actually adapt on the fly. And this is true for LLMs, but really this would catch my attention with any AI technique out there.
If I can see the ability to adapt to novelty on the fly to pick up new skills efficiently, then I would be extremely interested. I would think this is on the path to AGI.
So the advantage they have is that they do get to see everything. Maybe I'll take issue with how much they are relying on that.
But let's suppose that they are relying on that. Obviously, they're relying on it more than humans do. To the extent that they do have so much in distribution, to the extent that we have trouble distinguishing whether an example is in distribution or not.
Well, if they have everything in distribution, then they can do everything that we can do. Maybe it's not in distribution for us.
Why is it so crucial that it has to be out of distribution for them? You know, why can't we just leverage the fact that they do get to see everything?
Right. You're asking basically, what's the difference between actual intelligence, which is the ability to adapt to things you've not been prepared for, and pure memorization, like reciting what you've seen before.
And it's not just some semantic difference. The big difference is that you can never pre-train on everything that you might see at test time, right? Because the world changes all the time.
So it's not just the fact that the space of possible tasks is infinite, and even if you're trained on millions of them, you've only seen zero percent of the total space. It's also the fact that the world is changing every day.
This is why we, the human species, have developed intelligence in the first place. If there was such a thing as a distribution for the world, for the universe, for our lives, then we would not need intelligence at all.
In fact, many creatures, many insects, for instance, do not have intelligence. Instead, what they have is they have in their connectome, in their genes, hard-coded programs, behavioral programs that map some stimuli to appropriate response.
And they can actually navigate their lives, their environment, in a way that's very evolutionarily fit, without needing to learn anything. And, well, if our environment was static enough, predictable enough, what would have happened is that evolution would have found the perfect behavioral program, a hard-coded static behavioral program, and would have written it into our genes. We would have a hard-coded brain connectome.
And that's what we would be running on. But no, that's not what happened. Instead, we have general intelligence.
So we are born with extremely little knowledge about the world. But we are born with the ability to learn very efficiently and to adapt in the face of things that we've never seen before.
And that's what makes us unique. And that's what is really, really challenging to recreate in machines.
I want to rabbit hole on that a little bit. But before I do that, maybe I'm going to overlay some examples of what an ARC-like challenge looks like for the YouTube audience. But for people listening on audio, can you just describe what a sample ARC challenge would look like?
Sure. So one puzzle, it looks kind of like an IQ test puzzle. You've got a number of demonstration input-output pairs.
So one pair is made of two grids. So one grid shows you an input, and the second grid shows you what you should produce as a response to that input.
And you get a couple pairs like this to demonstrate the nature of the task, to demonstrate what you're supposed to do with your inputs. And then you get a new test input.
And your job is to produce the corresponding test output. You look at the demonstration pairs, and from that, you figure out what you're supposed to do, and you show that you've understood it on this new test pair.
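As a rough illustration of the puzzle format just described, here is a minimal Python sketch. The key names ("train", "test", "input", "output") are an assumption modeled on how the public ARC files are commonly laid out, and the task itself (mirror each row) is invented for illustration, not taken from the dataset.

```python
# Hypothetical toy task: a few demonstration pairs plus one test pair.
# Each grid is a list of rows; each cell holds one of 10 symbols (0-9).
task = {
    "train": [
        {"input": [[1, 0, 0], [2, 0, 0]], "output": [[0, 0, 1], [0, 0, 2]]},
        {"input": [[0, 3], [4, 0]], "output": [[3, 0], [0, 4]]},
    ],
    "test": [{"input": [[5, 0, 0, 0]], "output": [[0, 0, 0, 5]]}],
}


def mirror(grid):
    """Candidate rule a solver might guess: reverse every row."""
    return [row[::-1] for row in grid]


def fits_demonstrations(rule, task):
    """True if the rule reproduces every demonstration output."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])


def solve(rule, task):
    """Apply the rule to each test input to produce the answers."""
    return [rule(pair["input"]) for pair in task["test"]]


if fits_demonstrations(mirror, task):
    print(solve(mirror, task))
```

The hard part of a real puzzle is coming up with the rule from the demonstrations; checking and applying it, as above, is the easy part.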
And importantly, the sort of knowledge basis that you need in order to approach these challenges is just core knowledge. And core knowledge is basically the knowledge of what makes an object, basic counting, topology, symmetries, that sort of thing. So extremely basic knowledge. LLMs for sure possess such knowledge. Any child possesses such knowledge. And what's really interesting is that each puzzle is new, so it's not something that you're going to find elsewhere on the internet, for instance.
And that means that whether it's as a human or as a machine, every puzzle, you have to approach it from scratch. You have to actually reason your way through it.
You cannot just fetch the response from your memory.
So the core knowledge, one contention here is we are only now getting multimodal models who, because of the data they're trained on, are trained to do spatial reasoning.
Whereas obviously not only humans, but for billions of years of evolution, our ancestors have had to learn how to understand abstract physical and spatial properties and recognize the patterns there. And so one view would be that in the next year, as we gain models that are multimodal native, where that isn't just a sort of second-class add-on but the multimodal capability is a priority, they will understand these kinds of patterns because that's something they would see natively.
Whereas right now, what ARC sees is some JSON string of 100100, and it's supposed to recognize a pattern there. And even if you showed a human such a sequence of these kinds of numbers, they would have a challenge making sense of what kind of question you're asking.
So why wouldn't it be the case that as soon as we get multimodal models, which we're on the path to unlock right now, they're going to be so much better at ARC-type spatial reasoning?
That's an empirical question. So I guess we're going to see the answer within a few months.
But my answer to that is, you know, ARC grids, they're just discrete 2D grids of symbols. They're pretty small.
It's not like flattening an image as a sequence of pixels, for instance, where you get something that's actually very, very difficult to parse. That's not true for ARC, because the grids are very small and you only have 10 possible symbols. So these are 2D grids that are actually very easy to flatten as sequences, and transformers, LLMs, they're very good at processing sequences. In fact, you can show that LLMs do fine with processing ARC-like data by simply fine-tuning an LLM on some subset of the tasks and then trying to test it on small variations of these tasks. And you see that, yeah, the LLM can encode just fine solution programs for tasks that it has seen before.
So it does not really have a problem parsing the input or figuring out the program. The reason why LLMs don't do well on ARC is really just the unfamiliarity aspect.
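To make the point about flattening concrete, here is a small Python sketch of two ways a grid could be turned into a text sequence for a model. The row-per-line encoding and the helper names are assumptions chosen for illustration; the conversation does not say which encoding any particular system uses.

```python
import json

# A small grid using symbols 0-9 (hypothetical example).
grid = [[0, 0, 3], [0, 3, 0], [3, 0, 0]]


def flatten_rows(grid):
    """One text line per row, one digit per cell."""
    return "\n".join("".join(str(cell) for cell in row) for row in grid)


def unflatten_rows(text):
    """Inverse of flatten_rows; works because every symbol is one digit."""
    return [[int(ch) for ch in line] for line in text.split("\n")]


as_json = json.dumps(grid)   # the JSON-string view of the grid
as_rows = flatten_rows(grid)  # a more compact row-per-line view

print(as_json)
print(as_rows)
```

Because there are only 10 symbols and the grids are small, the sequence is short and the original 2D layout can be recovered exactly.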
The fact that each new task is different from every other task. You cannot memorize the solution programs in advance. You have to synthesize a new solution program on the fly for each new task, and that's really what LLMs are struggling with.
So before I do more devil's advocate, I just want to step back and explain why I'm especially interested in having this conversation. And obviously the million dollar ARC Prize, I'm excited to actually play with it myself.
And hopefully, like the Vesuvius Challenge, which was Nat Friedman's prize for decoding the scrolls that were buried by the volcano in the Herculaneum library, which was solved by a 22-year-old who was listening to the podcast, Luke Farritor. So hopefully somebody listening will find this challenge intriguing and find a solution.
And the reason is that I've had on recently a lot of people who are bullish on LLMs, and I've had discussions with them before interviewing you about how we explain the fact that LLMs don't seem to be natively performing that well on ARC. And I found their explanations somewhat contrived, and I'll try out some of the reasons on you.
But it is actually an intriguing fact that some of these problems are relatively straightforward for humans to understand, and they do struggle with them if you just input them natively.
All of them are very easy for humans.
Like any smart human should be able to do 90%, 95% on ARC.
A smart human.
A smart human. But even a five-year-old, so with very, very little knowledge, they could definitely do over 50%.
So let's talk about that because I agree that smart humans will do very well on this test, but the average human will probably do mediocre.
Not really. So we actually tried with average humans. They scored about 85%.
That was with Amazon Mechanical Turk workers, right?
That's right.
I honestly don't know the demographic profile of Amazon Mechanical Turk workers, but imagine just interacting with the platform that Amazon has set up to do remote work.
That's not the median human across the planet, I'm guessing. I mean, the broader point here being that we see this spectrum in humans, where humans obviously have AGI. But even within humans, you see a spectrum where some people are relatively dumber and they'll perform worse on IQ-like tests.
For example, Raven's Progressive Matrices. If you look at how the average person performs on that and you look at the kind of questions that are sort of hit or miss, where half of people will get it right and half of people will get it wrong, some of them are pretty trivial. For us, we might think, this is kind of trivial.
And so humans have AGI, but from relatively small tweaks, you can go from somebody who misses these kinds of basic IQ test questions to somebody who gets them all right. Which suggests that actually, if these models are doing natively... we'll talk about some of the previous performances that people have tried with these models, but Jack Cole with a 240 million parameter model got 35%. Doesn't that suggest that they're on this spectrum that clearly exists within humans and they're going to be saturated pretty soon?
Yeah, so there's a bunch of interesting points here.
So there is indeed a branch of LLM approaches spearheaded by Jack Cole that are doing quite well, that are in fact state-of-the-art. But you have to look at what's going on there.
So there are two things. The first thing is that to get these numbers, you need to pre-train your LLM on millions of generated ARC tasks. And of course, if you compare that to a five-year-old child looking at ARC for the first time, the child has never done an IQ test before.
Has never seen something like an ARC task before. The only overlap between what they know and what they have to do in the test is core knowledge, is knowing about counting and objects and symmetries and things like that.
And still, they're going to do really well. And they're going to do much better than the LLM trained on millions of similar tasks.
And the second thing to note about the Jack Cole approach is that one thing that's really critical to making the model work at all is test-time fine-tuning. And that's something that's really missing, by the way, from LLM approaches right now. You know, most of the time when you're using an LLM, it's just doing static inference.
The model is frozen and you're just prompting it and then you're getting an answer. So the model is not actually learning anything on the fly.
Its state is not adapting to the task at hand. And what Jack Cole is actually doing is that for every test problem, he is fine-tuning on the fly a version of the LLM for that task.
And that's really what's unlocking performance. If you don't do that, you get like 1%, 2%.
So basically something completely negligible. And if you do test time fine tuning and you add a bunch of tricks on top, then you end up with interesting performance numbers.
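The difference between static inference and test-time fine-tuning can be sketched with a deliberately tiny stand-in. This is not Jack Cole's method: the "model" below is a single scalar weight, and the task, learning rate, and step count are all hypothetical choices for illustration.

```python
def predict(w, x):
    """Toy 'model': a single weight applied to the input."""
    return w * x


def fine_tune(w, demos, lr=0.05, steps=200):
    """Gradient descent on mean squared error over one task's demonstrations."""
    for _ in range(steps):
        grad = sum(2 * (predict(w, x) - y) * x for x, y in demos) / len(demos)
        w -= lr * grad
    return w


base_w = 1.0                      # stands in for frozen pretrained weights
demos = [(1.0, 3.0), (2.0, 6.0)]  # new task whose hidden rule is y = 3x
test_x, test_y = 4.0, 12.0

frozen_pred = predict(base_w, test_x)   # static inference: weights never change
adapted_w = fine_tune(base_w, demos)    # per-task copy adapted on the fly
adapted_pred = predict(adapted_w, test_x)

print(frozen_pred, adapted_pred, test_y)
```

The frozen weight answers the new task with whatever it learned beforehand, while the per-task copy is updated on the demonstration pairs before it answers the test input.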
So I think what it's doing is trying to address one of the key limitations of LLMs today, which is the lack of active inference. It's adding active inference to LLMs.
And that's working extremely well, actually.
So that's fascinating to me.
There's so many interesting rabbit holes there. Should I take them in sequence or deal with them all at once? Let me just start.
So the point you made about the fact that you need to unlock the adaptive compute slash test-time compute. A lot of the scaling maximalists, and I think this will be an interesting rabbit hole to explore with you, have your broader perspective in the sense that they think that in addition to scaling, you need these kinds of things like unlocking adaptive compute or doing some sort of RL to get the system working. And their perspective is that this is a relatively straightforward thing that will be added atop the representations that a scaled-up model has greater access to.
No, it's not just a technical detail. It's not a straightforward thing.
It is everything. It is the important part.
And the scaling maximalist argument, it boils down to, you know, these people, they refer to scaling laws, which is this empirical relationship that you can draw between how much compute you spend on training a model and the performance you're getting on benchmarks, right? And the key question here, of course, is, well, how do you measure performance? What is it that you're actually improving by adding more compute and more data? And well, it's benchmark performance, right? And the thing is, the way you measure performance is not a technical detail. It's not an afterthought because it's going to narrow down the set of questions that you're asking.
And so accordingly, it's going to narrow down the set of answers that you're looking for. If you look at the benchmarks we're using for LLMs, they're all memorization-based benchmarks.
Like sometimes they're literally just knowledge-based, like a school test. And even if you look at the ones that are explicitly about reasoning, you realize if you look closely that in order to solve them, it's enough to memorize a finite set of reasoning patterns. And then you just reapply them.
They're like static programs. LLMs are very good at memorizing static programs, small static programs.
And they've got this sort of like bank of solution programs. And when you give them a new puzzle, they can just fetch the appropriate program, apply it, and it's looking like it's reasoning.
But really it's not doing any sort of on-the-fly program synthesis. All it's doing is program fetching.
So you can actually solve all these benchmarks with memorization. And so what you're scaling up here, like if you look at the models, they are big parametric curves fitted to a data distribution via gradient descent.
So they're basically these big interpolative databases, interpolative memories. And of course, if you scale up the size of your database and you cram into it more knowledge, more patterns and so on, you are going to be increasing its performance as measured by a memorization benchmark.
That's kind of obvious. But as you're doing it, you are not increasing the intelligence of the system one bit.
You are increasing the skill of the system. You are increasing its usefulness, its scope of applicability, but not its intelligence because skill is not intelligence.
And that's the fundamental confusion that people run into is that they're confusing skill and intelligence. Yeah, there's a lot of fascinating things to talk about here.
So skill, intelligence, interpolation. I mean, OK, so the thing about they're fitting some manifold that maps the input data. There's a reductionist way to talk about what happens in the human brain that says that it's just axons firing at each other. But we don't care about the reductionist explanation of what's happening. We care about what happens at the macroscopic level when these things combine. As far as the interpolation goes, so, okay, let's look at one of the benchmarks here.
There's one benchmark that does grade school math. And these are problems that like a smart high schooler would be able to solve.
It's called GSM8K. And these models get 95% on these.
Like basically, they always nail it.
That's the memorization benchmark.
Okay, let's talk about what that means. So here's one question from that benchmark.
So 30 students are in a class. One fifth of them are 12-year-olds.
One third are 13-year-olds. One tenth are 11-year-olds.
How many of them are not 11, 12, or 13 years old? So I agree, like this is not rocket science, right? You can write down on paper how you go through this problem, and a high school kid, at least a smart high school kid, should be able to solve it. Now, when you say memorization, it still has to reason through how to think about fractions and what is the context of the whole problem and then combining the different calculations it's doing.
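For reference, the arithmetic in that question can be worked through in a few lines of Python. The variable names are illustrative only; the numbers come straight from the problem as read out.

```python
from fractions import Fraction

students = 30
# Share of the class at each age, as stated in the problem.
shares = {12: Fraction(1, 5), 13: Fraction(1, 3), 11: Fraction(1, 10)}

# Head count per age: 30 * 1/5, 30 * 1/3, 30 * 1/10.
counted = {age: int(share * students) for age, share in shares.items()}

# Everyone who is not 11, 12, or 13.
others = students - sum(counted.values())

print(counted, others)
```

That is 6 twelve-year-olds, 10 thirteen-year-olds, and 3 eleven-year-olds, which leaves 11 students of other ages.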
It depends how you want to define reasoning. But there are two definitions you can use.
So one is I have available a set of program templates. It's like the structure of the puzzle, which can also generate its solution.
And I'm just going to identify the right template, which is in my memory. I'm going to input the new values into the template, run the program, get the solution.
And you could say this is reasoning. And I say, yeah, sure, OK.
But another definition you can use is reasoning is the ability to, when you're faced with a puzzle, given that you don't have already a program in memory to solve it, you must synthesize on the fly a new program based on bits and pieces of existing programs that you have. You have to do on the fly program synthesis.
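The two definitions can be contrasted in a small Python sketch. Everything here is hypothetical: the template bank, the primitive operations, and the brute-force search are illustrative stand-ins for "fetching a memorized program" and "synthesizing a program from building blocks", not a description of how any real system works.

```python
from itertools import product

# Definition 1: fetch a memorized template and plug in new values.
TEMPLATES = {"scale_then_add": lambda a, b, x: a * x + b}


def fetch_and_apply(name, **values):
    return TEMPLATES[name](**values)


# Definition 2: no stored program fits, so compose one from building blocks.
PRIMITIVES = {
    "reverse": lambda xs: xs[::-1],
    "sort": sorted,
    "drop_first": lambda xs: xs[1:],
    "double": lambda xs: [2 * x for x in xs],
}


def synthesize(examples, max_depth=3):
    """Brute-force search for a chain of primitives consistent with all examples."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def program(xs, names=names):
                for name in names:
                    xs = PRIMITIVES[name](xs)
                return xs
            if all(program(list(inp)) == out for inp, out in examples):
                return names, program
    return None


fetched = fetch_and_apply("scale_then_add", a=2, b=1, x=5)

# Demonstrations of an unseen rule: sort, drop the smallest, double the rest.
examples = [([3, 1, 2], [4, 6]), ([5, 4, 9], [10, 18])]
names, program = synthesize(examples)
print(fetched, names, program([9, 7, 8]))
```

Fetching is a single lookup, while the search space for synthesis grows exponentially with program length, which is one way to see why the second definition is the harder one.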
And it's actually dramatically harder than just fetching the right memorized program and reapplying it.
So I think maybe we are overestimating the extent to which humans are so sample efficient.
They also need training in this way, where they have to drill in these kinds of pathways of reasoning through certain kinds of problems. So let's take math, for example. It's not like you can just show a baby the axioms of set theory and now they know math, right? So when they're growing up, you have to do years of teaching them pre-algebra, then you've got to do a year of teaching them doing drills and going through the same kind of problem in algebra, then geometry, pre-calculus, calculus.
Absolutely. So training.
Yeah. But isn't that like the same kind of thing, where you can't just see one example and now you have the program or whatever? You actually had to drill it.
These models also had to drill with a bunch of pre-training data.
Sure. I mean, in order to do on-the-fly program synthesis, you actually need building blocks to work from. So knowledge and memory are actually tremendously important in the process.
I'm not saying it's memory versus reasoning. In order to do effective reasoning, you need memory.
But it sounds like it's compatible with your story that through seeing a lot of different kinds of examples, these things can learn to reason within the context of those examples. And we can also see it within bigger and bigger models.
So that was an example of a high school level math problem. Let's say a model that's smaller than GPT-3 couldn't do that at all.
As these models get bigger, they seem to be able to pick up bigger and bigger...
It's not really a size issue. It's more like a training data issue in this case.
Well, bigger models can pick up these kinds of circuits, which smaller models apparently don't do a good job of, even if you were to train them on this kind of data.
Doesn't that just suggest that as you have bigger and bigger models, they can pick up bigger and bigger pathways or more general ways of reasoning?
Absolutely.
But then isn't that intelligence?
No, no, it's not.
If you scale up your database and you keep adding to it more knowledge, more program templates, then sure, it becomes more and more skillful. You can apply to more and more tasks.
But general intelligence is not task-specific skill scaled up to many skills. Because there is an infinite space of possible skills.
General intelligence is the ability to approach any problem, any skill, and very quickly master it using very little data. Because this is what makes you able to face anything you might ever encounter.
This is the definition of generality. Like, generality is not specificity scaled up.
It is the ability to apply your mind to anything at all, to arbitrary things. And this requires, fundamentally, this requires the ability to adapt, to learn on the fly efficiently.
So my claim is that by doing this pre-training on bigger and bigger models, you are gaining that capacity to then generalize very efficiently. Let me give you an example.
So your own company, Google, in their paper on Gemini 1.5, they had this very interesting example where they would give the model, in context, the grammar book and the dictionary of a language that has less than 200 living speakers.
So it's not in the pre-training data, and you just give them the dictionary and it basically is able to speak this language and translate to it, including the complex and organic ways in which languages are structured. So a human, if you showed me a dictionary from like English to Spanish, I'm not going to be able to pick up how to structure sentences and how to say things in Spanish.
The fact that, because of the representations that it has gained through this pre-training, it is able to now extremely efficiently learn a new language, doesn't that show that this kind of pre-training actually does increase your ability to learn new tasks?
If you were right, LLMs would do really well on ARC puzzles because ARC puzzles are not complex.
Each one of them requires very little knowledge. Each one of them is very low on complexity.
You don't need to think very hard about it. They're actually extremely obvious for humans.
Like even children can do them. But LLMs cannot.
Even LLMs that have 100,000 times more knowledge than you do, they still cannot. And the only thing that makes ARC special is that it was designed with this intent to resist memorization.
This is the only thing, and this is the huge blocker for LLM performance. And so, you know, I think if you look at LLMs closely, it's pretty obvious that they're not really synthesizing new programs on the fly to solve the tasks that they're faced with.
They're very much reapplying things that they've stored in memory. For instance, one thing that's very striking is LLMs can solve a Caesar cipher, you know, like transposing letters to code a message.
And well, that's a very complex algorithm, right? But it comes up quite a bit on the internet. So they've basically memorized it.
And what's really interesting is that they can do it for a transposition length of like three or five, because those are very, very common numbers in examples provided on the internet. But if you try to do it with an arbitrary number like 9, it's going to fail.
Because it does not encode the generalized form of the algorithm, but only specific cases. It has memorized specific cases of the algorithm.
And if it could actually synthesize on the fly the solver algorithm, then the value of n would not matter at all. Because it does not increase the problem complexity.
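To see why the value of n should not matter, here is a minimal Python sketch of the generalized shift cipher. The function name and the choice to leave non-letters untouched are assumptions made for this example.

```python
import string


def caesar(text, n):
    """Shift each ASCII letter n places forward, wrapping around the alphabet."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + n) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + n) % 26 + ord("A")))
        else:
            out.append(ch)  # leave spaces and punctuation as they are
    return "".join(out)


message = "Hello, World!"
for n in (3, 5, 9):
    encoded = caesar(message, n)
    decoded = caesar(encoded, -n)  # decoding is just shifting back
    print(n, encoded, decoded)
```

The same few lines handle a shift of 3, 5, or 9 with no extra work, which is the sense in which the shift size does not increase the problem complexity.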
I think this is true of humans as well. What was the study that...
Humans use memorization pattern matching all the time, of course, but humans are not limited to memorization pattern matching. They have this very unique ability to adapt to new situations on the fly.
This is exactly what enables you to navigate every new day in your life. I'm forgetting the details, but there was some study that chess grandmasters will perform very well within the context of the moves that...
Excellent example, because chess at the highest level is all about memorization. Chess is memorization.
Okay, sure. We can leave that aside.
What is your explanation for the original question of why, in context, Gemini 1.5 was able to learn a language, including the complex grammar structure? Doesn't that show that they can pick up new knowledge?
I would assume that it has simply mined from its extremely extensive, unimaginably vast training data.
It has mined the required template and then it's just reusing it. We know that LLMs have very poor ability to synthesize new program templates like this on the fly or even adapt existing ones.
They're very much limited to fetching.
Suppose there's a programmer at Google. They go into the office in the morning.
At what point are they doing something that 100% cannot be due to fetching some template, something that, supposing they were an LLM, they could not do by fetching some template from their program?
Like at what point do they have to use this so-called extreme generalization capability?
Forget about Google software developers. Every human, every day of their lives, is full of novel things that they've not been prepared for.
You cannot navigate your life based on memorization alone. It's impossible.
I'm sort of denying the premise. You also agree they're not doing, like, quote unquote memorization. It seems like you're saying they're less capable of generalization. But I'm just curious about the kind of generalization they do. If you get into the office and you try to do this kind of generalization, you're going to fail at your job. What is the first point? You're a programmer. What is the first point when you try to do that generalization, where you would lose your job because you can't do the extreme generalization?
I don't have any specific examples, but literally, like, take this situation, for instance.
You've never been here in this room. Maybe you've been in this city a few times.
I don't know, but there's a fair amount of novelty. You've never been interviewing me. There's a fair amount of novelty in every hour of every day in your life.
And it's in fact, by and large, more novelty than any LLM could handle. Like if you just put an LLM in a robot, it could not be doing all the things that you've been doing today.
Right.
Or take, I don't know, like self-driving cars, for instance. You take a self-driving car operating in the Bay Area. Do you think you could just drop it in New York City, or drop it in London, where people drive on the left? No, it's going to fail. So not only can you not make it generalize to a change of driving rules, but you cannot even make it generalize to a new city. It needs to be trained on each specific environment.
I mean, I agree that self-driving cars aren't AGI. But it's the same type of model.
They are transformers as well. I mean, I don't know.
Apes also have brains with neurons in them, but they're less intelligent because they're smaller. It's not the same architecture.
We can get into that. But so I still don't understand like a concrete thing of, we also need training.
That's why education exists. That's why we had to spend the first 18 years of our life doing drills.
We have a memory, but we are not a memory. We are not limited to just a memory.
But I'm denying the premise that that's necessarily the only thing these models are doing, and I'm still not sure what is the task that a remote worker would be doing, have to, like, suppose you just stepped out of remote work with an LLM, and they're a programmer. What is the first point at which you realize this is not a human, this is an LLM? What about I just send them an arc puzzle and see how they do? No, like part of their job, you know.

But you have to deal with novelty all the time.

Okay, so is there a world in which all the programmers are replaced, and then we're still saying, ah, but they're only doing memorization-laden programming tasks, but they're still producing a trillion dollars' worth of output in the form of code?

Software development is actually a pretty good example of a job where you're dealing with novelty all the time. Or if you're not, well, I'm not sure what you're doing.
So I personally use generative AI very little in my software development job. And before LLMs were a thing, I was also using Stack Overflow very little.
You know, some people maybe are just copy-pasting stuff from Stack Overflow or nowadays copy-pasting stuff from an LLM. Personally, I try to focus on problem-solving.
The syntax is just a technical detail. What's really important is the problem-solving.
The essence of programming is engineering mental models, like mental representations of the problem you're trying to solve. But, you know, many people can interact with these systems themselves: you can go to ChatGPT and say, here's a specification of the kind of program I want.
They'll build it for you. As long as there are many examples of this program on like GitHub and Stack Overflow and so on, sure, they will fetch the program for you from their memory.
But you can change arbitrary details. No, it doesn't work.
You can say, I need it to work on this different kind of server. If that were true, there would be no software engineers today.
I agree we're not at a full AGI yet in the sense that these models have, let's say, less than a trillion parameters. A human brain has somewhere on the order of 10 to 30 trillion synapses.
I mean, if you were just doing some naive math, you're like at least 10x underparameterized. So I agree we're not there yet, but I'm sort of confused on why we're not on the spectrum where, yes, I agree that there are many kinds of generalization they can't do, but it seems like they're on this kind of smooth spectrum that we see even within humans, where some humans would have a hard time doing an ARC-type test.
We see that based on the performance on Raven's Progressive Matrices-type IQ tests. I'm not a fan of IQ tests because for the most part, you can train on IQ tests and get better at them.
So they're very much memorization-based. And this is actually the main pitfall that Arc tries not to fall for.
I'm still confused. So if all remote jobs are automated in the next five years, let's say, at least the ones that don't require you to be like sort of a service.
It's not like a salesperson where you want the human to be talking, but like programming, whatever. In that world, would you say that that's not possible because a lot of what a programmer needs to do definitely requires things that would not be in any pre-training corpus? Sure.
I mean, in five years, there will be more software engineers than there are today. Right.
But I just want to understand. So I'm still not sure.
I mean, I know how to, I studied computer science. If I had become a code monkey out of college, like what would I be doing? I go to my job.
What is the first thing my boss tells me to do? When does he realize I'm an LLM, if I were an LLM?

Probably on the first day, you know. Again, if it were true that LLMs could generalize to novel problems like this and actually develop software to solve a problem they've never seen before, you would not need software engineers anymore. In practice, if I look at how people are using LLMs in their software engineering job today, they are using it as a Stack Overflow replacement.
So they are using it as a way to copy-paste code snippets to perform very common actions. And what they actually need is a database of code snippets.
They don't actually need any of the abilities that actually make them software engineers. I mean, when we talk about interpolating between Stack Overflow databases, if you look at the kinds of math problems or coding problems, maybe to say that they're...
Maybe let's step back on interpolation and let me ask the question this way. Why isn't creativity just interpolation in a higher dimension where if a bigger model can learn a more complex manifold, if we're going to use the ML language.
And if you read a biography of a scientist, right, it doesn't feel like they're zero-shotting new scientific theories. They're playing with existing ideas.
They're trying to juxtapose them in their head. In the tree of intellectual descendants, they try out a slightly different evolutionary path.
You sort of run the experiment there in terms of publishing the paper, whatever.
It seems like a similar kind of thing humans are doing. There's like at a higher level of generalization.
And what you see across bigger and bigger models is that they seem to be approaching higher and higher levels of generalization: GPT-2 couldn't do a grade-school-level math problem, because that requires more generalization than it has capability for, but GPT-3 and GPT-4 can. So not quite.
So GPT-4 has a higher degree of skill and a higher range of skills, but the same degree of generalization. I don't want to get into semantics here, but the question of why can't creativity be just interpolation in a higher dimension? I think interpolation can be creative, absolutely.
And, you know, to your point, I do think that on some level, humans also do a lot of memorization, a lot of reciting, a lot of pattern matching, a lot of interpolation as well. So it's very much a spectrum between pattern matching and true reasoning.
It's a spectrum and humans are never really at one end of the spectrum. They're never really doing pure pattern matching or pure reasoning.
They're usually doing some mixture of both. Even if you're doing something that seems very reasoning-heavy, like proving a mathematical theorem, as you're doing it, sure, you're doing quite a bit of discrete search in your mind, quite a bit of actual reasoning, but you're also very much guided by intuition, guided by pattern matching, guided by the shape of proofs that you've seen before, by your knowledge of mathematics.
So it's never really, you know, all of our thoughts, everything we do is a mixture of this sort of like interpolative memorization-based thinking, this sort of like type one thinking and type two thinking. Why are bigger models more sample efficient? Because they have more reusable building blocks that they can lean on to pick up new patterns in their training data.
And does that pattern keep continuing as you keep getting bigger and bigger? To the extent that the new patterns you're giving the model to learn are a good match for what it has learned before. If you present something that is actually novel, that is not in the distribution, like an ARC puzzle, for instance, it will fail.
Let me make this claim. The program synthesis, I think, is a very, very useful intuition pump.
Why can't it be the case that what's happening in the transformer is that the early layers are figuring out how to represent the input tokens? And what the middle layers do is this kind of program search, program synthesis, where they combine the inputs to, you know, all the circuits in the model, going from the low-level representation to a higher-level representation near the middle of the model.
They use these programs and they combine these concepts. Then what comes out the other end is the reasoning based on that high level intelligence.
Possibly. Why not? But, you know, if these models were actually capable of synthesizing novel programs, however simple, they should be able to do ARC.
Because for any ARC task, if you write down the solution program in Python, it's not a complex program. It's extremely simple.
And humans can figure it out. So why can LLMs not do it? Okay, I think that's a fair point.
And if I turn the question around to you, so suppose that it's the case that in a year, a multimodal model can solve ARC, let's say get 80%, whatever the average human would get, then AGI? Quite possibly, yes. I think if you start...
So honestly, what I would like to see is an LLM type model solving ARC at like 80%, but after having only been trained on core knowledge-related stuff. But human kids, I don't think we're necessarily just trained on.
It's not just that we have in our genes object permanence. Let me rephrase that.
Only trained on information that is not explicitly trying to anticipate what's going to be in the ARC test set. But isn't the whole point of ARC that you can't sort of, it's a new type of intelligence test every single time? Yes, that is the point.
So if ARC were a perfect, flawless benchmark, it would be impossible to anticipate what's in the test set.

And, you know, Arc was released more than four years ago, and so far it's been resistant to memorization. So I think it has, to some extent, passed the test of time.
But I don't think it's perfect. I think if you try to make by hand hundreds of thousands of Arc tasks, and then you try to multiply them by programmatically generating variations, and then you end up with maybe hundreds of millions of tasks.
Just by brute forcing the task space, there will be enough overlap between what you're trained on and what's in the test set that you can actually score very highly. So, you know, with enough scale, you can always cheat.
If you can do this for every single thing that supposedly requires intelligence, then what good is intelligence? Apparently, you can just brute force intelligence. If the world, if your life were a static distribution, then sure, you could just brute force the space of possible behaviors.
Like, you know, the way we think about intelligence, there are several metaphors I like to use, but one of them is you can think of intelligence as a path-finding algorithm in future situation space. Like, I don't know if you're familiar with game development, like RTS game development, but you have a map, right? And you have, it's like a 2D map, and you have partial information about it.
Like there is some fog of war on your map. There are areas that you haven't explored yet; you know nothing about them. And then there are areas that you've explored, but you only know how they were in the past. You don't know how they are today. And now, instead of thinking about a 2D map, think about the space of possible future situations that you might encounter and how they're connected to each other. Intelligence is a pathfinding algorithm.
So once you set a goal, it will tell you how to get there optimally. But of course, it's constrained by the information you have.
It cannot pathfind in an area that you know nothing about. It also cannot anticipate changes. And the thing is, if you had complete information about the map, then you could solve the pathfinding problem by simply memorizing every possible path, every mapping from point A to point B. You could solve the problem with pure memory.
But the reason you cannot do that in real life is because you don't actually know what's going to happen in the future. Life is ever-changing.
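The pathfinding metaphor can be made concrete with a toy grid. This is only an illustrative sketch: the grid, the function names (`shortest_path_length`, `memorize_all_paths`) and the choice of breadth-first search are assumptions made for the example, not anything specified in the conversation. It shows that a table of memorized answers is enough while the map is static, and goes stale once the map changes, whereas search run on the current map still works.

```python
from collections import deque


def shortest_path_length(grid, start, goal):
    """Breadth-first search on a grid of 0 (free) and 1 (wall) cells."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None  # goal is unreachable


def memorize_all_paths(grid):
    """With complete information, precompute the answer for every pair of free cells."""
    free = [(r, c) for r, row in enumerate(grid) for c, v in enumerate(row) if v == 0]
    return {(a, b): shortest_path_length(grid, a, b) for a in free for b in free}


known_map = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
table = memorize_all_paths(known_map)
print(table[((0, 0), (2, 0))])  # 6: pure lookup, no search needed

# The world changes: one passage closes and another opens.
changed_map = [
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]
print(table[((0, 0), (2, 0))])  # still 6: the memorized answer is now stale
print(shortest_path_length(changed_map, (0, 0), (2, 0)))  # 4: search adapts
```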
I feel like you're using words like memorization, which we would never use for human children. If your kid learns to do algebra and then now learns to do calculus, you wouldn't say they've memorized calculus.
If they can just solve any arbitrary algebraic problem, you wouldn't say they've memorized algebra. You'd say they've learned algebra.
Humans are never really doing pure memorization or pure reasoning. But that's only because you're semantically labeling it: when the human does it, it's a skill, but it's memorization when the exact same skill is done by the LLM, as you can measure by these benchmarks, and you can just plug in any sort of math problem.
Sometimes humans are doing the exact same as the LLM is doing, which is just, for instance, I don't know, if you learn to add numbers, you're memorizing an algorithm, you're memorizing a program and then you can reapply it. You are not synthesizing on the fly the addition program.
So obviously at some point, some human had to figure out how to do addition, but the way a kid learns it is not that they figure out from the axioms of set theory how to do addition. I think what you learn in school is mostly memorization. Right. So my claim is that, listen, these models are vastly underparameterized relative to how many flops or how many parameters you have in the human brain. And so, yeah, they're not going to be coming up with new theorems like the smartest humans can, but most humans can't do that either. What most humans do sounds similar to what you are calling memorization, which is memorizing skills or memorizing, you know, techniques that you've learned.
And so it sounds like it's compatible. And tell me if this is wrong.
Is it compatible in your world if like all the remote workers are gone, but they're doing skills which we can potentially make synthetic data off? So we record everybody's screen and every single remote worker screen. We sort of understand the skills they're performing there.
And now we've trained a model that can do all this. All the remote workers are unemployed.
We're generating trillions of dollars of economic activity from AI remote workers. In that world, are we still in the memorization regime? So sure.
With memorization, you can automate almost anything, as long as it's a static distribution, as long as you don't have to deal with change. Are most jobs part of such a static distribution? Potentially, there are lots of things that you can automate.
And LLMs are an excellent tool for automation. But you have to understand that automation is not the same as intelligence.
I'm not saying that LLMs are useless. I've been a huge proponent of deep learning for many years.
And, you know, for many years, I've been saying two things. I've been saying that if you keep scaling up deep learning, it will keep paying off.
And at the same time, I've been saying, if you keep scaling up deep learning, this will not lead to AGI. So we can automate more and more things, and yes, this is economically valuable, and yes, potentially there are many jobs you could automate away like this, and that would be economically valuable.
But you're still not going to have intelligence. So you can ask, you know, okay, so what does it matter if we can generate all this economic value? Maybe you don't need intelligence after all.
Well, you need intelligence the moment you have to deal with change, with novelty, with uncertainty. As long as you're in a space that can be exactly described in advance, you can just use pure memorization.
In fact, you can always solve any problem. You can always display arbitrary levels of skills on any task without leveraging any intelligence whatsoever, as long as it is possible to describe the problem and its solution very, very precisely.
But when they do deal with novelty, then you just call it interpolation, right? And so... No, no, no.
Interpolation is not enough to deal with all kinds of novelty. If it were, then LLMs would be AGI.
Well, I agree they're not AGI. I'm just trying to figure out how do we figure out we're on the path to AGI.
And I think the crux here is maybe that it seems to me that these things are on a spectrum and we're clearly covering the earliest part of the spectrum with LLMs. I think so.
And, okay, interesting. But here's another sort of thing that I think is evidence for this.
Grokking, right? So clearly, even within deep learning, there's a difference between the memorization regime and the generalization regime, where at first they'll just memorize the data set of, you know, if you're doing modular addition, how to add digits. And then at some point, if you keep training on that, they'll learn the skill.
So the fact that there is that distinction suggests that the generalized circuit, the deep learning can learn, there is a regime it enters where it generalizes. If you have an over-parameterized model, which you don't have in comparison to all the tasks we want these models to do right now.
Grokking is a very, very old phenomenon. We've been observing it for decades.

It's basically an instance of the minimum description length principle, where sure, given a problem, you can just memorize a point-wise input-to-output mapping, which is completely overfit. So it does not generalize at all, but it solves the problem on the train data.
And from there, you can actually keep pruning it, keep making your mapping simpler and simpler and more compressed. And at some point, it will start generalizing.
And so that's something called the minimum description length principle. It's this idea that the program that will generalize best is the shortest.
Right? And it doesn't mean that you're doing anything other than memorization, but you're doing memorization plus regularization. Right.
A.k.a. generalization.
Yeah, and that leads absolutely to generalization. Right, and so you do that within one skill, but then the pattern you see here of meta-learning is that it's more efficient to store a program that can perform many skills rather than one skill, which is what we might call fluid intelligence.
And so as you get bigger in models, you would expect it to go up this hierarchy of generalization where it generalizes to a skill, then it generalizes across multiple skills. That's correct.
That's correct. And, you know, LLMs, they're not infinitely large.
They have only a fixed number of parameters. And so they have to compress their knowledge as much as possible.
And in practice, LLMs are mostly storing reusable bits of programs, like vector programs. And because they have this need for compression, it means that every time they're learning a new program, they're going to try to express it in terms of existing bits and pieces of programs that they've already learned before.
Isn't this the generalization? Absolutely. Oh, wait, so...
This is what, you know, clearly LLMs have some degree of generalization. And this is precisely why.
It's because they have to compress. And why is that intrinsically limited? Why can't you just go, at some point it has to learn a higher level of generalization, a higher level, and then the highest level is the fluid intelligence? It's intrinsically limited because the substrate of your model is a big parametric curve.
And all you can do with this is local generalization. If you want to go beyond this towards broader or an extreme generalization, you have to move to a different type of model.
And my paradigm of choice is discrete program search, program synthesis. And if you want to understand that, you can sort of like compare it, contrast it with deep learning.
So in deep learning, your model is a parametric, a differentiable parametric curve. In program synthesis, your model is a discrete graph of operators.
So you've got like a set of logical operators, like a domain-specific language. You're picking instances of it.
You're structuring that into a graph that's a program. And that's actually very similar to like a program you might write in Python or C++ and so on.
And in deep learning, your learning engine, because we are doing machine learning here, like we're trying to automatically learn these models, in deep learning your learning engine is gradient descent. And gradient descent is very compute efficient because you have this very strong, informative feedback signal about where the solution is. So you can get to the solution very quickly.
But it is very data inefficient, meaning that in order to make it work, you need a dense sampling of the operating space. You need a dense sampling of the data distribution.
And then you're limited to only generalizing within that data distribution. And the reason why you have this limitation is because your model is a curve.
And meanwhile, if you look at discrete program search, the learning engine is combinatorial search. You're just trying a bunch of programs until you find one that actually meets your spec.
This process is extremely data efficient. You can learn a generalizable program from just one example, two examples, which is why it works so well on Arc, by the way.
But the big limitation is that it's extremely compute inefficient because you're running into a combinatorial explosion, of course. And so you can sort of see here how deep learning and discrete program search, they have very complementary strengths and limitations as well.
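As a rough illustration of the discrete program search described here, the sketch below enumerates compositions of a tiny, made-up domain-specific language of grid operations until one reproduces a single input-output example. The three operators, the names `DSL`, `run` and `synthesize`, and the toy task are all assumptions for the example; this is not an ARC solver and not the method of any particular system.

```python
from itertools import product

# A hypothetical three-operator DSL over grids (lists of rows).
DSL = {
    "flip_h": lambda g: [row[::-1] for row in g],         # mirror left-right
    "flip_v": lambda g: g[::-1],                          # mirror top-bottom
    "transpose": lambda g: [list(col) for col in zip(*g)],
}


def run(program, grid):
    """Apply a sequence of operator names to a grid, left to right."""
    for name in program:
        grid = DSL[name](grid)
    return grid


def synthesize(examples, max_length=3):
    """Try every program, shortest first, until one matches all examples."""
    tried = 0
    for length in range(1, max_length + 1):
        for program in product(DSL, repeat=length):
            tried += 1
            if all(run(program, x) == y for x, y in examples):
                return program, tried
    return None, tried


# One example of the hidden rule "rotate 90 degrees clockwise".
examples = [([[1, 2], [3, 4]], [[3, 1], [4, 2]])]
program, tried = synthesize(examples)
print(program, tried)  # ('flip_v', 'transpose') 9

# The program learned from one example also works on an unseen, differently shaped grid.
print(run(program, [[1, 2, 3], [4, 5, 6]]))  # [[4, 1], [5, 2], [6, 3]]
```

The data efficiency comes from the search space being small, structured programs. The cost is that the number of candidates grows as `len(DSL) ** length`, which is the combinatorial explosion mentioned above.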

Like every limitation of deep learning has a corresponding strength in program synthesis, and inversely. And I think the path forward is going to be to merge the two, to basically start doing... So another way you can think about it is, so these parametric curves, trained with gradient descent, they are a great fit for everything that's system one type thinking, like pattern recognition, intuition, memorization and so on. And discrete program search is a great fit for type two thinking, system two thinking.
For instance, planning, reasoning, quickly figuring out a generalizable model that matches just one or two examples, like for an ARC puzzle, for instance. And I think humans are never doing pure system one or pure system two.
They're always mixing and matching both. And right now we have all the tools for system one.
We have almost nothing for system two. The way forward is to create a hybrid system.
And I think the form it's going to take is it's going to be mostly system two. So the outer structure is going to be a discrete program search system.
But you're going to fix the fundamental limitation of discrete program search, which is combinatorial explosion. You're going to fix it with deep learning.
You're going to leverage deep learning to guide, to provide intuition in program space to guide the program search. And I think that's very similar to what you see, for instance, when you're playing chess or when you're trying to prove a theorem, is that it's mostly a reasoning thing, but you start out with some intuition about the shape of the solution.
And that's very much something you can get via a deep learning model. Deep learning models, they're very much like intuition machines.
They're pattern-matching machines. So you start from this shape of the solution and then you're going to do actual explicit discrete program search.
But you're not going to do it via brute force. You're not going to try things kind of like randomly.
You're actually going to ask another deep learning model for suggestions. Like here's the most likely next step.
Here's where in the graph you should be going. And you can also use yet another deep learning model for feedback.
But well, here's what I have so far. Is it looking good? Should I just backtrack and try something new? So I think discrete program search is going to be the key, but you want to make it dramatically better, orders of magnitude more efficient, by leveraging deep learning. And by the way, another thing that you can use deep learning for is, of course, things like common sense knowledge and knowledge in general. And I think you're going to end up with this sort of system where you have this on-the-fly synthesis engine that can adapt to new situations. But the way it adapts is that it's going to fetch from a bank of patterns, modules that could themselves be curves, that could be differentiable modules, and some others that could be algorithmic in nature.
It's going to assemble them via this process that's intuition-guided. And it's going to give you, for every new situation you might be faced with, it's going to give you a generalizable model that was synthesized using very, very little data.
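The idea of guiding program search with intuition can be sketched as best-first search, where a scoring function decides which partial program to extend next. Everything here is an assumption for illustration: the additive toy DSL (`OPS`), the task, and above all the scorer `error`, which is a hand-written heuristic standing in for the learned model described in the conversation. It is not a neural network and says nothing about how such a model would actually be trained.

```python
import heapq
from itertools import count, product

# A hypothetical toy DSL over integers.
OPS = {
    "inc": lambda x: x + 1,
    "add10": lambda x: x + 10,
    "add100": lambda x: x + 100,
}


def run(program, x):
    """Apply a sequence of operator names to x, left to right."""
    for name in program:
        x = OPS[name](x)
    return x


def error(program, examples):
    """Stand-in for a learned scorer: total distance between outputs and targets."""
    return sum(abs(run(program, x) - y) for x, y in examples)


def brute_force_search(examples, max_length):
    """Check every program, shortest first, with no guidance."""
    checked = 0
    for length in range(max_length + 1):
        for program in product(OPS, repeat=length):
            checked += 1
            if error(program, examples) == 0:
                return program, checked
    return None, checked


def guided_search(examples, max_length):
    """Best-first search: always extend the partial program the scorer likes most."""
    tie = count()  # tie-breaker so the heap never compares programs directly
    frontier = [(error((), examples), next(tie), ())]
    checked = 0
    while frontier:
        score, _, program = heapq.heappop(frontier)
        checked += 1
        if score == 0:
            return program, checked
        if len(program) < max_length:
            for name in OPS:
                child = program + (name,)
                heapq.heappush(frontier, (error(child, examples), next(tie), child))
    return None, checked


examples = [(0, 231), (5, 236)]  # hidden rule: add 231
guided_program, guided_checked = guided_search(examples, max_length=8)
brute_program, brute_checked = brute_force_search(examples, max_length=8)
print(guided_program, guided_checked)  # a 6-step program, found after checking 7 candidates
print(brute_checked)                   # the unguided search checks hundreds of candidates
```

On this toy task the heuristic happens to be a very good guide. A poor scorer can mislead the search, which is why the quality of the learned intuition matters in the hybrid system being described.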
Something like this would solve ARC. That's actually a really interesting prompt because I think an interesting crux here is when I talk to my friends who are extremely optimistic about LLMs and expect AGI within the next couple of years, they also in some sense agree that scaling is not all you need, but that the rest of the progress is undergirded and enabled by scaling.
But still, you need to add system two and test-time compute atop these models. And their perspective is that it's relatively straightforward to do that because you have this library of representations that you built up from pre-training.
But it's almost like, you know, it's just skimming through textbooks. You need some more deliberate way in which it engages with the material it learns.
In-context learning is extremely sample efficient. But to actually distill that into the weights, you need the model to like talk through the things it sees and then add it back to the weights.
As far as the system two goes, they talk about adding some kind of RL setup so that it is encouraged to proceed on the reasoning traces that end up being correct. And they think this is relatively straightforward stuff that will be added within the next couple of years.
That's an empirical question. So I think we'll see.
Your intuition, I assume, is not that. I'm curious why.
My intuition is, in fact, that this whole system two architecture is the hard part, the very hard and unobvious part. Scaling up the interpolative memory is the easy part.
All you need is, like it's literally just a big curve. All you need is more data.
It's a representation of a data set, an interpolative representation of a data set. That's the easy part.
The hard part is the architecture of intelligence. Memory and intelligence are separate components.
We have the memory. We don't have the intelligence yet.
And I agree with you that, well, having the memory is actually very useful. And if you just had the intelligence, but it was not hooked up to an extensive memory, it would not be that useful because it would not have enough material to work from.
Yeah. The alternative hypothesis here, which a former guest, Trenton Bricken, advanced, is that intelligence is just hierarchical associative memory, where higher-level patterns...
When Sherlock Holmes goes into a crime scene and he's extremely sample efficient, he can just like look at a few clues and figure out who the murderer was. And the way he's able to do that is he has learned higher and higher level sort of associations.
It's memory in some fundamental sense. But so here's one way to ask the question.
In the brain, supposedly we do program synthesis, but it is just synapses connected to each other. And so physically, it's got to be that you just query the right circuit, right? You are.
Yeah, yeah, yeah. It's a matter of degree.
But if you can learn it, if, you know, training in the environment that the human ancestors are trained in means you learn those circuits, training on the same kinds of outputs that humans produce, which to replicate require these kinds of circuits, wouldn't that train the same kind of whatever humans have? You know, it's a matter of degree. If you have a system that has a memory and is only capable of doing local generalization from that, it's not going to be very adaptable.
To be really general, you need the memory plus the ability to search to quite some depth to achieve, you know, broader, even extreme generalization. You know, Jean Piaget, who is the founder of developmental psychology, had a very good quote about intelligence.
He said, intelligence is what you use when you don't know what to do. And it's like as a human living your life, in most situations you already know what to do because you've been in this situation before.

You already have the answer. And you're only going to need to use intelligence when you're faced with novelty, with something you didn't expect, with something that you weren't prepared for either by your own experience, your own life experience, or by your evolutionary history.
Like this day that you're living right now is different in some important ways from every day you've lived before, but it's also different from any day ever lived by any of your ancestors. And still, you're capable of being functional, right? How is that possible? I'm not denying that generalization is extremely important and is the basis for intelligence.
That's not the crux. The crux is like how much of that is happening in the models.
But OK, let me ask a separate question. We might keep going in circles here.
The differences in intelligence between humans: maybe the intelligence tests, because of the reasons you mentioned, are not measuring it well. But clearly, there are differences in intelligence between different humans.
Sure. What is your explanation for what's going on there? Because I think that's sort of compatible with my story that there's a spectrum of generality and that these models are climbing up to a human level.
And even some humans haven't even climbed up to the Einstein level or the Francois level. That's a great question.
You know, there is extensive evidence that differences in intelligence are mostly genetic in nature. Meaning that if you take someone who is not very intelligent, there is no amount of training, of training data you can expose that person to that would make them become Einstein.

And this kind of points to the fact that you really need a better architecture. You need a better algorithm.
And more training data is not, in fact, all you need. I think I agree with that.
I think maybe the way I might phrase it is that the people who are smarter have, in ML language, better initializations. The neural wiring, if you just look at it, it's more efficient.
They have maybe greater density of firing. And so some part of the story is scaling; there is some correlation between brain size and intelligence.

And we also see, within the context of the quote-unquote scaling that people talk about with LLMs, architectural improvements, where a model like Gemini 1.5 Flash performs as well as GPT-4 did when GPT-4 was released a year ago, but is 57 times cheaper on output. So part of the scaling story is that, with architectural improvements, we're in extremely low-hanging fruit territory.
Okay, we're back now with the co-founder of Zapier, Mike Knoop. We had to restart a few times there.
And you're funding this prize and you're running this prize with Francois. And so tell me about how this came together.
What prompted you guys to launch this prize? Yeah. I guess I've been sort of like AI curious for 13 years.
I co-founded Zapier and have been running it for the last 13 years. I think I first got introduced to your work during COVID. I kind of went down the rabbit hole; I had a lot of free time.
It was right after you published your "On the Measure of Intelligence" paper, where you sort of introduced the concept of AGI, that efficiency of skill acquisition is the right definition, and the ARC puzzles. But I don't think the first Kaggle contest was done yet. I think it was still running. So it was interesting, but I just parked the idea. I had bigger fish to fry at Zapier. We were in the middle of this big turnaround, trying to get to our second product.
Then it was January 2022 when the chain-of-thought paper came out, and that really awoke me to the progress. I had even given a whole presentation to Zapier on the GPT-3 paper, so I felt like I had priced in everything that LLMs could do, and that paper was really shocking to me in terms of these latent capabilities that LLMs have that I didn't expect they had.
So I actually gave up my exec team role at Zapier. I was running half the company at that point. I went back to being an individual contributor, just to go do AI research alongside Brian, my co-founder. And ultimately, that led me back towards ARC.
I was looking into it again, and I had sort of expected to see this saturation effect that MMLU has, that GSM8K has. And when I looked at the scores and the progress over the last four years, I was really, again, shocked to see that we've actually made very little objective progress towards it.
And it felt very, it felt like a really, really important eval. And as I sort of spent the last year asking people, quizzing people about it in sort of my networking community, very few people even knew it existed.
And that felt like, okay, if it's right that this is a really, really like globally singularly unique AGI eval, and it's different from every other eval that exists that are more narrowly measures AI skill, like more people should know about this thing. I had my own ideas on how to beat the arc as well.
So I'm like, I was working on nights and weekends on that. And I flew up to meet Francois earlier this year to sort of quiz him, show him my ideas.
And ultimately, I was like, well, you know, why don't you think more people know about ARC? I think you should actually answer that. I think it's a really interesting question.
Like, why don't you think more people know about ARC? Sure. You know, I think benchmarks that gain traction in the research community are benchmarks that are already fairly tractable.
Because the dynamic that you see is that some research group is going to make some initial breakthrough, and then this is going to catch the attention of everyone else. And so you're going to get follow-up papers with people trying to beat the first team and so on.
And for ARC, this has not really happened because ARC is actually very hard for existing AI techniques. ARC requires you to try new ideas.
And that's very much the point, by the way. The point is not that, yeah, you should just be able to apply existing technology and solve ARC.
The point is that existing technology has reached a plateau. And if you want to go beyond that, if you want to start being able to tackle problems that you haven't memorized, that you haven't seen before, you need to try new ideas. And ARC is not just meant to be this sort of measure of how close we are to AGI. It's also meant to be a source of inspiration. I want researchers to look at these puzzles and be like, hey, it's really strange that these puzzles are so simple and most humans can just do them very quickly.
Why is it so hard for existing AI systems? Why is it so hard for LLMs and so on? This is true for LLMs, but ARC was actually released before LLMs were really a thing. And the only thing that made it special at the time was that it was designed to be resistant to memorization.

And the fact that it has survived LLMs and gen AI in general so well kind of shows that yes, it is actually resistant to memorization. This is what nerd-sniped me, because I went and took a bunch of the puzzles myself.
I've shown them to all my friends and family too. And they're all like, oh yeah, this is super easy. Are you sure AI can't solve this? That's the reaction, and it was the same one for me as well.
And the more you dig in, you're like, okay, yep, there's not just empirical evidence over the last four years that it's unbeaten, but there are theoretical concepts behind why. And I completely agree at this point that new ideas are basically needed to beat ARC, and there are a lot of current trends in the world that are actually, I think, working against that happening.
Basically, I think we're actually less likely to generate new ideas right now. You know, I think one of the trends is the closing up of frontier research, right? The GPT-4 paper from OpenAI had no technical detail shared.
The Gemini paper had no technical detail shared on, like, the longer-context part of that work. And yet that open innovation and open progress and sharing is what got us to transformers in the first place.
That's what got us to LLMs in the first place. So it's kind of disappointing, actually, that so much frontier work has gone closed.
It's really making a bet that these individual labs are going to have the breakthrough, and not the ecosystem. And I think the internet and open source have shown that that's the most powerful innovation ecosystem that's ever existed, probably in the entire world.
I think that's actually really sad that frontier research is no longer being published. If you look back four years ago, well, everything was just openly shared.
Like all the state-of-the-art results were published. And this is no longer the case.
And it's very much, you know, OpenAI single-handedly changed the game. And I think OpenAI basically set back progress towards AGI by quite a few years, probably like five to ten years, for two reasons.
And one is that, well, they caused this complete closing down of frontier research publishing. But also, they triggered this initial burst of hype around LLMs.
And now LLMs have sucked the oxygen out of the room. Like everything, everyone is just doing LLMs.
And I see LLMs as more of an off-ramp on the path to AGI, actually. And all these new resources, they're actually going to LLMs instead of everything else they could be going to.
And, you know, if you look further into the past, to like 2015, 2016, there were like a thousand times fewer people doing AI back then.
And yet I feel like the rate of progress was higher because people were exploring more directions. The world felt more open-ended, like you could just go and try: have a cool idea, launch it, try it, and get some interesting results.
So there was this energy. And now everyone is very much doing some variation of the same thing.
And big labs also tried their hand at ARC, but because they got bad results, they didn't publish anything. Like, you know, people only publish positive results.
I wonder how much effort people have put into trying to prompt or scaffold, maybe with some sort of Devin-type approach, the frontier models into getting good solutions on ARC, and I mean the frontier models of today, like Claude 3 Opus or GPT-4o, not just those of a year ago, because a lot of post-training has gone into making them better. I hope that one of the things this episode does is get people to try out this open competition, where they have to put in an open source model to compete, but also to figure out whether the capability is maybe latent in Claude Opus, and see if you can show that. I think that would be super interesting.
So let's talk about the prize. How much do you win if you solve it, you know, get whatever percent on ARC? How much do you get if you get the best submission but don't crack it? So we've got a million dollars, actually a little over a million dollars, in the prize pool. We're running the contest on an annual basis.
We're starting it today, and it runs through the middle of November. And the goal is to get 85%.
That's the lower bound of the human average that you guys talked about earlier. And there's a $500,000 prize for the first team that can get to the 85% benchmark. We don't expect that to happen this year, actually.
One of the early statisticians at Zapier gave me this line that has always stuck with me: the longer it takes, the longer it takes. So my prior is that ARC is going to take years to solve. And so we're also going to break it down and do a progress prize this year. So there's a $100,000 progress prize, which we will pay out to the top scores.
So $50,000 is going to go to the top objective scores this year on the Kaggle leaderboard, since we're hosting it on Kaggle. And then we're going to have a $50,000 pot set aside for a paper award, for the best paper that explains conceptually the scores that they were able to achieve.
And one of the interesting things we're also going to be doing is requiring that, in order to win the prize money, you put the solution or your paper out into the public domain. The reason for this is, you know, typically with contests, you see a lot of closed-up sharing.
People are kind of private, secretive. They want to hold their alpha to themselves during the contest period.
And because we expect it's going to be multiple years, we want an iterated game here. So the plan is, you know, at the end of November, we will award the $100,000 progress prize money to the top scores, and then use the downtime between December, January, and February to share out all the knowledge from the top scores and the approaches folks were taking, in order to re-baseline the community up to whatever the state of the art is, and then run the contest again next year.
And keep doing that on a yearly basis until we get 85%. I'll give people some context on why I think this prize is very interesting.
I was having conversations with my friends who are very much believers in models as they exist today. And first of all, it was intriguing to me that they didn't know about ARC.
These are experienced ML researchers. This happened a couple of nights ago: we went to dinner and I showed them an example problem.
And they said, of course, an LLM would be able to solve something like this. And then we take a screenshot of it.
We just put it into our ChatGPT app and it doesn't get the pattern. And so I think it's very interesting.
Like, it is a notable fact. I was sort of playing devil's advocate against you on these kinds of questions, but this is a very intriguing fact.
I think this prize is extremely interesting because we're going to learn something fascinating one way or another. So with regards to the 85%, separate from this prize, I'd be very curious if somebody could replicate that result, because obviously in psychology and other kinds of fields, which this result seems to be analogous to, when you run tests on some small sample of people, often they're hard to replicate. So I'd be very curious if you try to replicate this.
How does the average human perform on ARC? As for the difficulty, or how long it will take to crack this benchmark, it's very interesting because the other benchmarks that are now fully saturated, like MMLU and MATH, the people who made them, Dan Hendrycks and Collin Burns, who did MMLU and MATH, I think they were grad students or college students when they made them. And the goal when they made them just a couple of years ago was that this would be a test of AGI.
And of course, it got totally saturated. I know you argue that these are tests of memorization, but I think the pattern we've seen, in fact, Epoch AI has a very interesting graph that I'll sort of overlay for the YouTube version here, where you see this almost exponential where it gets, you know, 5%, 10%, 30%, 40% as you increase the compute across models, and it just shoots up.
And in the GPT-4 technical report, they had this interesting graph of the HumanEval problem set, which was 22 coding problems. And they had to graph it on the mean log pass curve, basically because early on in training, or even with smaller models, they can have the right idea of how to solve the problem.
But it takes a lot of reliability to make sure they stay on track to solve the whole problem. And so you really want to upweight the signal where they get it right at least some of the time.
Maybe 1 in 100 times, 1 in 1,000. And then so they go from 1 in 1,000, 1 in 100, 1 in 10.
And then they just totally saturate it. I guess the question this is all leading up to is: why won't the same thing happen with ARC, where people had to try really hard with bigger models, and now they've figured out these techniques, like the ones Jack Cole has figured out, with only a 240 million parameter language model that can get 35%? Shouldn't we see the same pattern we saw across all these other benchmarks, where you just sort of eke out progress, and then once you get the general idea, you just go all the way to 100? That's an empirical question.
So we'll see in practice what happens. But what Jack Cole is doing is actually very unique.
It's not just pre-training an LLM and then prompting it. It's actually trying to do active inference.
He's doing test time, right? He's doing like test-time fine-tuning. Like test-time fine-tuning.
And this is actually trying to lift one of the key limitations of LLMs, which is that at inference time, they cannot learn anything new. They cannot adapt on the fly to what they're seeing.
And he's actually trying to learn. So what he's doing is effectively a form of program synthesis.
Because the LLM contains a lot of useful building blocks, like programming building blocks, and by fine-tuning it on the task at test time, you are trying to assemble these building blocks into the right pattern that matches the task. This is exactly what program synthesis is about.
And the way I would contrast this approach with discrete program search is that in discrete program search, where you're trying to assemble a program from a set of primitives, you have very few primitives. So people working on discrete program search on ARC, for instance, they tend to work with DSLs that have like 100 to 200 primitive programs.
So very small DSL, but then they're trying to combine these primitives into very complex programs. So there's a very deep depth of search.
And on the other hand, if you look at what Jack Cole is doing with LLMs, he's got this sort of vector program database, a DSL of millions of building blocks in the LLM, that are mined by pre-training the LLM, not just on a ton of programming problems, but also on millions of generated ARC-like tasks. So you have an extraordinarily large DSL.
And then the fine-tuning is a very, very shallow recombination of these primitives. So discrete program search, very deep recombination, very small set of primitive programs.
And the LLM approach is the same, but on the complete opposite end of that spectrum, where you scale up the memorization by a massive factor and you're doing very, very shallow search. But they are the same thing, just different ends of the spectrum.
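To make the "small DSL, deep search" end of that spectrum concrete, here is a minimal sketch of discrete program search in Python. It is only an illustration under stated assumptions: the four grid primitives, the names `PRIMITIVES`, `run`, and `search`, and the example grids are all hypothetical, and real ARC DSLs are described above as having on the order of 100 to 200 primitives, not four.

```python
from itertools import product

# A toy DSL of grid-to-grid primitives (hypothetical, for illustration only).
def identity(g):
    return [row[:] for row in g]

def flip_h(g):
    return [row[::-1] for row in g]      # mirror left-right

def flip_v(g):
    return [row[:] for row in g[::-1]]   # mirror top-bottom

def transpose(g):
    return [list(col) for col in zip(*g)]

PRIMITIVES = {
    "identity": identity,
    "flip_h": flip_h,
    "flip_v": flip_v,
    "transpose": transpose,
}

def run(program, grid):
    """Apply a sequence of primitive names to a grid, left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid

def search(examples, max_depth=3):
    """Enumerate compositions of primitives, shortest first, and return
    the first program consistent with every (input, output) example."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(run(program, x) == y for x, y in examples):
                return program
    return None

# Two demonstration pairs of an unnamed transformation (a clockwise rotation).
examples = [
    ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
    ([[5, 0], [0, 7]], [[0, 5], [7, 0]]),
]

program = search(examples)
print(program)
print(run(program, [[1, 2, 3], [4, 5, 6]]))
```

The search space grows as the number of primitives raised to the depth, which is why this style of approach keeps the DSL small and spends its compute on depth.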
And I think where you're going to get the most value for your compute cycles is going to be somewhere in between. You want to leverage memorization to build up a richer, more useful bank of primitive programs.
And you don't want them to be hard-coded like what we saw for the typical ARC DSL. You want them to be learned from examples.
But then you also want to do some degree of deep search. As long as you're only doing very shallow search, you are limited to local generalization.
If you want to generalize further, more broadly, this depth of search is gonna be critical. I might argue that the reason that he had to rely so heavily on the synthetic data was because he used a 240 million parameter model because the Kaggle competition at the time required him to use a P100 GPU, which has like a tenth or something of the flops of an H100.
And so obviously, if you believe that scaling will solve this kind of reasoning, then you can just rely on the generalization of a bigger model, whereas he can't if he's using a much smaller model. For context for the listeners, by the way, the frontier models today are literally a thousand times bigger than that.
And so for your competition, from what I remember, the submission you have to submit can't make any API calls, can't go online, and has to run on an NVIDIA Tesla T4. P100.
Oh, is it P100? Yeah. Okay, so again, it's like significantly less powerful.
There's a 12-hour runtime limit, basically. There's a forcing function of efficiency in the eval.
But here's the thing. You only have 100 test tasks.
So the amount of compute you have available for each task is actually quite a bit, especially if you contrast that with the simplicity of each task. So it would be seven minutes per task, basically.
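The per-task budget follows directly from the two numbers mentioned in the conversation, a 12-hour runtime limit and 100 private test tasks. A small sketch of the arithmetic (the function and constant names are just illustrative):

```python
# Figures as stated in the conversation: 12-hour limit, 100 private test tasks.
RUNTIME_LIMIT_HOURS = 12
NUM_TEST_TASKS = 100

def minutes_per_task(hours: float, tasks: int) -> float:
    """Evenly split the total runtime budget across all test tasks."""
    return hours * 60 / tasks

budget = minutes_per_task(RUNTIME_LIMIT_HOURS, NUM_TEST_TASKS)
print(f"{budget:.1f} minutes per task")
```

This assumes the budget is split evenly; a submission is free to spend more time on some tasks and less on others, as long as the total stays under the limit.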
Which for, you know, people have tried to do these estimates of how many flops does a human brain have. And you can take them with a grain of salt, but as a sort of anchor, it's basically the amount of flops an H100 has.
And I guess maybe you would argue that, well, a human brain can solve this question in less than 7.2 minutes. So even with a tenth of the compute, you should be able to do it in seven minutes.
Obviously, there's less memory here: the brain has, you know, like petabytes of fast-access memory, versus these, you know, 29 or whatever gigabytes in this H100.
Anyway, I guess the broader question I'm asking is, I wish there was a way to also test this prize with some sort of scaffolding on the biggest models as a way to test whether scaling is the path to get to, you know, solving ARC. Absolutely.
So in the context of the competition, we want to see how much progress we can make with limited resources. But you're entirely right that it's a super interesting open question.
What could the biggest model out there actually do on ARC? So we want to actually also make available a private sort of one-off track where you can submit to us a VM. And so you can put on it any model you want.
Like you can take one of the largest open source models out there, fine-tune it, do whatever you want. And just give us an image.
And then we run it on the H100 for 24 hours or something. And you see what you get.
I think it's worth pointing out that there's two different test sets. There is a public test set that's in the public GitHub repository that anyone can use to train, you know, put it in an open API call, whatever you'd like to do.
And then there's the private test set, which is the 100 that is actually measuring the state of the art. So I think it is pretty open-ended and interesting to have folks attempt to at least use the public test set and go try it.
Now, there is an asterisk on any score that's reported against the public test set. Because it is public, it could have leaked into the training data somewhere.
This is actually what people are already doing. You can already try to prompt one of the best models, like the latest Gemini, the latest GPT-4, with tasks from the public evaluation set.
And, you know, again, the problem is that these tasks are available as JSON files on GitHub. These models are also trained on GitHub.
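For readers who have not opened one of those JSON files, here is a rough sketch of what a task looks like and how a solver could be scored on it. The layout, a `train` list and a `test` list of `input`/`output` grid pairs with small integers as colors, reflects the public repository's format as best understood here; the toy task itself and the helper names `score` and `mirror` are made up for illustration.

```python
import json

# A made-up task in the ARC JSON layout: demonstration pairs under "train",
# held-out pairs under "test". Grids are lists of rows of small integers.
TASK_JSON = """
{
  "train": [
    {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
    {"input": [[3, 3, 0]],      "output": [[0, 3, 3]]}
  ],
  "test": [
    {"input": [[0, 4], [5, 0]], "output": [[4, 0], [0, 5]]}
  ]
}
"""

def score(task, solver):
    """Fraction of test pairs where the solver's grid matches exactly."""
    pairs = task["test"]
    solved = sum(solver(p["input"]) == p["output"] for p in pairs)
    return solved / len(pairs)

def mirror(grid):
    """Candidate solver: flip each row left-to-right."""
    return [row[::-1] for row in grid]

task = json.loads(TASK_JSON)
print(len(task["train"]), "demonstration pairs,", len(task["test"]), "test pair(s)")
print("mirror solver score:", score(task, mirror))
```

Scoring is all-or-nothing per grid, which is why a model that has seen the exact file during training can look far better than it would on a freshly written task.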
So they're actually trained on these tasks. And yeah, that kind of creates uncertainty about if they can actually solve some of the tasks.
Is that because they memorized the answer or not? You know, maybe you would be better off trying to create your own private arc-like, very novel test set. Don't make the task difficult.
Don't make them complex. Make them very obvious for humans.
But make sure to make them original as much as possible. Make them unique, different.
And see how well your GPT-4 and so on, or GPT-5, does on them. Well, there have been tests on whether these models are being overtrained on these benchmarks.
Scale recently did this with GSM8K. That's really interesting. They basically replicated the benchmark with different questions.
And so some of the models actually were extremely overfit on the benchmark, like Mistral and so forth. But the frontier models, Claude and GPT, actually did as well on the novel benchmark as they did on the specific questions that were in the existing public benchmark.
So I would be relatively optimistic about them just sort of training on the JSON. I was joking with Mike that you should allow API access, but sort of keep an even more private validation set of these ARC questions.
And so allow API access, people can sort of play with GPT-4 scaffolding to enter into this contest. And if it turns out, maybe later on, you run the validation set on the API.
And if it performs worse than the test set that you allowed the API access to originally, that means that OpenAI is training on your API calls and you go public with this and show them, oh my God, they've leaked your data. We do want to make, we want to evolve the Arc dataset.
That is a goal that we want to do. I think, Francois, you mentioned, it's not perfect.
Yeah, no, Arc is not a perfect benchmark. I mean, I made it like four years ago, over four years ago, almost five now.
This was in a time before LLMs. And I think we've learned a lot since, actually, about what potential flaws there might be. I think there is some redundancy in the set of tasks, which is, of course, against the goals of the benchmark.
Every task is supposed to be unique. In practice, that's not quite true.
I think also every task is supposed to be very novel, but in practice, they might not be. They might be structurally similar to something that you might find online somewhere.
So we want to keep iterating and release an ARC 2 version later this year. And I think when we do that, we're going to want to make the old private test set available.
So maybe we won't be releasing it publicly, but what we could do is just create a test server where you can query, get a task, you submit a solution, and of course you can use whatever frontier model you want there. So that way, because you actually have to query this API, you're making sure that no one is going to by accident train on this data.
It's unlike the current public ARC data, which is literally on GitHub. So there's no question about whether the models are actually trained on it.
Yes, they are, because they're trained on GitHub. So gating access behind querying this API would avoid this issue.
And then we would see, you know, for people who actually want to try whatever technique they have in mind using whatever resources they want, that would be a way for them to get an answer. I wonder what might happen.
I'm not sure. One answer is that they've come up with a whole new algorithm for AI with some explicit program synthesis that now we're on a new track.
And another is they did something hacky with the existing models, in a way that actually is valid, which reveals that maybe intelligence is more about getting things to the right part of the distribution, but then it can reason. And in that world, I guess that will be interesting, and maybe that'll indicate that, you know, you had to do something hacky with current models, but as they get better you won't have to do something hacky. I'm also going to be very curious to see whether these multimodal models will perform natively much better at ARC-like tests.
If ARC survives three months from here, we'll up the prize. I think we're about to have a really important moment of contact with reality by blowing up the prize, putting a much bigger prize pool against it. We're going to learn really quickly if there's low-hanging fruit of ideas. Again, I think new ideas are needed.
I think anyone listening to this might have the idea in their head. And I'd encourage everyone to like give it a try.
And I think as time goes on, that adds strength to the argument that we've sort of stalled out in progress and that new ideas are necessary to beat ARC. Yeah, that's the point of having a money prize: you attract more people, you get them to try to solve it.
And if there's an easy way to hack the benchmark that reveals that the benchmark is flawed, then you're going to know about it. In fact, that was the point of the original Kaggle competition back in 2020 for ARC.
I was running this competition because I had released this data set and I wanted to know if it was hackable, if you could cheat. So there was a small money prize at the time that was like 20K.
And this was right around the same time as GPT-3 was released. So people of course tried GPT-3 on the public data.
It scored zero. But I think what the first contest taught us is that there is no obvious shortcut, right? And well, now there's more money, so there are going to be more people looking into it. We're going to find out. We're going to see if the benchmark is going to survive.
And, you know, if we end up with a solution that is not trying to brute-force the space of possible ARC tasks, that's just trained on core knowledge, I don't think it's necessarily going to be in and of itself AGI, but it's probably going to be a huge milestone on the way to AGI.
Because what it represents is the ability to synthesize a task, a problem-solving program from just two or three examples. And that alone is a new way to program.
It's an entirely new paradigm for software development where you can start programming potentially quite complex programs that will generalize very well. And instead of programming them by coming up with the shape of the program in your mind and then typing it up, you're actually just showing the computer what output you want.
And you let the computer figure it out. I think that's extremely powerful.
I want to riff a little bit on what kinds of solutions might be possible here, and which you would consider sort of defeating the purpose of ARC and which are sort of valid. Here's one I'll mention, which is my friends Ryan and Buck stayed up last night because I told them about this. And they were like, oh, of course LLMs can solve this. Good, thank you for spreading the word.
Of course LLMs can solve this. And then so they were trying to prompt, I think, Claude Opus on this. And they say they got 25% on the public ARC test.
And what they did was have other examples of some of the ARC tests and in context explain the reasoning of why you went from one output to another output. And then now you have the current problem.
And I think also maybe expressing the JSON in a way that is more amenable to the tokenizer. And another thing was using the code interpreter.
So I'm curious, actually, whether you think the code interpreter, which keeps getting better as these models get smarter, is just the program synthesis right there. Because what they were able to do was get the actual output of the cells, the JSON output, through the code interpreter, like writing the Python program that gets the right output here. Do you think that the program synthesis kind of research you're talking about will look like just using the code interpreter in large language models? I think whatever solution we see that will score well is going to probably need to leverage some aspects from deep learning models and LLMs in particular.
We've shown already that LLMs can do quite well. That's basically the Jack Cole approach.
We've also shown that pure discrete program search from a small DSL does very, very well. Before Jack Cole, this was the state of the art.
In fact, it's still extremely close to the state of the art, and there's no deep learning involved at all in these models. So we have two approaches that have basically no overlap that are doing quite well, and they're very much at two opposite ends of one spectrum, where on one hand you have these extremely large banks of millions of vector programs but very, very shallow, simplistic recombination, and on the other hand you have very simplistic DSLs, very simple, like 100 or 200 primitives, but very deep, very sophisticated program search.
The solution is going to be somewhere in between. So the people who are going to be winning the ARC competition, and who are going to be making the most progress towards near-term AGI, are going to be those that manage to merge the deep learning paradigm and the discrete program search paradigm into one elegant way.
You know, you asked what would be legitimate and what would be cheating. So, for instance, if you want to add a code interpreter to the system, I think that's great. That's sort of legitimate.
The part that would be cheating is trying to anticipate what might be in the test set, like brute-forcing the space of possible tasks and then training a memorization system on it. And then relying on the fact that you're generating so many tasks, like millions and millions and millions, that inevitably there's going to be some overlap between what you're generating and what's in the test set.
I think that's defeating the purpose of the benchmark, because then you can just solve it without needing to adapt, just by fetching a memorized solution. So hopefully ARC will resist that, but, you know, no benchmark is necessarily perfect.
So maybe there's a way to hack it. And I guess we are going to get an answer very soon.
Although I think some amount of fine-tuning is valid, because these models don't natively think in these terms, especially the language models alone, which are the open source models that people would have to use to compete here. They're, you know, natively language models.
So they need to be able to think in this kind of, yes, the ARC-type way. You want to input core knowledge, ARC-like core knowledge, into the model, but surely you don't need tens of millions of tasks to do this.
Like, core knowledge is extremely basic. If you look at some of these arc type questions, I actually do think they rely a little bit on things I have seen throughout my life.
And for the same reason, like, for example, like something bounces off a wall and comes back and you see that pattern. It's like I played arcade games and I've seen like Pong or something.
And I think, for example, when you see the Flynn effect, with people's intelligence as measured on Raven's Progressive Matrices increasing on these kinds of questions, it's probably a similar story, where now, since childhood, we actually see these sorts of patterns in TV and whatever, spatial patterns.
And so I don't think this is sort of core knowledge. I think actually this is also part of the quote-unquote fine-tuning that humans have as they grow up, of seeing different kinds of spatial patterns and trying to pattern match to them. I would definitely file that under core knowledge. Core knowledge includes basic physics, for instance bouncing or trajectories. That would be included. But yeah, I think you're entirely right. The reason why, as a human, you're able to quickly figure out the solution is because you have this set of building blocks, this set of patterns in your mind that you can recombine.
Is core knowledge required to attain intelligence? Any algorithm you have, does the core knowledge have to be in some sense hard-coded or can even the core knowledge be learned through intelligence? Core knowledge can be learned. And I think in the case of humans, some amount of core knowledge is something that you're born with.
Like we're actually born with a small amount of knowledge about the world we're gonna live in. We're not blank slates.
But most core knowledge is acquired through experience. But the thing with core knowledge is that it's not gonna be acquired, like for instance, in school.
It's actually acquired very, very early in the first like three to four years of your life. And by age four, you have all the core knowledge you're going to need as an adult.
Okay, interesting. So, I mean, on the prize itself, I'm super excited to see both the open source versions, maybe with a Llama 70B or something, and what people can score in the competition itself.
Then, to specifically test the scaling hypothesis, I'm very curious to see if you can prompt on the public version of ARC, which I guess won't be competitive, you won't be able to submit to this competition itself, but I'd be very curious to see if people can crack that and get ARC working there, and if that would update your views on AGI. It would certainly be motivating. We're going to keep running the contest until somebody puts a reproducible open source version into the public domain.
So even if somebody privately beats the ARC eval, we're going to still keep the prize money until someone can reproduce it and put the public reproducible version out there.
Yeah, exactly. Like, the goal is to accelerate progress towards AGI. And a key part of that is that any sort of meaningful bit of progress needs to be shared, needs to be public, so everyone can know about it and can try to iterate on it. If there's no sharing, there's no progress.
What I'm especially curious about is sort of disaggregating the bets of, can we make an open version of this, versus is this a thing that's just possible with scaling. And we can, I guess, test both of them based on the public and the private version.
We're making contact with reality as well with this, right? We're going to learn a lot, I think, about what the actual limits of the compute are. If someone showed up and said, hey, here's a closed source model that I'm getting 50-plus percent on, I think that would probably update us on, okay, perhaps we should increase the amount of compute that we give on the private test set in order to balance. Some of the decisions initially are somewhat arbitrary, in order to learn about, okay, what do people want? What does progress look like? And I think both of us are committed to evolving it over time in order to make it the best, or the closest to perfect, we can get it.
Awesome. And where can people go to learn more about the prize and maybe try their hand at it? arcprize.org.
Which goes live today. It's live now.
One million dollars is on the line, people. Good luck.
Thank you guys for coming on the podcast. It was super fun to go through all the cruxes on intelligence and get a different perspective, and also to announce a prize here. So this is awesome. Thank you for helping break the news. Thank you for
