Virtual Cell Models, Tahoe-100 and Data for AI-in-Bio with Vevo Therapeutics and the Arc Institute

57m
On this week’s episode of No Priors, Sarah Guo is joined by leading members of the teams at Vevo Therapeutics and the Arc Institute – Nima Alidoust, CEO/Co-Founder at Vevo Therapeutics; Johnny Yu, CSO/Co-Founder at Vevo Therapeutics; Patrick Hsu, CEO/Co-Founder at Arc Institute; Dave Burke, CTO at Arc Institute; and Hani Goodarzi, Core Investigator at Arc Institute. Predicting protein structure (AlphaFold 3, Chai-1, Evo 2) was a big AI/biology breakthrough. The next big leap is modeling entire human cells—how they behave in disease, or how they respond to new therapeutics. The same way LLMs needed enormous text corpora to become truly powerful, Virtual Cell Models need massive, high-quality cellular datasets to train on. In this episode, the teams discuss the groundbreaking release of the Tahoe-100M single cell dataset, Arc Atlas, and how these advancements could transform drug discovery.
Sign up for new podcasts every week. Email feedback to show@no-priors.com
Follow us on Twitter: @NoPriorsPod | @Saranormous | @Nalidoust | @IAmJohnnyYu | @pdhsu | @Davey_Burke | @Genophoria
Download the Tahoe Dataset

Show Notes:
0:00 Introduction
1:40 Significance of Tahoe-100M dataset
4:22 Where we are with virtual cell models and protein language models
10:26 Significance of perturbational data
17:39 Challenges and innovations in data collection
24:42 Open sourcing and community collaboration
33:51 Predictive ability and importance of virtual cell models
35:27 Drug discovery and virtual cell models
44:27 Platform vs. single hypothesis companies
46:05 Rise of Chinese biotechs
51:36 AI in drug discovery


Transcript

Speaker 1 Hi listeners, welcome back to No Priors.

Speaker 1 Today we're here with the CEO, CTO, and a core investigator of the Arc Institute, as well as the co-founders of Vevo, to talk about their release of Tahoe-100M, the largest single-cell drug-perturbation data set ever created, as well as where we are in AI for biology, why we need a virtual cell model and not just protein structure prediction models, and when we should finally expect to see treatments from this growing use of machine learning in bio.

Speaker 2 I'm Johnny and I work on single cell RNA sequencing at Vevo.

Speaker 3 I'm Nima. I'm one of the founders together with Johnny.
I'm a quantum chemist by background but I've converted to

Speaker 3 being a computational chemist that loves playing with biological data. And we are building Vevo to really do that, to predict how chemicals interact with cells in different biological contexts.

Speaker 3 Some people call it the virtual cell. That's basically what we're working on.

Speaker 4 I'm Patrick Hsu, one of the founders at the Arc Institute, which is working at the interface of biology and machine learning to try to understand and

Speaker 4 one day treat complex human diseases, which are most of the major killers.

Speaker 5 I'm Dave, CTO at Arc Institute, focused on computational biology and building novel AI models for biology.

Speaker 6 I'm Hani, I'm a core investigator at Arc. I work very closely with Dave and Patrick to push our virtual cell initiative.

Speaker 1 Congratulations, everyone. It's a big day.
Let's jump right into it. What is Tahoe-100M and what is the significance of it?

Speaker 2 So Tahoe-100M is the world's biggest single-cell RNA sequencing data set. And it enables basically a ton of machine learning applications, including things like the virtual cell.

Speaker 2 but it also enables a lot of drug discovery applications.

Speaker 2 And broadly, in the context of where I think we are as a field, it's kind of the beginning of a different way of doing drug discovery, of basically understanding how to build medicines, and of bringing AI and machine learning people into the mix.

Speaker 3 And maybe something I would add there as well.

Speaker 3 Over the last 20 years or so, people have accumulated a massive amount of data points when it comes to protein structures, protein function, how drug molecules interact with proteins.

Speaker 3 But one thing that we haven't had as much is how

Speaker 3 different cells behave in different contexts and how different genes within each of those cells actually function in the presence of the other genes

Speaker 3 in these different biological contexts.

Speaker 3 We believe this is the era for that right now. You have seen the emergence of protein language models built on the data sets that have been accumulated over the last two decades.

Speaker 3 But now is the era for actually having data on cells, how they function, how they interact with drug molecules.

Speaker 3 And exactly what Johnny is saying, Tahoe is really a landmark data set there that allows us to really measure how drugs interact with different cells from different patient models.

Speaker 3 And that gives us the ability to build similar models that we built in protein language models, but

Speaker 3 in the cellular kind of context.

Speaker 5 If you think about it, actually, the history of AI is punctuated by these data sets that come about, right?

Speaker 5 Like if you think about ImageNet in 2009 that Fei-Fei Li put together, and you look at what that did to drive sort of a nonlinear jump in machine vision, I think the hope here is that by producing data sets, particularly perturbational data sets, that allow us to elicit

Speaker 5 cellular responses, that we'll be able to actually drive forward the ability to model at the cellular level, not just at the protein level. And so I think this is one of those moments, hopefully.

Speaker 4 Yeah, so lots of people have been talking about what those foundational data sets look like for biology, right?

Speaker 4 And this has been really useful for training protein structure prediction models like AlphaFold built on CASP, the competition built on top of PDB data, but how do you do this for cells and cellular dynamics, which is really what tells us about biology and how it responds in health and disease.

Speaker 4 So I think those are the core steps forward where we want to bring up our ability to study higher levels of abstraction in biology, not just the individual molecular machines, but how they operate in the context of an entire cell.

Speaker 1 Congrats also to the entire ARC team.

Speaker 1 Given you are working on both virtual cell models and protein structure prediction, protein language models, can you contextualize a little bit why we need both and where we are in the progress of each?

Speaker 4 I think we're learning that, right?

Speaker 4 We're looking at these emergent properties of biology by training these large-scale foundation models on nucleic acids and these virtual cell models that we'll talk more about today. And

Speaker 4 we have this debate

Speaker 3 often internally.

Speaker 5 So I have a sort of engineering computer background, so the way I think about it is, if you think about the cell, the DNA lives in the ROM, the read-only memory, right? So it's coding for the cell.

Speaker 5 But then the RNA lives in the RAM, so it's like the working memory. And the RNA is constantly changing its expression level.

Speaker 5 It's almost like one of those 1980s graphic equalizers, where you get like 20,000 bars, one for each gene.

Speaker 5 And it's constantly adjusting its expression level depending on what the cell is experiencing, whether that's sort of the environment, whether it's stress, whether it's aging, whether it's like disease state or healthy state.

Speaker 5 And what we're trying to do, you know, with this data, I think, as a field, is create these virtual cell models, which in a way is kind of inferring a notional CPU for the cell.

Speaker 5 So like, how does the cell respond to an input? That input could be, you know, an edited gene, it could be an application of a drug, and then how does that reflect in the transcriptomic profile?

Speaker 5 And so that CPU is sort of an analogy to the AI model that you want to build.

Speaker 5 And then once you have an AI model, what's really interesting is you can start posing the inverse question, which is, you know, given a cell in a certain disease state that's exhibiting a certain transcriptomic profile, how do I perturb that cell, whether that's a gene edit or that's a drug, to perturb it back into that healthy state?

Speaker 5 And I think that's what's really exciting about this data, which then creates, enables these models, which then enables these tools, and hopefully it can accelerate drug discovery.
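To make that forward/inverse framing concrete, here is a minimal, purely illustrative Python sketch: a stand-in for a trained virtual cell model that maps an expression state plus a perturbation to a predicted profile, and the inverse question posed as a brute-force search over a candidate perturbation library. The model, the encoding, and the library are hypothetical toys, not anything from the Tahoe or Arc work.

```python
# A toy stand-in for a trained virtual cell model; nothing here reflects a
# real architecture, it only illustrates the forward/inverse framing above.
import numpy as np

rng = np.random.default_rng(0)
N_GENES = 200  # kept tiny; real transcriptomes have ~20,000 genes

# Stand-in for learned weights: how each perturbation feature moves each gene.
RESPONSE = rng.standard_normal((N_GENES, N_GENES)) * 0.05

def virtual_cell(state: np.ndarray, perturbation: np.ndarray) -> np.ndarray:
    """Forward question: predict the expression profile after a perturbation."""
    return state + RESPONSE @ perturbation

def inverse_design(diseased: np.ndarray, healthy: np.ndarray, library) -> int:
    """Inverse question: which candidate perturbation (a gene edit or a drug)
    moves the diseased profile closest to the healthy one? Brute-force search."""
    scores = [np.linalg.norm(virtual_cell(diseased, p) - healthy) for p in library]
    return int(np.argmin(scores))

diseased = rng.standard_normal(N_GENES)
healthy = rng.standard_normal(N_GENES)
library = [rng.standard_normal(N_GENES) for _ in range(100)]  # toy "drug" library
print("best candidate:", inverse_design(diseased, healthy, library))
```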

Speaker 6 And one thing I will quickly add to that is that

Speaker 6 when we think about different domains in biology, and building AI models of those domains, there are parts of it where

Speaker 6 we are data poor, and there are parts where we are compute limited.

Speaker 6 I think when it comes to, for example, DNA language models, again, thanks to the field and decades of having sequenced a ton of genomes, we are not as much data limited. But compute, and specifically context, how long a stretch of DNA we can actually consume, what size of inputs, and all of that, is actually a big limitation that we have tried to solve.

Speaker 6 But when it comes to kind of cell state models, that is an area that we are absolutely very much data limited because being able to profile cells at single cell resolution is basically a new technology.

Speaker 6 You know, it has emerged over the past decade, but really kind of the explosion of it over the past five, six years.

Speaker 6 And we are just getting there to be able to generate that kind of data at its scale.

Speaker 3 And maybe the one thing I would add: it's not just the scale.

Speaker 3 I think the idea here is that,

Speaker 3 before scBaseCamp, which is the data set that's being released together with Tahoe on the Virtual Cell Atlas that was created by the Arc folks by basically collating all of the publicly available data,

Speaker 3 the number of human cells that had been collated together was in the order of 45, 50 million, if you're generous, 60 million single-cell data points.

Speaker 3 But the scale is one thing. The question is, you know, how much information content there is in this data as well.

Speaker 3 Quality, yeah. And are they coming from very different biological contexts?

Speaker 3 We actually built

Speaker 3 the early versions of some of those virtual cell models. We call them single-cell foundation models or

Speaker 3 whatever name you actually use for them. And what we saw is that if you take the 16 million and downsample it by like

Speaker 3 even 99%. You know, you just use 1% of that data to train your models.
Actually, the model's performance doesn't reduce that much.

Speaker 3 So it means that the information content of the data you're using for training those models is not amazing.

Speaker 3 So having data that comes from very different biological contexts, that's very key in providing the information content for the model so it can learn.

Speaker 3 And that goes back to what Dave was saying, the perturbational data sets.

Speaker 3 Perturbation allows you to create new contexts, allows you to create new cell states that then the model can learn from and therefore be used for

Speaker 3 different types of applications. And then I'll let Johnny later talk about maybe like what is the challenge with this perturbation, creating this perturbational data set.

Speaker 1 Before we go there, actually, can we zoom out for a second and just have you describe in layman's terms what the data actually tells you and where the prior data came from, even if it was information poor?

Speaker 2 If you look at the data that's been generated over the past decade, it's basically all kinds of academic groups like us or some people in industry generating all these little data sets.

Speaker 2 And there's a ton of problems with this. First, there's batch effects.
So even one person running an experiment on two different days, their data looks different, even if it's the same cells.

Speaker 2 And so when you think about trying to build the internet of biology, which is what you need to build this ChatGPT moment. In terms of scale.
In terms of scale, right? Because you need big data.

Speaker 2 Machine learning is not going to do anything for us if we don't have big data.

Speaker 2 You have a data set that's poorly labeled, that's super batchy, that's maybe moderately useful for AI, but it's not there. And so this data set...

Speaker 2 It is basically doubling the size of all the data that's out there cumulatively over the past decade. It covers 50 different cancer models from different patients.

Speaker 2 So it's cells from 50 different patients, 1,200 drug treatments. So it's a really deep and rich data set that effectively has no batch effects.

Speaker 2 And so we think this is actually not only an additional data set for machine learning, we actually think it's the first data set that's going to enable machine learning in this space.
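As a rough illustration of the shape of such a data set, here is a sketch of how one might load and summarize perturbational single-cell data with the standard anndata library. The file path and the metadata column names (drug, cell_line) are hypothetical; the released data's actual schema may differ.

```python
# A sketch of inspecting a Tahoe-style perturbational data set with anndata.
# The path and the "drug" / "cell_line" metadata columns are hypothetical.
import anndata as ad

adata = ad.read_h5ad("tahoe_subset.h5ad")  # cells x genes expression matrix

print(adata.n_obs, "cells x", adata.n_vars, "genes")
print("drug treatments:", adata.obs["drug"].nunique())      # e.g. ~1,200
print("cancer models:", adata.obs["cell_line"].nunique())   # e.g. 50

# Cells per (drug, cell line) condition: the unit of one perturbation readout.
print(adata.obs.groupby(["drug", "cell_line"]).size().describe())
```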

Speaker 4 One thing that might be worth touching on is why perturbational data, right? And I think the key is that we're going from correlation, which is what a lot of biological research is.

Speaker 4 it's descriptive, right? You kind of stare at things, you try to see when you poke this way, what else is changing, and go from associative changes to causation, right?

Speaker 4 And that's where going with genetic or chemical perturbations allows you to have a very clear before and after where you have the set, you know, of causal changes that can actually drive a particular cell state.

Speaker 4 The key is to be able to do this in a generalizable way. So you can look across many different cell types, many different tissue types.

Speaker 4 An ML model, in order to learn a general sense of cell state possibilities, would need to then train on that diversity of data as well.

Speaker 5 I mean, in a topological sense, what the model is trying to do is create a manifold in a high-dimensional latent space, and to actually explore that manifold, the model needs to see lots of different perturbations and responses.

Speaker 5 And then once you do that, you have this generalized manifold that allows the model to make predictions for data that it hadn't seen in its sample that still fits the manifold.

Speaker 3 To make it even more tangible, the data that was available publicly before this,

Speaker 3 almost the entirety of it, it comes from

Speaker 3 healthy tissue. Very little comes from actually diseased cells.

Speaker 3 And almost all of it, not the entirety, almost all of it is observational in the sense you take cells from a liver sample and you do single cell RNA sequencing on that.

Speaker 3 And that basically has the limitation that Patrick was talking about, does it capture the causality of the gene-gene interactions you're trying to model?

Speaker 3 And the second piece is, does it allow you to model how a new perturbation will actually impact the cells, whether it's genetic perturbation or drug perturbation, which really is the focus for Tahoe in this situation, perturbational data sets.

Speaker 3 So in that sense, like Tahoe, I think when you put all of the perturbational data sets in the world together,

Speaker 3 if you're generous, it's like one to two million single cell data points.

Speaker 3 I mean, this is publicly available data. We don't know as much about what's inside different organizations.
But publicly available is 2 million; Tahoe is 100 million.

Speaker 3 So we have basically increased that massively.

Speaker 3 Now, when you couple that with this huge amount of observational data sets from different species that are in the world, which is basically what the Arc folks did, they put together the entirety of that data set.

Speaker 3 It turns out to be 200 to 230 million single-cell data points already out there.

Speaker 3 And they have tried to reduce as much as possible the variations between these data sets so they're consistent with each other, so they can train machine learning models on.

Speaker 3 That's the significance of this data.

Speaker 4 I want to make a finer point on this, right? I think the key is if you want a model that can learn about changes going on in the heart or in the brain or in the liver or in the bones, right?

Speaker 4 You need to be able to train across all of those different cell types.

Speaker 4 But if you just look at normal healthy cells, you wouldn't necessarily learn about how the manifold in latent space changes in disease.

Speaker 4 And so being able to look at many different types of tissue types across different cancers is one way to be able to get at those really critical disease states that both basic science but drug discovery really cares about.

Speaker 1 How should we think about 100 million data points or 230 million data points and the scale of this release in terms of where we are, is that enough to be useful?

Speaker 1 What do we know about scaling laws now?

Speaker 6 The short answer is: that's a very hard question. We won't know until we get there.

Speaker 6 What we can draw inspiration from is basically large language models

Speaker 6 in human language and also things like DNA language models

Speaker 6 where we do have enough data to do scaling laws. And

Speaker 6 and where we are around there, around 1 trillion training tokens is where you want to hit, by and large.

Speaker 3 Like GPT-3 was, I think, half a trillion tokens. ESM3 was 700 billion tokens, so close to a trillion.

Speaker 6 Yeah, so a trillion sounds like a comfortable mark to hit.

Speaker 6 So

Speaker 6 then the question becomes, how do you count tokens? Because cells, in the end, are not exactly sentences. But, you know, our genes and their expression, if you count them as tokens, I think this

Speaker 6 this collection that we have put together

Speaker 6 I think gets us close to where we want to be to start asking and answering those questions, actually.

Speaker 6 So I think, you know, it puts us at a few hundred billion training tokens for the kinds of model architectures that we have now, if you think of a cell capturing, for these data sets, 2,000 to 5,000 genes.

Speaker 3 And each gene and its expression is basically a token in what we're doing. So 100 million single-cell data points is akin to around 200 to 300 billion tokens.
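The back-of-the-envelope arithmetic behind those figures, under the stated assumption that each detected gene and its expression counts as one token:

```python
# Token math from the discussion: tokens ~= cells x detected genes per cell,
# treating each gene-and-expression pair as one token.
cells = 100_000_000  # Tahoe-100M scale
for genes_per_cell in (2_000, 3_000, 5_000):
    print(f"{genes_per_cell} genes/cell -> {cells * genes_per_cell / 1e9:.0f}B tokens")
# 2,000-3,000 genes per cell gives the 200-300 billion token figure quoted above.
```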

Speaker 3 Now, there's a finer point there, which is like how much of this, how many of these tokens are actually informative to the model.

Speaker 1 I'm not asking this question the correct way, but you will understand the gist of it. How do you decide where in the genetic landscape to start? How do you choose perturbations?

Speaker 2 I think you want to match, and this goes kind of the same with drugs, your perturbation toolkit, which is like the kind of arrows you throw at the biology, against the biology you have.

Speaker 2 So for cancer, that means going after cancer-relevant genes, genes that impact growth of cells, genes that impact DNA regulation, and also drugs that target key cancer pathways.

Speaker 2 So that works for cancer-relevant questions. But even though this data set is heavily based around these kinds of chemical perturbations of cancer, these pathways are so conserved and fundamental that they broadly apply to the neuroscience space, or just to immune cell development in general.

Speaker 2 So I think it's really the foundation model that's going to be able to take this data, ingest it, build a model, and then train and then understand basically how to like translate that data to a different context entirely.

Speaker 2 Yeah, so this is the key.

Speaker 2 I think this is one of the really special things we have at Vevo, and it's this Mosaic platform.

Speaker 2 So it allows us to take cells from many different patients and then in cancer this means all kinds of cancer, lung cancer,

Speaker 2 pancreatic cancer, et cetera, et cetera, from different patients which have their own special genetics and pool them together into a single mosaic tumor which we then can reproducibly screen hundreds or thousands of drugs against.

Speaker 2 And so this key innovation basically allows us, instead of testing one cancer model at a time, to test tens or hundreds. And it makes this a really scalable data generation platform.
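In rough numbers, using the ballpark figures quoted in this conversation (50 models, 1,200 treatments, 60,000 drug-by-model interactions), the leverage of the pooled design looks like this:

```python
# Leverage of the pooled design, in the ballpark figures from the episode.
cell_lines = 50      # patient-derived cancer models pooled into one mosaic tumor
treatments = 1_200   # drug treatments screened against the pool

interactions = cell_lines * treatments   # drug x model readouts
screens_unpooled = interactions          # one model per screen, run separately
screens_pooled = treatments              # all 50 models share each screen

print(interactions)                          # 60000 interactions
print(screens_unpooled // screens_pooled)    # 50x fewer screening runs
```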

Speaker 2 This is what we use to generate Tahoe-100M. When we think about actually how we build these pools in terms of information content, we want to maximize it by covering a lot of cancer patients, right?

Speaker 2 So, this data set, we covered the biggest cancer types by how frequently they occur annually.

Speaker 2 But then, as we continue to grow this data set, we want to think about rare disease, bring in maybe more coverage of different parts of the cancer space

Speaker 2 informed by the machine learning that basically will help us fill in the gaps in the foundation models.

Speaker 3 Another direction is chemical space. So the question is about, you know, how do we prioritize. But frankly,

Speaker 3 when you generate 50 times more perturbational data in five weeks than is publicly available, and those public data sets

Speaker 3 have been generated over 10 years, you don't have to prioritize as much. And that's the beauty of it, in my opinion.

Speaker 3 You know, you can go large on the chemical space, you can go large on the patient sample space.

Speaker 3 And that way, you don't have to really a priori come up with a hypothesis about what it is that I have to feed the models. You can just generate as much as you want.

Speaker 3 Exactly. Hypothesis-free, unbiased kinds of data generation.
That's really, I think, the beauty here.

Speaker 6 Yeah,

Speaker 3 let the data surprise you, exactly.

Speaker 3 And this is like one of those things that I like to talk about as well. And I hope the

Speaker 3 people we have here are the representatives of the new generation of biologists. But I think one thing that

Speaker 3 has been slowing the progress in bio is the fact that we have always been super hypothesis driven.

Speaker 3 And I think the reason is that a lot of these experiments are expensive, you know; they take a lot of time, a lot of resources.

Speaker 3 But I think now is really the time: the sequencing cost has gone down, the single-cell cost per sample has gone down, compute cost has gone down.

Speaker 3 I think it's also time to change that kind of mentality in bio as well and be a little more courageous, you know, be a little more freewheeling in terms of your data generation and the kinds of samples you put together.

Speaker 3 So, yeah, I think

Speaker 3 this is the view from an outsider.

Speaker 1 I want to talk about being more ambitious in bio and the open sourcing of this in a second, but I think we should just zoom out and talk about, in layman's terms, what the platform does, and you can correct me if any of this is wrong.

Speaker 1 So, you have these tumors that are a mosaic of cells from different patients representing a huge amount of patient genetic variation. And each mouse then can actually be

Speaker 1 treated with different drugs where the signal you extract after is the interaction of drugs against each of these different patient types.

Speaker 3 That's right. Okay.

Speaker 1 Nobody else thinks this is crazy?

Speaker 3 Not crazy, because it's happening every day in our labs, but it's really science fiction, honestly.

Speaker 1 I'm just trying to boil it down to a very simple non-biologist understanding of when you say it's a platform with this super tumor where you can pull all of this data out, it is wild to think about how efficient that is in comparison to, well, we will observe one patient type at a time.

Speaker 4 I think this is actually a super interesting point.

Speaker 4 If you map the number of tokens per experiment across the last 50 years of biomedical research, it'll look like the hockey stick that you know, all investors and founders really know and love, right?

Speaker 4 Just going up and to the right, right? And I think the way that we think about doing science is changing, right, based on this.

Speaker 4 And there's, I think, a roiling discussion today about hypothesis-driven versus hypothesis-free research, right? Should we be doing mechanism versus large-scale profiling?

Speaker 4 But honestly, I think this stuff is going to wash out with scale. Yeah.

Speaker 3 Exactly. You don't have to choose between those two.

Speaker 4 And maybe that's my hot take with this era of machine learning and biology is the vast majority of mechanistic data that's been generated to date is really made to ask very specific, very well-scoped questions.

Speaker 4 And just

Speaker 4 way more tokens per experiment is

Speaker 4 just going to be the way to do it.

Speaker 6 I mean, maybe I can say it another way. I think in biology, what we have done is we have treated humans as the foundation models that ingest information and come up with hypotheses.

Speaker 6 But now we actually want to go beyond that because humans, of course, come with their own

Speaker 6 intuitions and biases and all of that.

Speaker 6 At UCSF, for example, we often say that we use some of our medicinal chemist folks, like Kevan Shokat, as kind of the last layer of a neural network. They have built this intuition of,

Speaker 6 is this chemical that I

Speaker 6 generated via this AI model, does it actually look like something that is real, right? And they can't even, like, verbalize why they think it might be.

Speaker 4 People criticize these models for hallucinating, but if you think about it, the process of scientific research just involves hallucination, right? That's what creativity is.

Speaker 1 Yeah. So you're all adherents to Sutton's bitter lesson in this field as well: the intuition being baked into the models or the process is not the right thing, we just need to scale data. Or at least we hope that you don't have to make that choice here.

Speaker 4 You know, we're seeing evidence of scaling laws in biology across proteins, right?

Speaker 4 That's been shown in the protein language models across DNA, which is what we've shown in our Evo series of models from ARC.

Speaker 4 We're also seeing inference-time scaling laws in our most recent study.

Speaker 4 So they're sort of early signs of promise, although we'll need good benchmarks and we'll have to look at this across different data types over time.

Speaker 3 And the funny thing for me is that if you've been in this field for long enough, and

Speaker 3 I come from the quantum and computational chemistry side of things,

Speaker 3 every time that you take a certain success from field A and you want to translate it to field B, a lot of people, including

Speaker 3 in our own organizations, they come up with a list of 100 different reasons why field A, the learnings are not applicable in field B.

Speaker 3 But then you get surprised every time. And then the next time when you're trying to do the same thing with B to C,

Speaker 3 the same kind of lists actually start emerging. In a way, I think something that's underappreciated is that

Speaker 3 those same models that learned human language, they are learning the language of structural biology. And then with the Evo work, they are learning the language of DNA, you know?

Speaker 3 And this is incredible.

Speaker 3 I think

Speaker 3 it's not trivial. And by the way,

Speaker 3 again, if you have been in the field long enough, you know that there were a lot of people who were saying, no, no, these protein language models are never going to work.

Speaker 3 You need to have domain-specific kind of models to model this kind of phenomena. So in this way,

Speaker 3 I think this is really the ethos that we have to bring here.

Speaker 3 When Hani was saying that we should use the learnings from those models to actually translate here, that's exactly what we should be doing.

Speaker 3 We should be thinking about what worked and at least try it in these new domains. The domain we are talking about, the domain Vevo is excited about, and also the virtual

Speaker 3 cell part of Arc is excited about, is the language of systems biology. The first thing you should be doing is to try out the things that worked in the other domains, in this domain.

Speaker 3 Maybe it works, maybe it doesn't work, but if you don't try, you'll never know.

Speaker 1 Music to my ears, given this is one of the only things we have really strong conviction about, at least at the fund we invest out of, that

Speaker 1 a bunch of these techniques, they work and they scale in domains where, you know, people are not sure yet,

Speaker 1 quite generally, where we wouldn't have the expertise in the

Speaker 1 traditional types of discovery and company building, but actually they seem to apply very generally, right?

Speaker 1 I think this is a great segue to a question of, you know, you're open sourcing the data.

Speaker 3 Why do that?

Speaker 3 Yeah, so we generated the data at Vevo, and Vevo is a private venture-backed company, a startup.

Speaker 3 And so Johnny and I,

Speaker 3 when originally the idea of Tahoe came up, and Johnny told me, yeah, Nima, there is this opportunity. We can generate 100 million single-cell data points.
And I said, like, can we?

Speaker 3 And he said, yeah, yeah, we can. And I said, okay, let's go and do it.

Speaker 3 And I think it was within hours, when we were chatting, and, for transparency, Johnny, Hani, and I are co-founders of Vevo.

Speaker 3 When we were talking about it, we said, okay, let's do it and let's open source it. And

Speaker 3 why do we want to do that? Number one, we want to put a new stake in the ground. We want to show that there's a new game in town.
And really, it's possible to up our game as a community, as a field.

Speaker 3 And we wanted to show that so that people actually move on from, I don't know, a million single-cell data points, 100,000 single-cell data points, observational data, and up their game and actually go to a much more massive scale.

Speaker 3 So that's number one.

Speaker 3 Number two, we wanted to, so the DNA of our company is to be very, very small,

Speaker 3 a small team of superstars rather than, you know, hiring 100 people.

Speaker 3 Paradoxically, open sourcing actually allows us to do that.

Speaker 3 In a way, like we,

Speaker 3 I think we talked to Dave about Tahoe, it was like the night before the new year, like it was sometime between

Speaker 3 Christmas and New Year. And then Dave got really excited about it.
And then the whole, our team got excited about it.

Speaker 3 If there wasn't the open source aspect to it, it wouldn't have been as exciting, you know?

Speaker 3 Like the whole community is getting excited about playing with this data, telling us what's good about this data, what's not good about the data. And that basically allows us

Speaker 3 a team of like three, four people that we have in-house, allows us to keep it that way and basically bring the entire community of like-minded people who have the same mission, building virtual cells, to help us

Speaker 3 in this quest. And for us,

Speaker 3 the idea was we will remove the main bottleneck in doing that. And that's, I think, everybody has been saying, that's data.
And yeah.

Speaker 5 I think that the serendipity for this was, you know, Arc's all about mission-driven science and pushing science forward.

Speaker 5 And we were conceiving of creating what we're calling, and we're launching this week, the Arc Virtual Cell Atlas.

Speaker 5 And so the idea there is really, can we find high-quality curated data sets and put them out there in the world to accelerate virtual cell modeling?

Speaker 5 And then we started chatting and it was like, you got what?

Speaker 5 And it was kind of incredible. And so what we're actually assembling this week is this new atlas.
And so the star of the show, in some ways, is the Vevo Tahoe-100M data set.

Speaker 5 We're also augmenting that with observational data. So we've created something called scBaseCamp.

Speaker 5 And you can almost think of it like the Google crawler and index.

Speaker 5 So we've built this agent that goes onto the internet and basically mines public single-cell RNA sequencing data and then curates it in a very sort of uniform

Speaker 5 way and results in a very nice observational data set. It's about 230 million cells.
You add that to the 100 million cells from Tahoe-100M. You now have 330 million cells.

Speaker 5 And so this is a really exciting resource

Speaker 5 for scientists around the world who are interested in modeling at the cell level.
And it's just very complementary

Speaker 5 to have this observational data set that you could possibly pre-train a model on.

Speaker 5 And then the perturbational data set from Tahoe-100M allows you to then bring in those dynamics and make the model richer and more predictive.
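That two-stage recipe, pre-train on broad observational profiles and then adapt on perturbational before/after pairs, can be sketched at the level of training steps. Everything below (the network, the perturbation embedding, the losses) is a hypothetical skeleton for illustration, not the Arc or Vevo setup.

```python
# Hypothetical skeleton of the pre-train / adapt recipe described above.
import torch
import torch.nn as nn

N_GENES, N_PERT = 20_000, 1_200   # genes per profile; perturbation vocabulary

pert_emb = nn.Embedding(N_PERT + 1, 64)  # index 0 = "no perturbation"
net = nn.Sequential(nn.Linear(N_GENES + 64, 512), nn.ReLU(),
                    nn.Linear(512, N_GENES))
opt = torch.optim.Adam(list(pert_emb.parameters()) + list(net.parameters()),
                       lr=1e-4)
mse = nn.MSELoss()

def predict(profiles: torch.Tensor, pert_ids: torch.Tensor) -> torch.Tensor:
    """Expression profile + perturbation id -> predicted profile."""
    return net(torch.cat([profiles, pert_emb(pert_ids)], dim=-1))

def pretrain_step(profiles: torch.Tensor) -> None:
    """Stage 1 (observational data): reconstruct unperturbed profiles so the
    model learns the landscape of cell states."""
    ids = torch.zeros(len(profiles), dtype=torch.long)   # "no perturbation"
    loss = mse(predict(profiles, ids), profiles)
    opt.zero_grad(); loss.backward(); opt.step()

def adapt_step(before: torch.Tensor, pert_ids: torch.Tensor,
               after: torch.Tensor) -> None:
    """Stage 2 (perturbational pairs): learn response dynamics, i.e. how a
    given drug moves a profile."""
    loss = mse(predict(before, pert_ids), after)
    opt.zero_grad(); loss.backward(); opt.step()
```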

Speaker 4 We're super excited about AI agents for science at Arc, and I think across the community.

Speaker 4 I think the capabilities are still very early today, but I think we wanted to show an example of how it can do something really useful. I think it's very clear.

Speaker 4 now that basically all dry lab workflows are going to get automated with agents or with co-pilots. And

Speaker 4 this would ordinarily be the type of thing that a team of computational biologists would be slaving over.

Speaker 4 And our core insight was, well, the Sequence Read Archive is the largest

Speaker 4 sort of repository of all biological data from next generation sequencing.

Speaker 4 You get an NIH grant, for example,

Speaker 4 you sort of post all of this data online, or you publish in a journal, you put all this data as part of the journal publication. But this is extremely fragmented, poorly annotated, really sprawling.

Speaker 5 There are no requirements on your submission of data being uniform.

Speaker 4 Exactly. It's very messy.
And so, you know, we built this agent to basically crawl all of this data, collect it, organize it, process it, and in doing so, basically isolate and

Speaker 4 kind of remove a lot of the kind of batch effects or data biases of previous methods.

Speaker 6 Yeah, I mean, one thing I would add is that the reality is that these data sets have been generated over time, you know, going back a decade. So

Speaker 6 tools have changed, you know, versions of tools have changed, genome builds have changed.

Speaker 6 So by just taking processed data sets and collating together and collecting together, you are kind of infecting and contaminating your data with these analytical effects, batch effects.

Speaker 6 So our idea was...

Speaker 4 Foundational data sets for the entire field, right? People work with and interpret and write papers on top of all of this data.

Speaker 6 Yeah, so exactly. I mean, our idea was to at least remove that.
I mean, there's a lot of kind of technical experimental batch effects, but of course, over a span of this time,
I mean, there's a lot of kind of technical experimental batch effects, but of course, over a span of this time,

Speaker 6 chemistries of reagents have changed and all of that. But at least we do our part and remove the analytical component.
And we were actually surprised

Speaker 6 to what extent that was actually very

Speaker 6 observable in the data, and removing it was actually quite helpful in the end.

Speaker 3 Maybe just on the Vevo side, this whole idea of the infection of the data sets because of these massive batch effects, I like the phrase. Maybe, Johnny, you want to talk about how many people actually did the experiment? Like,

Speaker 3 yeah, like the

Speaker 2 Tahoe? Yeah, well, it ended up being actually four people from Vevo, and we did it, I think, over like three days in the end. Think about the leverage. That's kind of valuable. Yeah, yeah. You know why that's super important?

Speaker 3 It's because sometimes I ask, like, Hani and Johnny, I don't know, what does drug A do to cell line X?

Speaker 3 And there is this phrase that biologists use: "in our hands, it does so-and-so."

Speaker 3 And this is what, like, I mean, Dave, you tell me like we come from a different background.

Speaker 3 A computer scientist wouldn't say, like,

Speaker 3 "in my hands, this model actually works."

Speaker 6 It's my computer. In my computer.

Speaker 4 In my environment.

Speaker 3 It's kind of a thing.

Speaker 1 There is actually a parallel, but it's not great either.

Speaker 3 Yeah.

Speaker 3 Exactly. So I think that's the genius of what Johnny has built there as well, that this is actually done by very few hands.

Speaker 3 Automation is going to scale you to a certain level. You haven't even done much automation.
And so in that sense, the beauty of

Speaker 3 what Johnny designed in building Tahoe is exactly this: a few people, a few hands doing exactly consistent work,
doing 60,000 experiments.

Speaker 3 It's 100 million single-cell data points, but it's actually 60,000 drug-patient interactions, drug-cell line interactions.

Speaker 3 And so, having been done by four people, I think that just massively reduces the data set infection that Hani was talking about.

Speaker 1 So, a first-in-history opportunity for scientists and entrepreneurs to go work on this data set and create these virtual cell models. How do you tell the quality of one of these models?

Speaker 5 I mean,

Speaker 5 the core idea is, what's its predictive ability, right? And so, you know, you take a cell, you perturb it.

Speaker 5 You can do that either from a genetic perspective, you can suppress or

Speaker 5 upregulate genes, or apply drugs, and then you look at the response. And so the measure of the model is how well it predicts what we call the differentially expressed genes.

Speaker 5 The reality is today, the best models are very poor at this. Like the predictive ability of the DEGs, as we call them, is in the order of 10%.
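There is, as comes up next, no accepted benchmark yet, but one simple way to operationalize predictive ability on DEGs is top-k overlap between predicted and measured per-gene effects. A minimal sketch, with the cutoff k and the effect-size inputs as assumptions:

```python
# One illustrative DEG metric: overlap of predicted vs. measured top-k
# differentially expressed genes. Not an established benchmark.
import numpy as np

def deg_overlap(predicted: np.ndarray, measured: np.ndarray, k: int = 100) -> float:
    """Fraction of the measured top-k DEGs (ranked by |effect size|, e.g.
    log fold change) that also appear in the model's predicted top-k."""
    pred_top = set(np.argsort(-np.abs(predicted))[:k])
    meas_top = set(np.argsort(-np.abs(measured))[:k])
    return len(pred_top & meas_top) / k

rng = np.random.default_rng(0)
measured = rng.standard_normal(20_000)                   # per-gene effects
weak_model = measured + 3 * rng.standard_normal(20_000)  # a noisy predictor
print(deg_overlap(weak_model, measured))  # a weak model recovers only a small fraction
```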

Speaker 5 And one of the conjectures...

Speaker 1 Is there an accepted benchmark for this today?

Speaker 5 No, but actually, I think that's something else that the industry would benefit from. It's a good point.

Speaker 5 But if you think about where we want to go,

Speaker 5 one of our conjectures is that one of the reasons the models aren't doing well is not just simply model structure.

Speaker 5 We have a lot of rich structures that we understand in the ML space, machine learning space. The issue is the data quality.

Speaker 5 And so the hope is, with this new Arc Virtual Cell Atlas, with Tahoe-100M, that we now finally have a starting point where we can build rich models and get high predictive value out of these virtual cell models.

Speaker 5 So that's why this is really kind of an exciting moment in time.

Speaker 4 It might be worth just also speaking plainly. Why do we even care about virtual cell models? We have real cells, right? Why not just do experiments on those, right?

Speaker 4 And I think ultimately biology is very slow, right?

Speaker 4 You know, all of us in this room and many of you watching, have probably tried to pick up pipettes and move clear liquids from one tube to another and grow cells and make animals and deal with biology, which happens in real time, right?

Speaker 4 So, you know, this is a funny story. In the last year of my PhD, my

Speaker 4 advisor tried to convince me to start an aging project, right? Which would have involved, you know, aging animals for

Speaker 4 two years.

Speaker 4 You know, and that's sort of one experimental round. As you can imagine, I declined.

Speaker 3 I was like,

Speaker 4 may I please, sir, graduate?

Speaker 4 But that's actually what happens, right? And I think.

Speaker 1 It's actually just our labor retention plan.

Speaker 3 Right, right. You're constrained by biological time, which is like completely crazy to me coming from an engineering background.

Speaker 1 Yeah. And really important to tons of fields, like neurodegeneration or anything else that takes time to progress.

Speaker 4 Yeah. So, you know, the sort of massively parallelized in silico simulation sounds great, but it needs to be accurate.
If it's 10% accurate, you're just simulating noise, right?

Speaker 4 And so, you know, how do we go from a discipline that primarily respects experiments today to something more like physics where theory drives a lot of progress?

Speaker 4 And I think these virtual cell models are a core wedge in making that happen.

Speaker 1 Well, can you actually make that more concrete then? Like if these virtual cell models work, and we don't even know how to measure them yet because they don't exist in any way that's productive today.

Speaker 1 But

Speaker 1 if they should, then what will

Speaker 1 scientists or the biotech field or patients like expect to gain?

Speaker 3 Maybe I can talk from a drug discovery perspective, and then the Arc folks can take the more scientific viewpoint.

Speaker 3 So, what we are focused on at Vevo is to predict how a new chemical entity interacts with cells from different patients, different patient models. That really is the core of it.
So,

Speaker 3 Patrick was talking about in silico simulation of this. Can I predict in a computer this new chemical structure? Drugs are chemical structures, by the way.

Speaker 3 I hope you won't be surprised by that.

Speaker 3 Whether this chemical structure is going to take the diseased cell, like a cancer cell, from a diseased state to a healthy state, or, in the case of cancer, actually kill it, literally. If I can predict that, then my ability in designing new chemicals that do that effectively, that kill the cancer cell but don't kill the healthy cells, et cetera,

Speaker 3 that increases massively. And that's what we want to do, and literally that's the kind of data we are generating to train those kinds of models. Anything to add?

Speaker 2 Yeah, I completely agree. I mean, a big part of our future vision and roadmap is that we think there will be a moment where, from a virtual cell model, a drug is spit out, and basically the drug will actually cause a diseased cell to become a healthy cell again. I think that's kind of the goal, and that will reshape how we do any kind of drug discovery.

Speaker 6 One thing I will add there is that there are two dimensions of generalizability to think about. One is basically the cell dimension, and then the chemical dimension.

Speaker 6 On the cell side, you know, every disease is unique. There are similarities.

Speaker 6 There are chunks of cancer mutations and all of that that drives the disease, but there are also very much individual variations.

Speaker 6 And you can observe cells from patients, but you cannot do, for every patient,

Speaker 6 for every tumor that arises, what

Speaker 6 these folks do in Mosaic. So the idea is that

Speaker 6 using a virtual cell model, you can take those learnings and then apply them to all of these new observations that you can make in patients. So that's one dimension.
The other dimension is chemicals.

Speaker 6 In silico libraries, you have like tens of millions of compounds and biologics, infinite biologics, if you really put your mind to it.

Speaker 6 But most of these have never existed and will never exist because there's no use for them.

Speaker 6 So, a model that can traverse that really massive space of chemistry to find which part of this you actually need to pay attention to and go and synthesize and check

Speaker 6 will be massively enabling, because everyone else has libraries that are well-behaved, a couple of hundred thousand compounds, and they use fragments and try to put them together.

Speaker 6 So, the process of kind of how

Speaker 6 folks design drugs is this slow screening process. And this will allow us to kind of really leapfrog that

Speaker 6 entire pipeline.

Speaker 4 90% of drugs fail in clinical trials. So we're pretty bad at making drugs, right? And I think that implies two things.

Speaker 4 The first is maybe our drug matter is not very good, in the sense that its potency, its ability to bind the target, its kind of toxicity, its pharmacokinetic profiles, all of those things, right?

Speaker 4 Sort of ADMET, you know, these types of things are not optimal. The other is we're probably drugging the wrong target, right?

Speaker 4 And I think, you know, the sort of idea of these virtual cell models is that you'll be able to significantly cut down the search space of what the right target is.

Speaker 4 And then you can actually, you know, really focus your time on making the right chemical or you know, kind of kind of chemical matter drug composition to actually make the right types of changes in the right types of cells.

Speaker 4 That's why mechanism and drug discovery are so tightly interwoven and that's really what we need these models to help accelerate.

Speaker 3 This is super important, because this is actually the gist of

Speaker 3 why we need virtual cells in addition to these protein language models that everybody has been talking about.

Speaker 3 I think I said it before that protein language models speak the language of structural biology.

Speaker 3 What does a protein structure look like? And how does it fold?

Speaker 3 How does it interact with

Speaker 3 a small molecule structure? Exactly. Or how does an antibody bind to another protein?

Speaker 3 This is a binding question. You are binding in the sense that you are trying to see whether one chemical binds to another chemical.

Speaker 3 But biology is more complex. And again, I'm a computational chemist.
I'm a quantum chemist.

Speaker 3 I wished, and actually I bet my PhD on, building quantum mechanical models that, from a physics-based perspective, go and simulate these kinds of bindings. But again, it turns out biology is a lot more

Speaker 3 complex.

Speaker 3 There is a context to that protein target that we are trying to hit.

Speaker 3 It's part of a cell. The cell is part of a, for cancer, it's part of a tumor.
The tumor is part of a broader biological system.

Speaker 3 So virtual cells, in my opinion, are going to allow us to go beyond the language of structural biology and venture into the language of systems biology and understand how

Speaker 3 the drug is interacting with the broader biological system, rather than simply just one target, which we are basically cracking the code on already with protein language models.

Speaker 1 Well, then I have a higher level systems question. We're at single cell.
Like what about multi-cell and aggregates and organelles? And

Speaker 1 is all that going to be possible in the future?

Speaker 5 Yes. I mean I think like the first thing on the virtual cell

Speaker 5 direction, or any modeling, is: what's the right level of abstraction? And so I think our belief around the room is that the right level of abstraction is at the sort of transcriptomic level, because you have these very complex gene pathways.

Speaker 5 And so whenever a cell is reacting to its environment, that will be reflected, and is reflected, in the transcriptome. So

Speaker 5 I think that's the first question, even within a cell, what's the right abstraction?

Speaker 5 And so, you know, if you think about a cell, it's like this very exquisite piece of machinery.

Speaker 5 And like, you know, you could make it an arbitrarily complex model, but we believe this sort of genetic level is the right level to model.

Speaker 5 I think going beyond that, yeah, you can create very advanced models. I think you see people doing spheroids and organoids.

Speaker 5 So you take mixtures of cells and run them together and you try to simulate, say, cardiac tissue or brain tissue. What's really interesting is, you know, maybe you have an organoid with 20,000 cells.

Speaker 5 You can then still apply these techniques that we're talking about, like take these drug perturbations and apply them to these cells or these genetic perturbations and look at the responses.

Speaker 5 And so what's happening now is you're going beyond a single cell, but you're sort of getting the intercellular dynamics captured as well in the models.

Speaker 5 But I think it just naturally ladders up from single cell through to these sort of more multi-cell.

Speaker 3 One

Speaker 3 touch, one small comment on that one is that it is a single cell that we are modeling, but that context dependency also captures a lot of the effects that arise from the environment.

Speaker 3 So, the models that we have are actually spheroid models in this specific experiment for Tahoe. But we also have in vivo models, we have humanized mice that, you know, capture some of the immune system of the mouse.

Speaker 3 So in a way, yes, you are simulating, you're building an in silico model of a cell, but if a model is any good, it can simulate it in different biological contexts, in the presence of this kind of immune environment, in the presence of, I don't know, in this kind of a tumor versus this other kind of tumor, in the presence of this mutation versus this other mutation.

Speaker 3 So we call it single cell. But the whole idea of having so many single cell data points is that you have it in different contexts.

Speaker 3 Yeah, yeah, that seems like a really important nuance there.

Speaker 6 Yeah, the information of the environment is filtered through the cell. So if you're observing the cell with enough resolution, you can even predict it.
It should be represented in the model.
It should be represented in the model.

Speaker 4 You can also add spatial data.

Speaker 3 Oh, yeah, definitely.

Speaker 1 Okay, I have a few hot take questions to end with.

Speaker 1 Nima, we'll start with you, because we were having a passionate discussion about why it was really important to you that Vevo be a platform company versus a single-hypothesis company, like 99.9% of biotechs out there.

Speaker 1 What is the difference?

Speaker 3 I think the difference is

Speaker 3 the kind of team you build and the ambition that you have.

Speaker 3 A single-hypothesis company is basically

Speaker 3 the idea that

Speaker 3 the human being the foundation model that Hani was talking about: we come up with a hypothesis. And then we go test the hell out of it in different kinds of experiments.

Speaker 3 And we basically are very heavily incentivized to, we, I mean, like a company that's built on that hypothesis, they're very heavily incentivized to make that hypothesis work.

Speaker 3 What you see actually in biotech a lot of times is that you take a drug to the clinic after you have tested it on three different patient samples, you know.

Speaker 3 If you actually are a platform company, what that means is that you're trying to have enough hypotheses, and to have such

Speaker 3 a hypothesis-free way of generating new hypotheses that you're not wedded to any one hypothesis, and therefore it allows you to be actually a lot more scientific in your quest for new drugs or new targets to

Speaker 3 treat disease. I think that's the core of it. And we had a lot of hypotheses initially to go after and just build, you know, a one-asset, two-asset company.

Speaker 3 But we decided to make it a platform company because it allows us to be a lot more rigorous in terms of what we actually decide to take to the clinic.

Speaker 1 There has been a lot of news recently on

Speaker 1 a different question, which is the rise of Chinese biotechs.

Speaker 1 For the core members of the research community here, is that a threat? How do you think of it?

Speaker 4 Well, their cost basis is definitely more competitive.

Speaker 3 I think

Speaker 4 a lot of the discussion around the water cooler in the biotech and pharma industry is how are they able to do it at this pace? How are they able to do it at this cost?

Speaker 4 Why do their data packages look so good? They have safety, they have tox, they have all these IND-enabling studies. You know, it's really competitive.

Speaker 4 And I think folks got really surprised at the efficiency of the pipelining and the ability to manufacture all these different antibodies primarily. And I think that's great for the industry, right?

Speaker 4 I think everybody, including patients, investors, you know, the biotech companies themselves want lower cost basis, right? We want

Speaker 4 the ability to actually make molecules that work faster.

Speaker 4 And I think all these things will, you know, kind of compete in the system to be able to reduce the right now, like pretty high cost basis of doing these things, you know, stateside, right?

Speaker 4 Well, I think one of the core challenges right now is we have a wide array of services and CROs and contract research collaborators that you can try to chain together.

Speaker 4 There's, you know, kind of previously the virtual biotech was a concept that was very much in fashion, right?

Speaker 4 Folks found out just in reality, when you try to do this, even though it looks really good on paper, it's incredibly slow, right?

Speaker 4 So then folks tried the other way, which is let's just fully vertically integrate and just own everything. Well, that was incredibly expensive, right?

Speaker 4 And obviously the answer is maybe more goldilocks in the middle. We need really competent vendors and CROs that understand the drug discovery and development process.

Speaker 4 Then we need the kind of individual companies to be able to run in a really capital efficient and lean way.

Speaker 4 And I think the industry is trying to reshape around these changes right now to figure out the right way to build startups, the right way to build drugs.

Speaker 3 Yeah, I think I totally agree.

Speaker 2 I think it's an important moment. I think one thing that I haven't seen is that we actually acknowledge it.
Like it just kind of hit us in the face.

Speaker 2 And I think it's because

Speaker 2 I think the U.S. is the innovation hub, but I think we need to basically be more intentional about that in biotech.
I think you see innovation in tech. I think you see that as kind of the mantra.

Speaker 2 I think innovation in biotech has actually been viewed as kind of the things that the Chinese CROs and companies are good at. I think what we're finding out is like that's not actually innovation.

Speaker 2 And so

Speaker 2 my hypothesis is that the kinds of things we're working on, we're really putting big data and AI into kind of the first layer of how we do biology.

Speaker 2 That's what innovation should look like in our space. And if we don't as a community push that forward, we're not going to have that innovation in the industry.

Speaker 3 And Johnny's saying it slapped us in our face, like it caught us by surprise.

Speaker 3 But actually, one of the first conversations that Johnny and I had three years ago, when we were thinking about starting Vevo, was actually Johnny

Speaker 3 was actually telling me about this thing that's happening in China as well and this whole thesis around commoditization of a lot of the things that we think are so massively important, you know, like molecular design, et cetera, et cetera.

Speaker 3 So I think in that sense,

Speaker 3 I do agree. And I think

Speaker 3 there is two ways to do it, like by regulatory capture, try to lobby the government and everything to put a limit on how much we can interact with the Chinese companies.

Speaker 3 Here's the other way, make it part of our ecosystem and change our thinking about business models, the way we build our teams, to Patrick's point.

Speaker 3 Do we build a fully integrated team with $100 million in the bank, or a small 14-person team like we are at Vevo? I think these are the kind of things we should be thinking about.

Speaker 3 And actually, I want to make this into a bigger statement that's a little more Reagan-esque.

Speaker 3 I think it's morning in bio, in the sense that we should be playing a different kind of game here.

Speaker 3 And if you want to stick to the same old-school way of doing things, it's not going to work. What's the old-school way? It's a lot of planning.

Speaker 3 You know, we were texting about this with Dave a couple of days ago.

Speaker 3 If I had a penny for every time some massive organization announced an extraordinary, impressive thing and said, oh, we are going to give it to you in three to five years,

Speaker 3 honestly, I would be super rich right now.

Speaker 3 This is the ethos in bio. You announce this massive thing and you say you're going to do it in three to five years.
No. I think it's time; we have the tools.

Speaker 3 It's time to build, and it's time to do it right now. That's how Evo 2 actually gets created in a matter of months, from the first Evo paper to what just happened.

Speaker 3 That's the way Tahoe gets created.

Speaker 3 The second piece is small, super-focused teams of superstars. The massive organizations, the vertically integrated ones, it's not just the capital intensity; they're actually very inefficient too.

Speaker 3 They go very slowly. They get bogged down in a lot of bureaucracy.

Speaker 3 And I think the third piece is the naysaying. Again, for everything you want to do in bio, there are a lot of very strong biologists who will tell you why it's not going to work.

Speaker 3 Yeah. I think that has to change.
We have to change it. We have to think very differently about this.
We have to try things out. And now we have the tools to do it.

Speaker 4 On this last point, when I talk to pharma CEOs, they'll say, oh, AI and drug discovery, very interesting, you know, but you know what?

Speaker 4 I actually don't spend that much of my top-line budget on drug discovery. Most of it is wrapped up in clinical development.

Speaker 4 And so a lot of them are actually much more excited about things like natural language workflows to summarize clinical trial documents, right?

Speaker 4 Which are, you know, these massive regulatory filings. Summarize them, make them easier to write and easier to read. And, you know, just more normal AI, like stratifying cohorts.

Speaker 3 Yeah.

Speaker 4 Yeah.

Speaker 1 And reducing costs in that part of the cycle.

Speaker 4 And I think the thing that they're going to see, as these models get better and virtual cell models actually help you find the right target, so you can point the cannon in the right direction and measure twice, cut once, is that the cost basis for the industry will go down and the accuracy should go up.

Speaker 1 I'm really glad both of you actually just brought up the naysayers, because if you weren't going to, I was going to. I think I have now been pitched

Speaker 1 AI-for-biotech companies for at least a decade, right? And we haven't seen lots of results. I mean, there's also just the natural life cycle of bringing treatments to market.

Speaker 1 So let's say you generally need 11-plus years. But

Speaker 1 if you were going to leave a broader audience with a single claim about why this works now, what would it be? Obviously there were different approaches; a decade ago, it might have been computer vision and consumer-scale sequencing data, right?

Speaker 1 Why should this work now? And when should we actually begin to see treatments from these machine learning approaches?

Speaker 5 I mean, I go back to analogies in the machine learning space. We called them artificial neural networks for a long, long time.

Speaker 5 And then people would get all wrapped up around, oh, this perceptron can't model an exclusive OR gate or whatever.

Speaker 5 Perceptron, what is this?

Speaker 3 1990s?

Speaker 3 Exactly.

Speaker 5 And it sort of just, you know, bounced around for a while.

Speaker 5 And it wasn't until we had increases in compute, increases in data, and then, you know, more sophisticated models that you hit these nonlinear inflection points, right?

Speaker 5 And I mentioned earlier the ImageNet moment in 2009. What happened there was that it drove the development of convolutional neural networks.

Speaker 5 I think AlexNet was the model that really showed the way. And before that, you know, we would think, oh, only humans can recognize images at high quality.
A computer will never do it.

Speaker 5 And of course, now we know computers can do it, and do it better than humans. And so I think it's the same thing in AI and biology.

Speaker 5 And coming into this relatively new, when I see the capability of single-cell sequencing, it's kind of mind-blowing if you're not a biologist: this idea that, at single-cell resolution, we can look at how a cell's expression is changing over time. It's incredible.
You take that, and then you take the ability to generate lots of data around that.

Speaker 5 And then you take these much more sophisticated models and model training. And suddenly, things are happening.

Speaker 5 Like if you look at the Evo 2 model, we trained it on 9.3 trillion nucleotides, but we didn't tell it anything about DNA.

Speaker 5 We just said, here's a lot of the DNA on the planet, you know, every single piece of DNA we could get hold of. And then what did the model learn? It started learning all sorts of things.

Speaker 5 Like, it knows where ribosome binding sites are. It knows where codon degeneracy is.
And then one of the things we showed is that it can actually predict

Speaker 5 the pathogenicity of BRCA1 variants, which are known to drive breast and ovarian cancer.

Speaker 5 And it does that with an area under the ROC curve of something like 0.94, if I recall, looking at Hani. I mean, this is incredible, and we never taught it anything. It just learned this stuff zero-shot. And so I think we're at that point of inflection now. I think all of us would

Speaker 5 agree that we're at that point in time now where we're going to see that inflection. And it's going to be the data, right? That's going to be the difference between where we were yesterday and where we are starting this week.

Speaker 5 It's going to be the data.
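
For readers who want to see the shape of this zero-shot evaluation: score each variant by how much the model's log-likelihood drops when the variant is introduced, then measure how well that score ranks pathogenic above benign variants (area under the ROC curve). The sketch below is hypothetical: `model_log_likelihood` is a random placeholder standing in for a real sequence model such as Evo 2, and the sequences and labels are fabricated purely for illustration.

```python
# Sketch of zero-shot variant-effect scoring with a DNA language model.
import numpy as np

def model_log_likelihood(seq: str) -> float:
    # Placeholder: a real implementation would sum the model's
    # per-nucleotide log-probabilities over the sequence window.
    rng = np.random.default_rng(abs(hash(seq)) % (2**32))
    return float(rng.normal(-1.0, 0.1) * len(seq))

def variant_score(ref_seq: str, alt_seq: str) -> float:
    # Larger drop in likelihood => more disruptive => more pathogenic.
    return model_log_likelihood(ref_seq) - model_log_likelihood(alt_seq)

def auroc(labels: np.ndarray, scores: np.ndarray) -> float:
    # Probability that a random pathogenic variant outscores a random
    # benign one (rank-based AUROC, ties counted as half).
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Fabricated example: 1 = pathogenic, 0 = benign.
labels = np.array([1, 1, 1, 0, 0, 0])
refs = ["ACGTACGT"] * 6
alts = ["ACGTACGA", "ACGAACGT", "ACCTACGT", "ACGTACGC", "AGGTACGT", "ACGTTCGT"]
scores = np.array([variant_score(r, a) for r, a in zip(refs, alts)])
print(f"AUROC = {auroc(labels, scores):.2f}")
```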

Speaker 4 So we're somewhere between GPT-1 and GPT-4 in biology, right? Where do you guys think we are?

Speaker 6 I'm like, I'm more like two.

Speaker 2 We're like developing GPT-2, but we're like, we don't have enough data, guys. We need more data.

Speaker 3 I think if you actually go a little deeper and talk about different domains, then in protein models, we are past GPT-3.

Speaker 3 When it comes to single-cell models and virtual cell models, I think we're at GPT-1 to 2 right now, and closer to GPT-1 than 2.

Speaker 1 That's a pretty exciting timeline, though, if you just take the progress and the pace of progress in other domains and apply it here.

Speaker 6 But I think the difficulty is exactly what you said: with GPT-4,

Speaker 6 you immediately knew what you had. But if we hit GPT-4 for

Speaker 6 cell state models, for example, for drug discovery, as you said, it will take some time to actually prove that point.

Speaker 6 And I think the law of small numbers always takes hold in drug discovery, right? You know, a platform that takes your success rate from 10% to 30%

Speaker 6 is amazing, but still, it's only 30%.

Speaker 6 You need to get lucky

Speaker 5 in the drug development cycle, which is on the order of 10 years. So you still have to wait for that to prove itself out.

Speaker 4 To slowly go up in AUROC, over a 10-year rolling window.
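
To make the law-of-small-numbers point concrete: even tripling the per-program success rate leaves any single program more likely than not to fail, which is why the proof takes a portfolio and a decade. A back-of-the-envelope sketch with illustrative numbers (not from the episode):

```python
# If each program succeeds independently with probability p, the chance
# of at least one success among n programs is 1 - (1 - p) ** n.
p_old, p_new = 0.10, 0.30

for n_programs in (1, 3, 10):
    at_least_one_old = 1 - (1 - p_old) ** n_programs
    at_least_one_new = 1 - (1 - p_new) ** n_programs
    print(f"{n_programs:2d} programs: P(>=1 success) "
          f"{at_least_one_old:.0%} -> {at_least_one_new:.0%}")
# 1 program:  10% -> 30%
# 3 programs: 27% -> 66%
# 10 programs: 65% -> 97%
```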

Speaker 1 Right. Although, if we're going to have six optimists here, then I will say: we're systems people, and we're just going to treat it as a system. And if this was a terribly debilitating bottleneck at the beginning, then hopefully it's a breakthrough. I think that's a great note to end on. Hani, Dave, Patrick, Nima, and Johnny, thank you so much for doing this, and congratulations. It's the data.

Speaker 1 Find us on Twitter at @NoPriorsPod. Subscribe to our YouTube channel if you want to see our faces.
Follow the show on Apple Podcasts, Spotify, or wherever you listen.

Speaker 1 That way, you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.