The Power of Quality Human Data with SurgeAI Founder and CEO Edwin Chen

32m
In the generative AI revolution, quality data is a valuable commodity. But not all data is created equally. Sarah Guo and Elad Gil sit down with SurgeAI founder and CEO Edwin Chen to discuss the meaning and importance of quality human data. Edwin talks about why he bootstrapped Surge instead of raising venture funds, the importance of scalable oversight in producing quality data, and the work Surge is doing to standardize human evals. Plus, we get Edwin’s take on what Meta’s investment into Scale AI means for Surge, as well as whether or not he thinks an underdog can catch up with OpenAI, Anthropic, and other dominant industry players.

Sign up for new podcasts every week. Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @echen | @HelloSurgeAI

Chapters:

00:00 – Edwin Chen Introduction

00:41 – Overview of SurgeAI

02:28 – Why SurgeAI Bootstrapped Instead of Raising Funds

07:59 – Explaining SurgeAI’s Product

09:39 – Differentiating SurgeAI from Competitors

11:27 – Measuring the Quality of SurgeAI’s Output

12:25 – Role of Scalable Oversight at SurgeAI

14:02 – Challenges of Building Rich RL Environments

16:39 – Predicting Future Needs for Training AI Models

17:29 – Role of Humans in Data Generation

21:27 – Importance of Human Evaluation for Quality Data

22:51 – SurgeAI’s Work Toward Standardization of Human Evals

23:37 – What the Meta/ScaleAI Deal Means for SurgeAI

24:35 – Edwin’s Underdog Pick to Catch Up to Big AI Companies

24:50 – The Future Frontier Model Landscape

26:25 – Future Directions for SurgeAI

29:29 – What Does High Quality Data Mean?

32:26 – Conclusion


Transcript

Speaker 2 Hi, listeners. Welcome back to No Priors.

Speaker 2 Today Elad and I are here with Edwin Chen, the founder and CEO of Surge, the bootstrapped human data startup that surpassed a billion in revenue last year and serves top-tier clients like Google, OpenAI, and Anthropic.

Speaker 2 We talk about what high-quality human data means, the role of humans as models become superhuman, benchmark hacking, why he believes in a diversity of frontier models, the Scale-Meta deal, and why there's no ceiling on environment quality for RL or the simulated worlds that labs want to train agents in.

Speaker 2 Edwin, thanks for joining us.

Speaker 1 Great, great seeing you guys today.

Speaker 2 Surge has been really under the radar until just about now.

Speaker 2 Can you give us a little bit of color on sort of scale of the company and what the original founding thesis was?

Speaker 1 So we hit over a billion in revenue last year.

Speaker 1 We are kind of the biggest human data player in the space, and we're a little over 100 people.

Speaker 1 And our original thesis was, we just really believed in the power of human data to advance AI.

Speaker 1 And we just had this really big focus from the start of making sure that we had the highest quality data possible.

Speaker 3 Can you give people context for how long you've been around, how you got going, et cetera? I think, again, you all have accomplished an enormous amount in a short period of time.

Speaker 3 And I think, you know, you've been very quiet about some of the things you've been doing.

Speaker 3 So it'd be great to just get a little bit of history and, you know, when you started, how you got started, and how long you've been around.

Speaker 1 Yeah. So we've been around for five years.
I think we just hit our five-year anniversary. So we started in 2020.
To give some of the context: before that, I used to work at Google, Facebook, and Twitter.

Speaker 1 The reason we started Surge was that I worked on ML at a bunch of these big companies.

Speaker 1 And just a problem I kept running into over and over again was that it really was impossible getting the data that we needed to train our models.

Speaker 1 So it was just this big blocker that we faced over and over again. And there was just like so much more that we wanted to do.

Speaker 1 Like even just the basic things that we wanted to do, we struggled so hard to get the data. It really was the biggest blocker.

Speaker 1 But then simultaneously, there were all these more futuristic things that we wanted to build.

Speaker 1 Like if we thought of the next generation of AI systems: if we could barely get the data that we needed at the time just to build a simple sentiment analysis classifier, how would we ever

Speaker 1 advance beyond that? So that really was the biggest problem. I can go into more of that, but that was what we faced.

Speaker 3 And then you guys are also known for having bootstrapped the company versus raising a lot of external venture money or things like that.

Speaker 3 Do you want to talk about that choice in terms of going profitable early and then scaling off of that?

Speaker 1 In terms of why we didn't raise, I think a big part of it was obviously just that we didn't need the money. I think we were very, very lucky to be profitable from the start.

Speaker 1 So we didn't need the money. It always felt weird to give up control.

Speaker 1 And one of the things I've always hated about Silicon Valley is that you see so many people raising for the sake of raising.

Speaker 1 Like I think one of the things that I often see is that a lot of founders that I know, they don't have some big dream of building a product that solves some idea that they really believe in.

Speaker 1 Like if you talk to a bunch of YC founders or whoever it is, like what is their goal?

Speaker 1 It really is to tell all their friends that they raised $10 million and to show their parents they got a headline on TechCrunch. Like that is their goal.
Like I think of like my friends at Google.

Speaker 1 They often tell me, oh yeah, I've been at Google or Facebook for 10 years and I want to start a company. I'm like, okay, so what problem do you want to solve? And they don't know.

Speaker 1 They're like, oh, yeah, I just want to start something new. I'm bored.
And it's weird because they can like pay their own salaries for a couple of months.

Speaker 1 Again, they've been in Google and Facebook for 10 years. They're not just like fresh out of school.

Speaker 1 They can pay their own salaries, but the first thing they think about is just going out and raising money.

Speaker 1 And I've always thought it weird because they might try talking to some users and they might try building an MVP, but they kind of just do it in this throwaway manner where the only reason they do it is to check off a box on a startup accelerator application.

Speaker 1 And then they'll just pivot around these random product ideas. And they'll happen to get a little bit of traction so that the VC DMs them.

Speaker 1 And so they spend all their time tweeting and they go to these VC dinners. And it's all just so that they can show the world that they raise a big amount of money.

Speaker 1 And so I think raising immediately always felt silly to me. Like everybody's default is just immediately raise, but you got to think about it from first principles.

Speaker 1 Like if you didn't know how Silicon Valley worked, if you didn't know that raising was a thing, like why would you do that?

Speaker 1 Like what is money really going to solve for 90% of these startups where the founders are lucky to have some savings?

Speaker 1 I really, really think that your first instinct should be to go out and build whatever you're dreaming of.

Speaker 1 And sure, if you ever run into financial problems, think about raising money then, but don't waste all this effort and time up front when you don't even know what you'd do with it.

Speaker 3 Yeah, it's funny. I feel like I'm one of the few investors that actually tries to talk people out of fundraising often.

Speaker 3 Like, I actually had a conversation today where the founder was talking about doing a raise, and I'm like, why? You know, you don't have to, you can maintain control, et cetera.

Speaker 3 And then the flip side of it is I would actually argue outside of Silicon Valley, too few people raise venture capital when the money can actually help them scale.

Speaker 3 And so I feel like in Silicon Valley, there's too much, and outside of Silicon Valley, there's too little. So it's this interesting,

Speaker 3 you know, spread of different models that sort of stick.

Speaker 2 Edwin, what would you say to founders who feel like there's some external validation necessary to, especially, hire a team or scale their team? This is a very common complaint or rationale for going and raising more capital.

Speaker 1 I think about it in a couple of ways. I guess it depends on what you mean by external validation. In my mind, I often think about things from the perspective of: are you trying to build a startup that's actually going to change the world? Do you have this big thing that you're dreaming of? And if you have this big thing that you're dreaming of, why do you care about that kind of validation?

Speaker 3 Maybe the way to think about it is in Sarah's context, like if you haven't, say you're a YC founder, you haven't been at Google, you haven't been at Meta, you haven't been at Twitter, you don't have this network of engineers, you're a complete unknown, you haven't worked with very many people, you're straight out of school, how do you then attract that talent?

Speaker 3 And to your point, you can tell a story of how you're going to build things or what you're going to do, but it is a harder obstacle to convince others to join you or come on board, or to have money to pay them, if you don't have a long work history.

Speaker 3 So I think maybe that's the point Sarah's making.

Speaker 1 Yeah. So I think I would differentiate between maybe two things. One is: do you need the money?

Speaker 1 So first of all, there's a difference between people who are, yeah, literally fresh out of school, or maybe have never gone to school in the first place.

Speaker 1 And so maybe they don't have any savings. And so they literally need some money in order to live.

Speaker 1 And then there's others who, okay, let's assume that you don't necessarily need money because, again, you've been working at Google or Facebook for 10 years, or five years, whatever it is; you have some savings.

Speaker 1 So I would say the path kind of differs depending on which of those two scenarios you're in.

Speaker 1 But I think one of the questions is, well, do you really need to go out and hire all these people?

Speaker 1 Like one of the things I often see, again, like, I'm curious what you guys see, but one of the things I often see is founders will tell me, like,

Speaker 1 okay, so I'm trying to think about the first few hires I'm going to make. And they're like, yeah, I'm going to hire a PM.
I'm going to hire a data scientist.

Speaker 1 Yeah, these are one of my first five to 10 hires. I'm like, what? Like, this is just wild to me.
Like, I would never hire a data scientist as one of the first few people in a company.

Speaker 1 And I'd say this because I used to be a data scientist.

Speaker 1 Like data scientists are great when you want to optimize your product by 2% or 5%, but that's definitely not what you want to be doing when you start a company.

Speaker 1 You're trying to swing for 10X or 100X changes, not worrying about the nitty-gritty; small percentage points are just noise anyway. And it's similar with product managers.

Speaker 1 Like product managers are great when your company gets big enough, but at the beginning, you should be thinking yourself about what product you want to build.

Speaker 1 And your engineer should be hands-on. They should be having great ideas as well.

Speaker 1 And so product managers are kind of this weird conception that big companies have for when your engineers don't have time to be in the weeds on the details and try things themselves.

Speaker 1 And it's not a role that you would come up with if you hadn't heard of it before.

Speaker 3 So I guess with the initial Surge team, it sounds like you had sort of a small, tight initial engineering team. You guys started building product.
You were bootstrapping off of revenue.

Speaker 3 You know, at this point, you're at over a billion dollars in revenue, which is amazing.

Speaker 3 How do you think about the future of how you want to shape the organization, how big you want to get, the different products you're launching and introducing?

Speaker 3 Like what do you view as sort of the future of surge and how that's all going to evolve?

Speaker 2 Before we do that, can you just explain, at whatever level of detail makes sense here, what the billion dollars of revenue is?

Speaker 2 Maybe how the product supports the company, and who your data workers, your humans, are, because I think there's just very little visibility into all of that.

Speaker 1 So in terms of what our product is, I mean, at the end of the day, our product is our data.

Speaker 1 Like we literally deliver data to companies, and that is what they use to train and evaluate their models.

Speaker 1 So imagine you're one of those frontier labs and you want to improve your model's coding abilities.

Speaker 1 What we will do on our end is gather a lot of coding data, and this coding data may come in different forms. It may be SFT data, where we are literally writing out coding solutions. It may be unit tests: these are the tests that a good piece of code must pass. It may be preference data: here are two pieces of code, or two coding explanations; which one is better? Or these might be verifiers: here's a web app that I created, and I want to make sure that in the top right corner of the screen there's a login button, or that when you click this button, something else happens.

Speaker 1 Like there's a bunch of different forms that this data may take. But at the end of the day, what we're doing is we're delivering data that will basically help the models improve on these capabilities.

Speaker 1 Very, very related to that is this notion of evaluating the models. Like you also want to know, yeah, is this a good coding model? Is it better than this other one?

Speaker 1 What are the areas in which this model is weak and that model is worse? Like what insights can we get from that? And so in addition to the data, oftentimes we're delivering insights to our customers.

Speaker 1 We're delivering loss patterns. We're delivering failure modes.

Speaker 1 So there may be a lot of other things related to the data, but at the end of the day, it's this universe of applications around the data that we deliver, and that is our product.
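
To make the "verifier" idea Edwin describes above concrete, here is a minimal, hypothetical sketch of a programmatic check that a generated web app contains a login button and that could serve as a binary reward. It is not Surge's actual tooling; the HTML snippet, class name, and pass/fail criterion are illustrative assumptions.

```python
# Hypothetical sketch of a verifier: a small programmatic check that scores a
# model-generated web app. Names and criteria are illustrative assumptions.
from html.parser import HTMLParser

class LoginButtonFinder(HTMLParser):
    """Scans HTML for a <button> whose attributes or text suggest a login button."""
    def __init__(self):
        super().__init__()
        self.in_button = False
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "button":
            self.in_button = True
            if any(value and "login" in value.lower() for _, value in attrs):
                self.found = True

    def handle_endtag(self, tag):
        if tag == "button":
            self.in_button = False

    def handle_data(self, data):
        if self.in_button and "log in" in data.lower():
            self.found = True

def verify_has_login_button(html: str) -> bool:
    """Binary reward: does the generated page contain a login button?"""
    finder = LoginButtonFinder()
    finder.feed(html)
    return finder.found

if __name__ == "__main__":
    page = '<html><body><header><button id="login-btn">Log in</button></header></body></html>'
    print(verify_has_login_button(page))  # True
```

In practice a real verifier would likely render the page and exercise the click behavior Edwin mentions; this sketch only illustrates the shape of a check that turns a quality requirement into an automatic signal.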

Speaker 2 Yeah. And maybe going back to Elad's question, maybe product isn't actually the right word here, but what's repeatable about the company, or what are the core capabilities that you guys have where you would say your competitors fail to meet the mark?

Speaker 1 The way we think about the company, and the way we differentiate from others, is that a lot of other companies in this space are essentially just body shops. What they are delivering is not data; they are literally just delivering warm bodies to companies.

Speaker 1 And so what that means is that, at the end of the day, they don't have any technology. One of our fundamental beliefs is that quality is the most important thing: is this high-quality data, is this a good coding solution, is this a good unit test, is this mathematical problem solved correctly, is this a great poem?

Speaker 1 And basically a lot of companies in this space, just as a relic of how things have worked out historically, have treated quality and data as a commodity.

Speaker 1 One of the ways we often think about it is: imagine you're trying to draw a bounding box around a car. Sarah, you and I are probably going to draw the same bounding box. Ask Hemingway and ask a second grader; at the end of the day, we're all going to draw the same bounding box. There's not much difference that we can add.

Speaker 1 So there's a very, very low ceiling on the bar of quality. But then take something like writing poetry.
Well, I suck at writing poetry.

Speaker 1 Hemingway is definitely going to write a much better poem than I am. Or imagine, I don't know, a VC pitch deck.

Speaker 1 You're going to write a much better, you're going to create a much better pitch deck than I will.

Speaker 1 And so there's almost an unlimited ceiling in this Gen AI world on the type of quality that you can build. And so the way we think of our product is: we have a platform.

Speaker 1 We have actual technology that we're using to measure the quality that our workers or annotators are generating. If you don't have that technology, you don't have any way of measuring it.

Speaker 3 Is the measurement through human evaluation? Is it through model-based evaluation?

Speaker 3 I'm a little bit curious, like, how you create that feedback loop, since to some extent, it's a little bit of this question of how do you have enough evaluators to evaluate the output relative to the people generating the output, or do you use models, or how do you approach it?

Speaker 1 Like, I think one analogy that we often make is think about something like Google search, or think about something like YouTube.

Speaker 1 Like, you have, you know, millions of search results, you have millions of web pages, you have millions of videos. How do you evaluate the qualities of these videos?

Speaker 1 Like, is this a high-quality web page? Is it informative? Or is it really spammy? And the way you do this is you gather so many signals.

Speaker 1 You gather like page-dependent signals, you gather like user-dependent signals, you gather activity-based signals, and all of these feed into a giant ML algorithm at the end of the day.

Speaker 1 And so, in the same way, we gather all these signals about our annotators, about the work that they're performing, about their activity on the site, and we feed them into a lot of different algorithms; we basically have an ML team internally that builds these algorithms to measure all of this.
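
As a toy illustration of the "signals feeding into a model" idea above, here is a minimal hypothetical sketch of blending per-annotator signals into a single quality score. The signal names, thresholds, and weights are made-up assumptions, not Surge's actual system, which Edwin describes as a full ML pipeline.

```python
# Hypothetical sketch: blend per-annotator signals into a single quality score.
# Signal names, thresholds, and weights are made up for illustration only.
from dataclasses import dataclass

@dataclass
class AnnotatorSignals:
    gold_agreement: float           # accuracy on hidden "gold" tasks, 0-1
    reviewer_agreement: float       # agreement with trusted reviewers, 0-1
    median_seconds_per_task: float  # activity signal: implausibly fast work is a red flag
    reviewer_flags: int             # number of quality flags raised on this annotator

def quality_score(s: AnnotatorSignals) -> float:
    """Return a 0-1 quality estimate from the blended signals."""
    speed_penalty = 0.2 if s.median_seconds_per_task < 15 else 0.0
    flag_penalty = min(0.3, 0.05 * s.reviewer_flags)
    score = 0.6 * s.gold_agreement + 0.4 * s.reviewer_agreement - speed_penalty - flag_penalty
    return max(0.0, min(1.0, score))

if __name__ == "__main__":
    print(quality_score(AnnotatorSignals(0.92, 0.88, 45.0, 1)))  # roughly 0.85
```

A production system would presumably learn these weights from data rather than hand-tuning them; the sketch only shows how heterogeneous signals can collapse into one number that gates or routes work.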

Speaker 2 What is changing or breaking as you scale increasingly sophisticated annotations? If the model quality baseline is going up every couple of months, then the expectation is that it exceeds what a random human could do at some point, as you said with drawing a bounding box, and in all of these different fields we'll have models better than the 90th percentile at some point.

Speaker 1 So this is actually something that we do a lot of internal research on ourselves as well.

Speaker 1 So there's basically this field of AI alignment called scalable oversight, which is basically this question of how do you, how do you like have models and humans working together hand in hand to produce data that is better than either one of them can achieve on their own?

Speaker 1 Even today, take something like writing an SFT story from scratch. Sure, a couple of years ago, we might have written that story completely from scratch ourselves.

Speaker 1 But today that's just not very efficient, right? You might start with a story that a model created, and then you would edit it. You might edit it in a very substantial way.

Speaker 1 Like maybe just the core of it is very vanilla, very generic.

Speaker 1 But there's just so much cruft that is just inefficient for a human to do and doesn't really benefit from like the human creativity and human ingenuity that we're trying to add into the response.

Speaker 1 And so you can just start with like this bare bones structure that you're basically just layering on top of.

Speaker 1 And so, again, there are more sophisticated ways of thinking about scalable oversight, but it's just this question of how do you build the right interfaces?

Speaker 1 How do you build the right tools? How do you just combine people with AI in the right ways to

Speaker 1 make them more efficient? It's something that we build a lot of technology for.

Speaker 2 A lot of the discussion in terms of what human data the labs want has moved to

Speaker 2 RL environments and reward models in recent months.

Speaker 2 What is hard about this or what are you guys working on here?

Speaker 1 So we do a lot of work building RL environments. And I think one of the things that people really underestimate is how complicated it is, and that you can't just synthetically generate it.

Speaker 1 For example, you need a lot of tooling, because these are massive environments that people want.

Speaker 2 Can you give an example of like, just to make it more real?

Speaker 1 Like, imagine you are a salesperson. And when you are a salesperson, you need to be interacting with Salesforce. You need to be getting leads through Gmail. You're going to be talking to customers in Slack.

Speaker 1 You're going to be creating Excel sheets, tracking your leads. You're going to be, I don't know, writing Google Docs and making PowerPoint presentations to present things to customers.

Speaker 1 And so you want basically these very rich environments that are literally simulating your entire world as a salesperson. Like, it literally is just like, imagine like your entire world.

Speaker 1 So with everything on your desktop, and then in the future, everything that is, you know, not on your desktop as well. Like, maybe you have a calendar.

Speaker 1 Maybe you need to travel to a meeting to meet a customer. And then you want to simulate a car accident happening and you getting notified of that.

Speaker 1 So you need to leave a little bit earlier. All these things are things that we actually want to model in these very, very rich RL environments.

Speaker 1 And so the question is, how do you generate all the data that goes into this? Like, okay, you're going to need to generate like thousands of Slack messages, hundreds of emails.

Speaker 1 You need to make sure that these are all consistent with each other.

Speaker 1 You need to make sure that, like, going back to like my car example, you need to make sure that time is evolving in these environments and like certain like external events happen.

Speaker 1 Like, how do you do all this? And how do you do it in a way that's actually interesting and creative, but also realistic and not incongruent with itself?

Speaker 1 There's just a lot of thought that needs to go into these environments to make sure that they're rich, creative environments that the models can learn interesting things from.

Speaker 1 And so, yeah, you basically need a lot of tooling and a lot of sophistication for creating these.
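
As a toy illustration of the consistency problem Edwin describes, here is a hypothetical check, not Surge's tooling, that simulated events in an environment (emails, Slack messages, calendar items) never reference events that happen after them. The data model and field names are assumptions.

```python
# Hypothetical sketch: sanity-check that synthetic environment events are
# time-consistent, e.g. a Slack reply never references an email sent later.
# The data model and field names are illustrative, not a real schema.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class SimEvent:
    event_id: str
    channel: str                                    # "email", "slack", "calendar", ...
    timestamp: datetime
    references: list = field(default_factory=list)  # ids of earlier events it mentions

def check_time_consistency(events):
    """Return human-readable violations where an event points at something later than itself."""
    by_id = {e.event_id: e for e in events}
    violations = []
    for e in events:
        for ref in e.references:
            target = by_id.get(ref)
            if target is None:
                violations.append(f"{e.event_id} references unknown event {ref}")
            elif target.timestamp > e.timestamp:
                violations.append(f"{e.event_id} references {ref}, which happens later")
    return violations

if __name__ == "__main__":
    events = [
        SimEvent("email-1", "email", datetime(2025, 1, 6, 9, 0)),
        SimEvent("slack-1", "slack", datetime(2025, 1, 6, 8, 30), references=["email-1"]),
    ]
    print(check_time_consistency(events))  # ['slack-1 references email-1, which happens later']
```

Checks like this are only the mechanical floor; the richness and creativity Edwin emphasizes still has to come from how the environment is authored.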

Speaker 2 Is there any intuition for how real or how complex is enough? Or is there just no ceiling on the realism or the complexity of environment that is useful here?

Speaker 1 I think there's no ceiling.

Speaker 1 Like at the end of the day, you just want as much diversity and richness as you can get, because the more richness you have, the more the models can learn from.

Speaker 1 The longer the time horizons, the more the models can learn on and improve on. So I think there's almost an unlimited ceiling here.

Speaker 2 If you were to make a five or 10 year bet on what scales most in terms of demand from people training AI models, and the types of data, is it RL environments, or is it traces of expert reasoning, or what other areas do you think there's going to be really large demand for?

Speaker 1 I mean, I think it will be all of the above.

Speaker 1 Like I don't think RL environments alone will suffice, just because, I mean, it depends on how you think about RL environments, but oftentimes these are very, very rich trajectories, or very, very long ones.

Speaker 1 And so it's almost inconceivable that a single reward would work. I mean, I think even today, we often think about things in terms of multiple rewards, not just a single reward, but a single reward may just not be rich enough to capture all of the work that goes into the model solving some very, very complicated goal.

Speaker 1 So at the end, it will probably be a combination of all those.

Speaker 3 If you assume eventually

Speaker 3 some form of superhuman performance across different model types relative to human experts, how do you think about the role of humans relative to data and data generation versus synthetic data or other approaches?

Speaker 3 Like at what point does human input sort of run out as a useful point of either feedback or data generation?

Speaker 1 So I think human feedback will never run out. And that's for a couple of reasons.

Speaker 1 So, I mean, even if I think about the landscape today, I think people often overestimate the role of synthetic data. Personally, I think synthetic data actually is very, very useful.

Speaker 1 Like, we use it like a ton ourselves in order to supplement what the humans do. Like, again, like, like I said earlier, there's like a lot of cruft that

Speaker 1 isn't worth a human's time.

Speaker 1 But what we often find is that, for example, a lot of the time customers will come to us and they'll be like, yeah, for the past six months, I've been experimenting with synthetic data.

Speaker 1 I've gathered 10 to 20 million pieces of synthetic data. Actually, we finally realized that 99% of it just wasn't useful.

Speaker 1 And so we're trying to find right now, we're trying to curate the 5% that is useful, but we are literally going to throw out 9 million of it.

Speaker 1 And oftentimes you'll find that actually a thousand pieces of high-quality human data, highly curated, really, really high-quality human data, is actually more valuable than those 10 million points.

Speaker 1 So that is one thing I'll say. The other thing I'll say is that sometimes you need an external signal for the models. The models just think so differently from humans that

Speaker 1 you always need to make sure that they're aligned with the actual objectives that you want. Let me give two examples. The first one is kind of funny.

Speaker 1 So, one of the frontier models, let me just say it's one of them; one of the models everybody would think of as one of the top.

Speaker 1 If you go use it today, maybe 10% of the time when I use it, it will just output random Hindi characters or random Russian characters in the middle of my responses.

Speaker 1 So I'll be like, tell me about Donald Trump, tell me about Barack Obama.
And just like in the middle of it, it will just output Hindi and Russian. It's like, what is this?

Speaker 1 And the model just isn't like self-consistent enough to be aware of this. It's almost like you need

Speaker 1 an external human to tell the model that, yeah, this is wrong. One of the things I think is a giant plague on AI is LMSYS's LM Arena.
And I'll skip the details for now.

Speaker 1 But right now, people often train their models on the wrong objectives. The mental model that you should have of LM Arena is that

Speaker 1 people are writing prompts. They'll get two responses, and they'll spend like five, 10 seconds looking at their responses, and they'll just pick whichever one looks better to them.

Speaker 1 So they're not evaluating whether or not the model hallucinated. They're not evaluating the factual accuracy and whether it followed any instructions.

Speaker 1 They're literally just vibing with the model and like, okay, yeah, like this one seemed better because it had a bunch of formatting. It had a bunch of emojis.
It just looks more impressive.

Speaker 1 And people will train on, basically, the LM Arena objective, and they won't realize all the consequences of it. And again, the model itself doesn't know what its objective is.

Speaker 1 It's like you almost need like an external like quality signal in order to tell it what the right objective should be.

Speaker 1 And if you don't have that, then the model will just go in all these crazy directions.

Speaker 1 Again, you may have seen some of the results with Llama 4, but models just go in all these crazy directions, which kind of means you need these external validators.

Speaker 3 This also happens actually when you do different forms of like protein evolution or things like that, where you select a protein against a catalytic function or something else.

Speaker 3 And you just kind of randomize it and have like a giant library of them.

Speaker 3 And you end up with the same thing, where you have these really weird activities that you didn't anticipate actually happening.

Speaker 3 And so I sometimes think of model training as almost this odd evolutionary landscape that you're effectively evolving and selecting against, and you're kind of shaping the model into that local maxima or something.

Speaker 3 And so it's kind of this really interesting

Speaker 3 output of anything where you're effectively evolving against a feedback signal. And depending on what that feedback signal is, you just end up with these odd results.

Speaker 3 So it's interesting to see how it kind of transfers across domains.

Speaker 1 Of course.

Speaker 2 As you said, five-second-reaction academic benchmarks, or even non-academic industrial benchmarks, are easily hacked, or are not the right gauge of performance against any given task.

Speaker 1 They are very popular.

Speaker 2 What is the alternative for somebody who's trying to like choose the right model or understand model capability?

Speaker 1 So the alternative that I think all the frontier labs view as gold standards is basically human evaluation.

Speaker 1 So again, proper human evaluation where you're actually taking the time to look at the response. You're going to fact check it.
You're going to see whether or not it followed all the instructions.

Speaker 1 You have good taste. So you know whether or not the model has good writing quality.

Speaker 1 Like this concept of doing all that and spending all the time to do that as opposed to just vibing for five seconds, I think actually is really, really important. Because if you don't do this,

Speaker 1 you're basically just training your models on the analog of clickbait.

Speaker 1 So I think it actually is really, really important for model progress.

Speaker 2 If it's not LMSYS, how should people actually evaluate model capability for any given task?

Speaker 1 What all the frontier labs find is that human evals really are the gold standard. You really need to take a lot of time to fact-check these responses and to verify that they followed the instructions.

Speaker 1 You need people with good taste to evaluate the writing quality, and so on and so on. And if you don't do this, you're basically training your models on the analog of clickbait.

Speaker 1 And so I think that really, really harms model progress.

Speaker 2 Is there work that Surge is doing in this domain of like trying to standardize human eval or make it more transparent to end consumers of the API or even users?

Speaker 1 So internally, we do a lot of work today with all the frontier labs to help them understand their models. So again, we're constantly evaluating them.

Speaker 1 We're constantly surfacing loss areas for them to improve on, and so on and so on.

Speaker 1 And so right now, a lot of this is internal, but one of the things that we actually want to do is start external forms of this as well, where we're helping educate people on, yeah, like these are the different capabilities of all these models.

Speaker 1 Here, these models are better at coding. Here, these models are better at instruction following.
Here, these models are actually hallucinating a lot, so you just shouldn't trust them as much.

Speaker 1 So we actually do want to start a lot of external work to help educate the broader landscape on this.

Speaker 2 If we can zoom in and talk about the larger competitive landscape and what happens with frontier models over time: what does the Meta/Scale deal mean for you guys? Or what do you make of it?

Speaker 1 So I think it's kind of interesting. We were already the number one player in the space.
It's been beneficial because, yeah, there were still some legacy teams using Scale.

Speaker 1 Like they just didn't know about us because we were still pretty under the radar. I think it's been beneficial because one of the things that we've always believed is that

Speaker 1 sometimes when you use these low quality data solutions, people kind of get burned on human data. And so they had this negative experience.
And so then they don't want to use human data again.

Speaker 1 And so they try these other methods that are honestly just a lot slower and don't come with the right objectives. And I think that just harms model progress overall.

Speaker 1 So the more we can get all these frontier labs using high-quality data, the more beneficial it is for the industry as a whole. I think overall, it was a good thing to happen.

Speaker 2 If you were to make a bet that an underdog catches up to OpenAI, Anthropic, and DeepMind, who would it be?

Speaker 1 So I would bet on XAI.

Speaker 1 I think they're just very hungry and mission-oriented in a way that gives them a lot of really unique advantages.

Speaker 2 I guess maybe another sort of broader question is: do you think there are three competitive frontier models or 10 competitive frontier models a couple of years from now? And are any of those open source?

Speaker 1 Yeah. So I actually see more and more frontier models opening up over time because I actually don't think that the models will be commodities.

Speaker 1 Like, I think one of the things that has actually been surprising over the past couple of years is that you see all of the models have their own focuses that give them unique strengths.

Speaker 1 For example, I think Anthropic has obviously been really, really amazing at coding and enterprise. And OpenAI has this big consumer focus because of ChatGPT; I actually really love its models' personality.

Speaker 1 And then Grok, you know, just has a different set of things that it's willing to say and to build.

Speaker 1 And so it's almost like every company has a different set of principles that they care about. Some will just never do one thing; others are totally willing to do it.

Speaker 1 Models will just have so many different facets to their personality, so many different facets to the types of skills that they will be good at.

Speaker 1 And sure, eventually AGI will maybe encompass all of this. But in the meantime, you just kind of need to focus.
Like there's only so many focuses that you can have as a company.

Speaker 1 And so I think that just will lead to like different strengths for all the model providers.

Speaker 1 So, I mean, I think today, you know, we already see a lot of people, including me, switching between all the different models, just depending on what we're doing.

Speaker 1 And so in the future, I think that will just happen even more, as people use more and more models for different aspects of their lives, both their personal and their professional lives.

Speaker 2 Going back to something Elad mentioned, where should we expect to see Surge investing over time? What do you think you guys will do a few years from now that you don't do today?

Speaker 1 Again, I think I'm really excited about this more public research push that we're starting to have.

Speaker 1 I think it is really interesting in that, for obvious reasons, a lot of the frontier labs are just not publishing anymore. And as a result of that, I think

Speaker 1 it's almost like the industry has fallen into kind of a trap that I worry about. So maybe to dig into some of the things I said earlier

Speaker 1 with some of the negative incentives of the industry and some of the kind of concerning trends that we've seen.

Speaker 1 So going back to LMSYS, one of the things that we'll see is that a lot of researchers will tell us that their VPs make them focus on increasing their ranking on LM Arena.

Speaker 1 And so I've had researchers explicitly tell me that they're okay with making their models worse.

Speaker 1 at factuality, worse at following instructions, as long as it improves their ranking, because their leadership just wants to see these metrics go up.

Speaker 1 And again, that is something that literally happens, because the people rating these comparisons on LM Arena don't care whether the models are good at instruction following.

Speaker 1 They don't care whether the models are emitting factual responses.

Speaker 1 What they care about is, okay, did this model emit a lot of emojis? Did it emit a lot of bold words? Did it have really long responses? Because that's just going to look more impressive to them.

Speaker 1 One of the things that we found is that the easiest way to improve your rank on LM Arena is literally to make your model responses longer.

Speaker 1 And so what happens is, there are a lot of companies who are trying to optimize directly for leaderboard rank.

Speaker 1 So they'll see progress for six months because all they're doing is unwittingly making their model responses longer and adding more emojis.

Speaker 1 And they don't realize that all they're doing is training their models to produce better clickbait.

Speaker 1 And they might finally realize six months or a year later, like again, you may have seen some of these things in industry, but it basically means that they spend the past six months making zero progress.
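
To make the length-hacking failure mode concrete, here is a toy simulation with invented numbers and a fictional rater, not LM Arena data, showing how a model that merely writes longer answers can appear to win on raw preference rate, while comparing only length-matched pairs pulls the win rate back toward 50%.

```python
# Toy simulation of the length bias described above. The rater behavior and
# length distributions are invented; this is not LM Arena data.
import random

random.seed(0)

def rater_prefers_a(len_a, len_b):
    """Fictional rater: 70% of the time it just picks the longer response."""
    if random.random() < 0.7:
        return len_a > len_b
    return random.random() < 0.5

# Model A writes longer responses than model B but is otherwise no better.
trials = [(random.randint(300, 600), random.randint(100, 400)) for _ in range(10_000)]
wins_a = sum(rater_prefers_a(a, b) for a, b in trials)
print(f"Raw win rate for the longer model: {wins_a / len(trials):.0%}")

# Length-matched comparison: only count pairs of similar length, which pulls
# the win rate back toward 50% and exposes the spurious advantage.
matched = [(a, b) for a, b in trials if abs(a - b) < 20]
wins_matched = sum(rater_prefers_a(a, b) for a, b in matched)
print(f"Win rate on length-matched pairs: {wins_matched / max(1, len(matched)):.0%}")
```

The careful human evaluation Edwin advocates plays a similar role: it scores responses on factuality and instruction following rather than on surface signals like length or formatting.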

Speaker 1 And in a similar way, I think, you know, besides LM Arena, you have all these academic benchmarks, and they're completely divorced from the real world.

Speaker 1 Like a lot of teams are focused on improving these SAT-style scores instead of real-world progress.

Speaker 1 I'll give an example. There's a benchmark called IFEval.

Speaker 1 IFEval stands for instruction-following eval, and if you look at it, some of the instructions it's checking whether models can follow are like,

Speaker 1 hey, can you write an essay about Abraham Lincoln? And every time you

Speaker 1 mention the word Abraham Lincoln, make sure that five of the letters are capitalized and all the other letters are uncapitalized. It's like, what is this?

Speaker 1 And sometimes we'll get customers telling us, yeah, we really, really need to improve our score on IFEval.

Speaker 1 And what this means is that you have all these companies, or all these researchers, who instead of focusing on real-world progress are just optimizing for these silly SAT-style benchmarks.

Speaker 1 And so one of the things that we really want to do is just think about ways to educate the industry, think about ways of publishing on our own, just like think about ways of steering the industry into like, hopefully a better direction.

Speaker 1 And so I think that's one big thing that we're really excited about, and it could be really big in the next five years.

Speaker 3 I mean, so Sarah brought up earlier how everybody kind of wants high-quality data.

Speaker 3 What does that mean? How do you think about that? How do you generate it? Can you tell us a little bit more about your thoughts on that?

Speaker 1 So let's say you wanted to train a model to write an eight-line poem about the moon.

Speaker 1 And so the way most companies think about it is, well, let's just hire a bunch of people from Craigslist or through some recruiting agency and let's ask them to write poems.

Speaker 1 And then the way they think about quality is: well, is this a poem? Is it eight lines? Does it contain the word moon? If so, okay, yeah, it hit these three checkboxes.

Speaker 1 So yeah, sure, this is a great poem because it follows all these instructions. But if you think about it, like, the reality is you get these terrible poems.

Speaker 1 Like, sure, it's eight lines and it has the word moon, but they feel like they're written by kids from high school.

Speaker 1 And so, other companies feel like, okay, sure, these people on Craigslist don't have any poetry experience. So, what I'm going to do instead is hire a bunch of people with PhDs in English literature.

Speaker 1 But this is also terrible. Like, a lot of PhDs, they're actually not good writers or poets.
Like, if you think of, like, think of Hemingway or Emily Dickinson, they definitely didn't have a PhD.

Speaker 1 I don't think they even completed college. And one of the things I will say is, yeah, I went to MIT. I think, Elad, you went there too.

Speaker 1 And a lot of people I knew from MIT who graduated with a CS degree, they're terrible coders. And so we think about quality completely differently.

Speaker 1 Like what we want isn't poetry that just checks some boxes and uses some complicated language.

Speaker 1 We want a type of poetry that Nobel Prize laureates would write. And so what you want is like, okay, we want to recognize that poetry is actually really subjective and rich.

Speaker 1 Like maybe one poem is a haiku about the moon, white on water.

Speaker 1 And there's another poem that has a lot of internal rhyme and meter, and another one that, I don't know, focuses on the emotions behind the moon rising at night. And so you actually want to capture that there are thousands of ways to write a poem about the moon; there isn't a single correct way, and each one gives you all these different insights into language and imagery and poetry. And if you think about it, it's not just poetry; it's the same with math.

Speaker 1 There's a thousand ways probably to prove the Pythagorean theorem. And so I think the difference is that when you think about quality the wrong way, you kind of get commodity data that

Speaker 1 optimizes for things like inter-rater agreement and, again, checking boxes on some list.

Speaker 1 But one of the things that we try to teach all of our customers is that high-quality data actually really embraces human intelligence and creativity.

Speaker 1 And when you train the models on just like richer data, they don't just learn to follow instructions.

Speaker 1 They really learn all these deeper patterns about all the stuff that makes language in the world really compelling and meaningful.

Speaker 1 And so I think a lot of companies, they just throw humans at the problem and they think that you can get good data that way.

Speaker 1 But I think you really need to think about quality from first principles and what it means.

Speaker 1 And you need a lot of technology to identify, yeah, that these are amazing poems and these are creative math problems.

Speaker 1 and these are games and web apps that are beautiful and fun to play, and these ones are terrible to use.

Speaker 1 So, I think you really need to build a lot of technology and think about quality in the right way. Otherwise, you're basically just like scaling up mediocrity.

Speaker 2 That sounds very domain-specific. So, in every domain, are you building a lens on what quality looks like along with your partners?

Speaker 1 Yeah, I mean, I think we have holistic quality principles, but then oftentimes there are differences per domain. So it's a combination of both.

Speaker 2 I think we got all the core topics. Nice work on podcast number two, Edwin, and thanks for doing this. Congrats on all the progress with the business.

Speaker 3 Yeah, no, thanks so much for having us.

Speaker 1 Yeah, it's great, great meeting you guys.

Speaker 2 Find us on Twitter at no priors pod. Subscribe to our YouTube channel if you want to see our faces.
Follow the show on Apple Podcasts, Spotify, or wherever you listen.

Speaker 2 That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.