163. The Data Sleuth Taking on Shoddy Science
As an academic, I did a lot of research trying to catch bad actors.
Everyone from cheating teachers to terrorists to sumo wrestlers who are throwing matches.
What I didn't do much of though was to try to catch cheating academics.
My guest today, Uri Simonsohn, is a behavioral science professor who's transforming psychology by identifying shoddy and fraudulent research.
Watching his exploits makes me wish I'd spent more time doing the same.
We talk about red flags versus smoking guns.
So red flags, that gives you, like, probable cause, so to speak.
But it's not enough to raise an accusation of fraud.
That's a smoking gun.
Welcome to People I Mostly Admire with Steve Levitt.
Uri Simonsohn and two other academics, Joe Simmons and Leif Nelson, run a blog called Data Colada, where they debunk fraud, call out cheaters, and identify misleading research practices.
Uri on his own has been doing this work for over a decade.
My Freakonomics friend and co-author Stephen Dubner spoke to the Data Colada team for a series about academic fraud that ran on Freakonomics Radio in 2024.
But I admire Uri and his collaborators so much that I wanted the chance to talk to him myself.
I started our conversation by asking about the research study that got him started in this direction.
The study that he read and said to himself, my God, this is outrageous.
I just can't take it anymore.
The first time I ever did any sort of commentary or criticism, I was asked to review a paper for a journal.
What they were studying was the impact of your name on your big life decisions.
The paper began with something like, it's been shown that people with the same initial are more likely to marry each other than expected by chance.
In this paper, we check whether they have similar divorce rates or something like that.
And the idea was, oh, maybe if you marry somebody just because they have your same initial, you're less compatible than if you follow more traditional motivation for marriage.
And I thought, no way.
And so I stopped reviewing and I went to the original paper.
And it's not that I thought there's no way that your name impacts these decisions.
I could imagine some mechanisms where like you're talking to somebody and then they happen to share a name with you and that can be a nice icebreaker and it would lead to a relationship.
But what I thought is how in the world do you study this credibly?
And so that led me to an obsession.
I went to the original and I thought, okay, clearly it's going to be some ethnic confound, for example.
Because different ethnicities have different distributions of last names, right?
So very few South American people have a last name starting with a W, but many Asian people do.
And so because Asians marry Asians and South Americans marry South Americans, that could explain it.
But it was better than that, the original study.
So it took me a while.
And then I figured out what the problem was.
The very first one I checked was with the same initial last name.
So the idea is that if your last name starts with an S, you're more likely to marry somebody else whose name starts with an S.
And what I found was that the effect was driven entirely by people who have the exact same last name.
So I thought, why would people have the same last name and be more likely to marry?
Like, how would that happen?
And a common mechanism is that there's a couple, they get married, she changes her last name to his, they divorce, and then they remarry each other.
Now, this is rare.
This is rare.
But because you expect so few people to marry somebody else with the same last name, it's such a huge coincidence that even a small share of these people, they can generate an average effect that's sizable.
Oh, yeah, that's great.
That's clever.
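A back-of-the-envelope sketch of that mechanism, with made-up numbers (the conversation doesn't give the study's actual rates), shows how a tiny share of exact-surname remarriages can pass for a "same initial" effect:

```python
# Hypothetical illustration: a tiny share of exact-surname remarriages can
# inflate the apparent "same last initial" marriage rate. All numbers invented.
baseline_same_initial = 0.06       # assumed chance rate of spouses sharing a last initial
share_remarried_same_name = 0.005  # assumed share of couples who divorced and remarried
                                   # each other after she had taken his surname

observed = (1 - share_remarried_same_name) * baseline_same_initial \
           + share_remarried_same_name * 1.0    # these couples always "match"

print(f"chance rate:   {baseline_same_initial:.2%}")   # 6.00%
print(f"observed rate: {observed:.2%}")                # about 6.47%, an ~8% relative bump
# In a large marriage registry, a bump that size is easily "statistically
# significant" even though no name-based attraction exists.
```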
So later, you and Joe Simmons and Leif Nelson published a paper in 2011 called False Positive Psychology.
And it turned out to be an incredibly influential paper.
You and your co-authors highlight specific practices commonly done by researchers that can lead to drawing the wrong conclusions from a randomized experiment.
The core of the paper is really made up of simulations that show quantitatively how various researcher choices lead to exaggerated levels of statistical significance.
So it appears in a published paper that a hypothesis is true, even when it really isn't.
What was the motivation behind writing that paper?
Me, Leif, and Joe, we were going to conferences and we were not believing what we were seeing and we were sticking with our priors.
And so then what's the point?
You should read a paper that surprises you and you should update.
It doesn't mean you should believe with certainty it's true, but you should update.
You should be more likely to believe it's true than you did before reading the paper.
And we were not experiencing that.
It is one of the few academic papers that has caused me to actually laugh out loud because as part of that paper, you describe in a very serious way an actual randomized experiment you yourself ran in which you find that listening to the song When I'm 64 by the Beatles actually causes time to reverse.
Will you still need me?
Will you still feed me when I'm 64?
People who listen to that song, they're almost 1.5 years younger after they listen to the song than before.
And obviously, that makes no sense at all, which is the whole point.
But you report the results in the same scientific tone that pervades all academic papers.
And I found it to be hysterically funny.
So let me start by giving my best attempt to describe the textbook version of a randomized experiment.
That's the gold standard of scientific inquiry.
Here's my attempt.
The researcher starts by posing a clear hypothesis that he or she wants to test.
So in your When I'm 64 paper, this hypothesis would be that listening to that song causes time to run in reverse, leaving people who listen to it younger after they listen to it than before.
And then the researcher poses a second alternative hypothesis called the null hypothesis, to which that first hypothesis is compared.
In this case, the null hypothesis would probably be that listening to When I'm 64 does not cause time to reverse.
Right.
Then the researcher maps out a detailed experimental protocol to test these two competing hypotheses, randomly assigns subjects to a treatment group and a control group, and runs the experiment.
And then, using very simple high-school-level statistics, you determine whether there are any statistically significant differences in age across the subjects in the two groups.
And if I ran that experiment as described, you would be inclined to believe my results, whatever they were.
That's right.
Okay.
So let's talk now about specifically how you used standard methods employed in the psychology literature at the time to prove that this Beatles song reverses time.
The first common practice you talk about is measuring the outcome you care about in multiple ways, but only reporting results for the outcome variable that yields the best results.
So all you need to do is give yourself enough chances to get lucky.
We can think of it in terms of p-values: even when there's nothing there, there's like a one-in-20 chance that you get lucky.
Right.
So the p-value refers to the probability value, roughly the likelihood of observing an effect like the one you're studying just by chance.
In academic publishing, for reasons I don't really fully understand, we've anointed the 5% level of statistical significance as some kind of magic number, right?
So if, when your story is not really true, you'd only get data that look like this less than 5% of the time, then that somehow magically leads people to say that your theory is true.
If it's above 5%, then we tend to say, oh, you haven't proven yet that your theory is true.
So, like, suppose you have a friend who says, I can forecast basketball games, and they get it right for one game.
You're like, well, it was 50-50 that you'd get it right, so I'm not impressed.
So, then they get two games in a row.
It's like, oh, okay, that's more surprising.
It's only a 25% chance that if you were just tossing a coin, you would guess two basketball games correctly.
But you're still not sufficiently convinced.
But when they do five in a row, then you think, oh, maybe they can actually forecast basketball games because it would be so unlikely that just by chance you'll get five in a row that I guess the alternative becomes my candidate theory.
So I guess you can predict basketball games.
That's the logic of that.
Like at some point, you're forced to get rid of just chance as the explanation.
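For readers who want the arithmetic behind that analogy, here is the quick calculation, using the 50-50 and one-in-20 numbers from the conversation:

```python
# Chance of guessing n games correctly by pure coin-flipping.
for n_games in [1, 2, 5]:
    p_by_luck = 0.5 ** n_games
    print(f"{n_games} correct picks by luck: p = {p_by_luck:.3f}, "
          f"below the .05 bar: {p_by_luck < 0.05}")
# 1 game -> 0.500, 2 games -> 0.250, 5 games -> 0.031: only the five-game streak
# is rarer than the conventional one-in-20 threshold.
```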
Okay.
So in this When I'm 64 experiment, what you're saying is it was like a friend asked you to predict the outcome of five basketball games in a row.
But what you secretly did in the background is you actually predicted the outcome of five basketball games, not once, but maybe 100 times.
You had 100 different series of five basketball games, and there was one of them out of those 100 that actually gave you this crazy result.
And then you reported, that's what you got.
Yeah.
Like imagine your friend said, okay, I'm going to predict five basketball games, but they're also predicting five baseball games and five football games and five whatever games.
And then whichever one worked is the one they tell you about.
And the ones that didn't work, they don't tell you about.
So if you have enough dice, even if it's a 20-sided die, like only a one-in-20 chance, if you keep rolling that die, eventually it's going to work out.
And so the way academics, or researchers in general, can throw multiple dice at the same time and get the significance they're looking for is they can run different analyses, which is what we did there.
So we were comparing participants' age across the two groups.
So that's one die, but we actually had three groups.
So we had when I'm 64, we had a control song, which is a song that comes with Windows as a default in the background.
And we also had the hot potato song, which is a kid's song.
And so that could have had the opposite effect.
So we had three.
We could have compared hot potato to control, or we could have compared hot potato to when I'm 64, or when I'm 64 to control.
So right there, we have three dice that we were throwing.
But we also could adjust our results.
So the one we ended up publishing was controlling for how old the participants' fathers were.
Okay, and the logic was, look, there's a lot of natural variation in how old people are.
And so to be able to more easily detect differences induced by our manipulation, we want to control some of that noise, right, to take that into account.
And so, one way to take people's age into account indirectly is to ask how old their parents are.
And so, in our regression, statistically, we take that into account.
And when we did that, the effect arose.
And why?
Because if you do that, now we have three more dice, right?
We can do controlling for father's age, hot potato versus control, controlling for father's age, hot potato versus when I'm 64, and so on.
And so, in the end, we have like many, many different ways we could cut the data.
And the one that worked is the one we ended up writing about.
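A minimal simulation sketch of the "many dice" point being made here. The parameters are invented and this is not the False Positive Psychology authors' actual code; it only shows how three conditions plus an optional covariate, with only the best comparison reported, inflate false positives when there is no true effect:

```python
# A sketch of how flexible analysis inflates false positives under a true null.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, n_sims, alpha = 20, 5_000, 0.05
false_positives = 0

for _ in range(n_sims):
    # Three conditions, no real effect on the outcome ("reported age").
    groups = [rng.normal(20, 3, n_per_group) for _ in range(3)]
    # A covariate ("father's age") correlated with the outcome.
    covs = [g + rng.normal(27, 5, n_per_group) for g in groups]

    pvals = []
    for i, j in [(0, 1), (0, 2), (1, 2)]:
        # Dice 1-3: plain comparison of the two conditions.
        pvals.append(stats.ttest_ind(groups[i], groups[j]).pvalue)
        # Dice 4-6: the same comparison after adjusting for the covariate
        # (a crude ANCOVA via residuals from a pooled regression).
        x = np.concatenate([covs[i], covs[j]])
        y = np.concatenate([groups[i], groups[j]])
        slope, intercept = np.polyfit(x, y, 1)
        resid = y - (slope * x + intercept)
        pvals.append(stats.ttest_ind(resid[:n_per_group], resid[n_per_group:]).pvalue)

    if min(pvals) < alpha:        # report whichever die happened to "work"
        false_positives += 1

print(f"nominal alpha: {alpha:.0%}, actual false-positive rate: "
      f"{false_positives / n_sims:.0%}")   # well above 5%
```

The exact rate depends on how correlated the different analyses are; the point is only that what gets reported is the minimum of many p-values, not a single pre-committed one.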
The way you just described it, it is completely and totally obvious that you're cheating if you test a bunch of outcomes and then you just choose not to tell people about the ones that don't work and you focus all your attention on the one that does work.
You're obviously misleading people into believing your hypothesis is more powerful than it really is.
So how could the academics not realize this was bad science?
Do you think they really didn't understand that this was cheating?
I do.
I do because I had many conversations where people were pushing back and saying there's nothing wrong with it.
I think there are two ingredients to it. One of them is just not knowing the statistics.
One of it is just not knowing the statistics.
Most people who take statistics don't learn statistics.
For some reason, it's profoundly counterintuitive to humans.
It's just not how we think.
And the other reason is we're very good storytellers.
And so what happens is the moment you know what works, you immediately have a story for why it makes sense.
I remember the first time I presented the When I'm 64 study, somebody in the audience asked a question jokingly about some decision we had made.
My instinct was to immediately defend it.
Like, we are just so trained, that's what you do.
So I don't think people were cheating in the sense that they thought it was wrong.
They just didn't know and they didn't quite appreciate the consequences.
I just want to say, it's not just psychology.
This is very common in clinical research.
If somebody is running an experiment, it can be in medicine and economics and biology.
Like at the time, I was talking to scientists from all fields.
And this is a very widespread problem.
Okay, so that's good to point out because one could easily say, well, I'm not that worried if psychologists are messing around.
But when medical researchers are messing around, now you're actually getting into things people really care about.
Okay, let's talk about the second misleading research practice that you highlight.
And this one's a lot more subtle than the one we just talked about.
The researcher designs an experiment and carries it out.
And then he or she looks at the data and sees that the results, oh, they're not quite statistically significant.
Everything's going in the right direction, but it didn't quite reach this magic 0.05 threshold.
So it seems sensible in that situation.
You say, well, look, I just didn't have enough power.
That's what we call it in experimental design when you don't have enough research subjects to detect an effect that's really there.
You don't have enough power.
And so maybe I'll just go and add another 15 or 20 observations and I'll see if it's significant.
Oh, and maybe again, it wasn't quite significant.
I'll add 20 more.
Boom, I'm over the threshold and then I stop.
Now, intuitively, this doesn't seem nearly as bad to me as not reporting all the outcome variables.
But as you show, my intuition is wrong.
This is actually a really bad practice.
Can you try to explain why in a way people can understand?
Yeah.
Let's think of a sport.
Let's say tennis.
And let's say you're playing tennis with somebody who's similarly skilled as you are.
And so beforehand, if you had to guess who's going to win, it's like a 50-50 chance.
But suppose we change the rules and Steve gets to say when the game ends.
We don't play to three sets.
We play to whenever you want to end it.
Okay.
And you're one of the two players.
You may see that now you're much more likely to win the match.
Yeah, if I win the first point, match over.
That's right.
So if at any point during the game, you are ahead, you win the game.
And therefore, now the probability is not that you win after three sets, but it's that you're ahead at any point in the game.
And that necessarily has to be much more likely.
And so, similarly, with an experiment, when we do the stats, what the math is doing in the background is saying, well, if you committed to 60, how likely is it that after 60, you will have an effect this strong?
But what we should be asking is, what is the likelihood that at any point up to 60, your hypothesis will be ahead by a lot?
That's the question we should be asking.
And that's necessarily much more likely.
Yeah.
And the key is that you stop when you win and you keep on going when you're losing.
That's what introduces the bias.
It's not a random decision.
If you were to flip a coin, do I keep collecting subjects or not?
Then there will be no problem.
The problem is the way you said it.
If you're losing, you keep playing, but if you're winning, then you end.
It's unlike the first point, where the academics I would talk to about having multiple outcomes totally got why that wasn't legit.
But I can still have conversations with experimentalists who will argue with me about this point and say I'm dead wrong.
How can I not understand this?
This is a good research practice, not a bad research practice.
But as you show in the paper, and as the intuition you just described explains, it's really a bad practice.
We're not good intuitive thinkers about statistics, especially about conditional probability, and this has that flavor; that's the source of the problem.
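A minimal sketch of the optional-stopping problem described above, with invented batch sizes and caps rather than anything from the episode. The point is only that "add subjects until p < .05" rejects a true null far more than 5% of the time:

```python
# Optional stopping under a true null: test after every batch of new participants
# and stop as soon as p < .05, or give up at a cap. Batch sizes are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, start_n, step, max_n, alpha = 5_000, 20, 10, 100, 0.05
rejections = 0

for _ in range(n_sims):
    a = list(rng.normal(0, 1, start_n))    # "treatment" group, no true effect
    b = list(rng.normal(0, 1, start_n))    # control group
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p < alpha or len(a) >= max_n:   # stop when "winning", or at the cap
            rejections += int(p < alpha)
            break
        a.extend(rng.normal(0, 1, step))   # "just add 10 or 20 more subjects"
        b.extend(rng.normal(0, 1, step))

print(f"false-positive rate with peeking: {rejections / n_sims:.1%} (nominal: 5.0%)")
```

The more often you peek, the worse the inflation gets, which is why pre-committing to a sample size (or using proper sequential methods) matters.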
We'll be right back with more of my conversation with behavioral scientist Uri Simonsohn after this short break.
What I find so beautiful about this paper is that it is really so simple.
It's so easy to understand.
It's so obvious in a way.
And yet a whole field of academics was totally blind to it until you pointed it out.
And at least the way I've heard the story told, you and your co-authors didn't think you'd even be able to publish the paper, much less imagine that it would emerge as one of the most important papers published in psychology in the last two decades.
We thought it was uncitable because we thought, how can you cite this paper?
In what context?
Like you would say, well, we didn't do this weird thing that they talked about, citation.
So we thought, okay, maybe it'd be influential, it'd be hard to publish, and it'd be uncitable.
And it's incredibly cited, like crazy.
We were super wrong.
The bottom line, which I think is really stunning to most people who read your paper, is that in your simulations, when you test a hypothesis that you've built, by design, not to be true, but you apply all of these different tweaks together, then over 60% of the time you get statistically significant results for your hypothesis.
Okay, these are 100% false hypotheses that 60% of the time appear to be true.
That's crazy.
It surprised me.
Did it surprise you how big that number was when you first ran the simulations?
We were floored.
That's when we decided we definitely have to do the paper.
Yeah, so I've had a lot of psychologists as guests on this show, and they have reported some truly remarkable findings.
And I suspect I should have been more skeptical of them than I was.
But it's also odd to only believe research that confirms your beliefs.
It's a hard line to follow.
I guess it's why we so desperately need credibility in research is that when research is not credible, then you just default to your own intuition.
But if you're just defaulting to your own intuition, like you go back to Socrates and Aristotle, you're no longer empirically driven.
One of the things we've been doing is advocating for pre-registration, which means people tell you how they will analyze the data before they actually run the study.
So closer to the way you were describing the ideal experiment.
And there has been substantial uptake of this idea of pre-registration.
So when you see the results, you can evaluate the evidence much closer to the face value of what the statistics tell you.
So in that initial paper, you laid out a simple set of rules for how to create a body of research that's more credible.
And one of them is this pre-registration.
Another really simple one is making your raw data available.
I think this will amaze people who are outside of academics.
But until recently, until after what you did, and in large part probably because of what you did, academics were not expected to let others see their raw data.
And that has really been transformational, I think.
Don't you?
Yes, that's very important.
It's easier to check for errors.
It's easier to check for fraud if one is so inclined.
And it's easier to even just look for robustness, like the idea that, oh, yeah, you get it with this particular model, but let me try something else the way I usually analyze the data.
Do I also get it that way?
So that's become much more common.
I wouldn't take too much credit for that.
The internet is probably a big source of why it's just easier to upload and share than it used to be.
I'm going to say something controversial now.
I have the sense that part of the reason that researchers 10 or 15 years ago were behaving so unscientifically, and still researchers are pretty unscientific, is that at some fundamental level, nobody really cares whether the results are true or not.
I get the sense that most social scientists see academic publishing as a game of sorts.
The goal is to get publications and citations and tenure.
And there's an enormous number of academic papers written each year and a nearly infinite supply of academic journals.
So in the end, very low quality stuff ends up getting published somewhere.
And except for a handful of papers, there's little or no impact on the broader world that comes out of this research.
So it just isn't so important whether the results are right.
But when I've bounced that hypothesis off of other academics, they used to get really mad at me when I said that, although I do believe there's a lot of truth to it.
What do you think?
I think there's truth to it.
There's definitely people who don't care whether it's true or not, because what they're doing is, maybe game is too strong a term, but their job is to publish papers.
Their job is not to find new truths that other people can work with.
At the same time, in our blog we've done 130 or so posts, and at least some of them focus on criticisms of papers.
And we have a policy where we engage with the authors before we make anything public.
We send them a draft and we ask them for feedback.
Nobody likes it.
Nobody wants to hear from you.
That's a disaster.
So excited to learn what the shortcomings of this paper were.
Like nobody's in that mood.
But beyond that, they seem to really care.
Now, it could be, they just don't want to be shown to have been wrong.
And there's some truth to that.
But they do seem to really care about the phenomenon.
I agree with you that there's a lot of people that don't care.
But I think the higher up you go on the ladder of more influential researchers in more influential journals, I think they do care.
In fact, if anything, I think they have an inflated sense of how important their work is, not the opposite.
They think their work is really changing the world.
They don't think of it as a game.
They think their life is so important because they're really changing things.
And part of my sort of motivation is I don't agree with that.
I agree with you in the sense that I think most research is insufficiently valuable.
Even most top research is insufficiently valuable.
And maybe this is too naive of me, but I think if you make it harder to publish silly things and to publish like sexy findings, the only hope then is to study something that's important.
Even if it's not intrinsically interesting, it's going to be more important.
And so to move social science to be more of a force for greater welfare in society, I don't think we're there.
I don't think social science is all that useful at the moment, but I do think it has the potential.
So we've been talking so far about how standard methodological practices can lead readers to falsely conclude that a hypothesis is true, because the way things are presented is misleading.
Okay, that only gets you so far in the abstract.
What we really need is an actual tool that one can apply to a specific body of research to reliably judge whether the findings are credible.
And damn, you and Joe and Leif, you came up with that too.
It's called the p-curve.
And it is simultaneously, again, incredibly brilliant and incredibly simple.
Can you give the intuition that underlies the p-curve?
Yeah.
So what it looks like is any one study is going to have a p-value.
It's going to tell you how surprising the findings are.
And remember, we're saying if something's 0.05, it's significant.
We're drawing this arbitrary line at 0.05, right?
And so if you see one study ending at 0.04, that's okay.
That's significant.
If you see one study at 0.06, that's not significant.
But the key insight, and it's related to stuff I had done with motivation and goals, is that if you're aiming for a goal, you're going to end pretty close to it.
So, for example, if you're aiming to run a marathon in four hours, you're not going to run it in three and a half hours.
You're going to run it in 3:58, 3:59, because that's your goal.
The moment it's achievable, you stop going.
And so, the basic idea of p-curve is if people are trying multiple things to get to 0.05, they're not going to go all the way to 0.01 or to 0.001.
They're going to stop once they achieve it.
So, you start and your p-value is 0.12.
You know you need a 0.05, and so you try something, you control for father's age, right?
And then that gets you to 0.08 and then you drop the hot potato condition and you end up at 0.04 and then you stop.
You don't keep going because you achieved your goal.
So if I see a bunch of results and they're all 0.04s, then it becomes more likely that you are p-hacking.
A lot of academics across all fields now use this term p-hacking, which is about how you selectively report from all the analyses you did.
If there's a true effect that is significant, you expect very few 0.04s and you expect mostly 0.01s.
If you read the literature, you don't see a lot of 0.01s.
You see a lot of 0.04s and 0.03s, right, which tells you something.
So if you give me 20 studies and I look at the p-values and I see 0.04, 0.03, 0.04, 0.03,
I should not believe those studies.
And if I see 0.01, 0.01, 0.02, I should believe them.
And so p-curve just formalizes that.
So it takes all the p-values, it applies its magic sauce, and based on how many of the results are close to 0.05 compared to far from 0.05, it tells you whether you should believe it, you should not believe it, or you need more data to know.
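As a rough illustration of that shape argument, here is a sketch with an invented effect size and sample size; it is not the actual p-curve software, it just shows where significant p-values land under a real effect versus under nothing at all:

```python
# What the distribution of *significant* p-values looks like under a true effect
# versus no effect. Effect size and n are invented; this is not the p-curve app.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, sims = 50, 10_000

def significant_pvals(effect):
    ps = np.array([stats.ttest_ind(rng.normal(effect, 1, n),
                                   rng.normal(0.0, 1, n)).pvalue
                   for _ in range(sims)])
    return ps[ps < 0.05]

for label, effect in [("true effect (d = 0.5)", 0.5), ("no effect at all", 0.0)]:
    ps = significant_pvals(effect)
    print(f"{label:21s} share of significant results with p < .01: "
          f"{np.mean(ps < 0.01):.0%}, with .04 < p < .05: {np.mean(ps > 0.04):.0%}")
# A real effect piles its significant p-values near .01 (a right-skewed curve).
# A null effect spreads them evenly, and p-hacking pushes even more of them
# toward .05 (a left-skewed curve), which is the pattern p-curve tests for.
```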
Okay, so given this idea of the p-curve, in practice, have you been able to debunk whole literatures by using this concept and show that things that people believed and where lots of papers were published are probably not true at all?
I've only done one of that flavor, and it was controversial.
I think now it's not so controversial.
It was the power posing literature.
It was a very influential TED talk.
The claim was that if you assume a powerful pose, meaning you expand your body, like imagine somebody raising both their arms and standing up, then you become more confident, more successful.
They would take more risk and things like that.
And we applied p-curve to that literature and we found that there was no evidence for it.
Maybe 30 published papers on it.
And if you look at them, they provided no evidence.
So we've been talking so far mostly about mistakes that possibly well-intentioned people make, people doing research who seem to have truth in the back of their mind, even if they're not actually taking the steps that get them to the truth.
But where what you've done gets a lot more fun and exciting is going after complete frauds, researchers who are outright making up data or faking experiments.
So of all these fraud cases that you've been part of, the granddaddy of them all is the Francesca Gino case.
Do you want to tell me about that?
Sure.
A few years ago, maybe five years ago, four years ago, we were approached by young researchers who wanted to remain anonymous with concerns.
I had been in touch with them a few times about fraud.
The first paper I had done detecting fraud was about 14 years ago.
And so I had sort of moved on and I didn't want to do it anymore.
And so this person would approach me and I told them, look, unless the person is very famous or the paper is very influential, I don't want to get involved.
It's very draining.
Okay, wait.
So you had done some fraud research and then you swore off it.
To an outsider, it might seem like, wow, of all the things that might be really fun and exciting as an academic, it would be revealing some horrible fraud who's doing terrible things and ruining the profession.
But you're saying you didn't actually enjoy that kind of work.
No, I hated it.
I hated it because the first two days are fun and the next year are dreadful.
Because the first two days are the discovery process where you're actually in the data, you uncover the patterns, and then the rest is the drudgery of being 99.999% sure because if you're wrong, you are really in trouble.
It's not enough to be right in your head with 100% certainty.
You need to be certain that others will be certain.
And it's also a case where you have an adversary, right?
Mostly when you do academics, you write something and nobody really cares that much.
But when you are saying someone else made up their data, you have created an enemy who will fight you to the death.
But it's also draining because you become like the world expert in an incredibly trivial small piece of information.
Like, as we will talk in a minute, Francesca Gino, we spent a lot of time on her Excel spreadsheets for data.
I'm like the world expert in study three that she ran, you know, 14 years ago.
I know every row, and that's just really useless.
We've talked about like, is research useful or not useful?
It's debatable, but like knowing how a particular spreadsheet was generated, it just feels so local.
You're not learning anything.
It's not fun.
You get a lot of pushback.
You had done some research.
You'd found fraudsters.
And in part because you had a reputation for doing this, people would come to you with hot tips, with the idea that fraud was going on.
So a stranger came to you with the belief that Francesca Gino was cheating on her research.
And this is especially interesting because Francesca Gino is one of your own co-authors in the past, which must have put you in a really interesting and complicated place.
Yep.
So we have a paper together.
I used to be at Wharton.
I'm in Spain now.
And we made her an offer when I was there.
She ended up going to Harvard instead.
So I knew her.
And we did have suspicions maybe 10 years ago.
And we looked at the data that we had access to.
We were subjectively convinced that something was wrong with it.
but we didn't think we could prove it beyond reasonable doubt.
And so we dropped it.
So then this young academic came to you and she had better evidence.
She convinced you that you could actually make a case of it.
Yeah.
It was two of them and they sent me a report they had written.
And I thought, this is promising.
We talk about red flags versus smoking guns.
So a red flag is something where, in your experience, it just doesn't seem right.
That gives you like probable cause, so to speak, but it's not enough to raise an accusation of fraud.
That's a smoking gun.
I think that's where the report was at that stage.
And then we said, can we get evidence that is sufficiently compelling that anybody looking at it would be immediately convinced.
Your Data Colada blog team is you and Joe and Leif.
How many hours do you think the three of you put together on top of what these other two folks had done to try to push it to that stage?
Hundreds.
Yeah.
So big, big investment.
And it's not even so obvious why you're doing it.
It probably did, in the end, further your academic career and bolster your reputation.
But mostly this is a task that isn't rewarded in academics very much.
It's funny, like it definitely helped my policymaking career.
I'm actively engaged trying to change things in social science.
This definitely made it easier to do that.
I don't know that it made it easier publishing my research that is not about fraud because people are happy that whistleblowers exist, but nobody likes whistleblowers.
It doesn't engender like warmth.
You know, I'm happy you exist, but I'm going to talk to somebody else during my lunch.
But for my intentions to try to influence science and to have more credible research, this has been very good.
We've received funding.
We're about to launch a new platform that seeks to prevent fraud instead of detecting it.
And that only was made possible by the attention that this case received.
Okay, so just to foreshadow what's going to happen.
So she is going to be fired from Harvard, her tenure removed, as a result of this evidence you're collecting.
So just to give listeners a flavor, what's the clearest evidence of all the things that you found in her data?
What, to you, said, my God, no way this could happen except through outright fraud?
My favorite is this one. Big picture: people were rating how much they liked an event that they participated in.
One to seven, how much did you like it?
Okay.
And she needed people to like it less in one condition than the other.
And in fact, that's what happened.
And we proposed that the numbers had been changed.
Somebody had said they like it as seven, but in the data they appear as a one.
This was the red flag that they came to us with, the junior researchers, because they were looking at the distribution of liking numbers and they said, look, there's just this whole mass of people who are giving all sevens and they entirely disappear.
And that's true.
It's surprising that you will move a bunch of sevens to ones and to twos.
But what do I know?
This is a weird task.
I don't know those people.
I don't know how people use the scale.
So it was a red flag.
It wasn't a smoking gun.
And so what I told them, I said, look, if the numbers were changed, there may be a trail in the data in the other columns, in the columns that weren't changed.
There should be a mismatch.
And so what I was thinking when I said that was something like gender.
So imagine that in general, women like this thing more than men, but those people that were changed, you don't see that effect for women or something like that.
That's what I was expecting.
But we found like a goldmine, which was there was a column where people had been asked to describe in words what the event was like.
Okay.
And so people used words like, that was great.
I loved that, best time of my life.
Or, I hated it.
I felt yuck because it was a networking event.
I felt disgusting selling myself in this event.
And so the idea was, oh, so maybe if the numbers were changed, there'll be a disconnect between those words describing the very same event and the numbers summarizing the event.
Okay.
And so we looked at those suspicious numbers that we would have expected to be all ones and were all sevens.
Sevens, meaning they hated it.
And you look at the words and the words said, best thing ever, love that.
Okay.
And then you looked at the other side, the people who gave ones and we thought were sevens, and they said, I feel disgusted.
Our hypothesis was those values were changed.
So what we tried to do, and this is why we reached out to Harvard originally, I was like, look, if you go to Qualtrics, which is the platform where these studies are run, so the original data on the server, we told them: if you go to row 100 and you go to this column with the numbers,
you will see that even though the posted data sheet has a one, on the server the number is actually a seven.
Here are the 20 rows you have to check: if we are right, those numbers are sevens; if we are wrong, those numbers are ones.
And we thought they could check it immediately, because they have access to the Qualtrics data and we didn't, and we thought maybe the following day we would know whether we were right or wrong. Because once you identify exactly how the data were modified and you have access to the original data, then you can check whether your hypothesis is correct or incorrect.
In the end, all of the original data, which hadn't been available to you, became available.
And a third-party firm was hired to analyze the original data and to compare it to the altered data.
And this third party confirmed the conjectures you had made.
And they also found other ways that she had altered the data that you hadn't found.
Yeah.
A $25 million lawsuit later, the information was made public.
We realized, yeah, we were right.
You're listening to People I Mostly Admire.
I'm Steve Levitt.
And after this short break, I'll return with Uri Simonsohn to talk about the $25 million lawsuit.
In 2023, after accusing Francesca Gino of fraud, Uri Simonsohn, his Data Colada colleagues, and Harvard University were sued by Gino for $25 million.
I'm curious how Uri felt while this legal threat was hanging over him.
It was hard for a few weeks, I would say.
It was hard in part because funding-wise, like even if you're right, just defending yourself in the American legal system, it's very expensive.
And I heard that your university wasn't willing to pay for your defense, which I find infuriating.
Is that really true?
No, no, they did end up paying for it.
They did.
The most generous one was my own school in Spain, but it was difficult because this is unheard of in Spain.
And it's also August.
In August, nobody's working in Spain.
Like literally the university is closed down.
And I don't know who to call when you get sued.
I have no idea who that person is.
So I found out the name of the person, I emailed them, and they say, is this really important?
And I said, well, yes.
we need to talk like as soon as possible.
But they were great.
They were actually very generous.
So we did what's called a motion to dismiss, which is we tell the judge, this is ridiculous.
And if the judge agrees, it's over.
And that costs about $100,000.
So the university said, we'll pay that.
Now, if the judge disagrees and says, let's take it further, let's go into what's called discovery, where both parties get each other's emails and documents and so on.
That could be another few hundred thousand dollars.
And none of the schools committed to funding us up to that point.
So that was stressful because we could be on the hook for like a million dollars.
It's not like it's up to a million dollars for something you did wrong or you made a mistake or you had an accident.
It's like you need something and you have that liability.
But then they did a GoFundMe for us.
So when the academic community heard about this, the GoFundMe project was started.
It raised hundreds and hundreds of thousands of dollars in almost no time at all.
And that made me feel really good because it's a signal of how valuable the profession thinks you're working.
It must have made you feel really good, right?
Yeah.
It's the only time I've cried for professional reasons.
It was an overwhelming feeling because you feel like it's you against the world and feeling like the community was supportive.
That really was amazing for us.
The judge has thrown out all of her claims against you, although the lawsuit with Harvard is still ongoing.
Now, the thing that's so crazy about this, which is almost like a bad Hollywood movie, is as you're researching fraud by Francesca Gino, you end up stumbling onto, in the same paper but a different part of the paper, apparent fraud by this leading behavioral scientist, Dan Ariely.
How did that come to pass?
We're talking with these younger researchers, and we're looking for smoking guns.
We're looking at this file, and they show us very blunt evidence of fraud.
And so this other study involved car insurance: customers self-reporting to the insurer how much they drove, which influences the premium they pay on their policy.
At the end of the year, you would write down the odometer reading in your car, and they would compute how much you drove and then adjust the policy.
And what this paper was showing is that you can get people to be more honest by having them sign before they enter the odometer reading instead of after doing so.
And so the data was posted and these younger researchers noticed and showed us, look, the distribution of miles driven is perfectly uniform between 0 and 50,000, meaning just as many people drove 1,000 miles as 3,000, 7,000, 50,000, et cetera.
Every single number of miles equally likely, equally frequent.
But there's not a single person who drove 51,000 miles.
Anybody who's looked at data will immediately know that's impossible.
And a lot of people who have not looked at data, who have just driven, will realize that doesn't make any sense, right?
Yeah.
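For readers who want to see that red flag in data form, here is an illustrative sketch with synthetic numbers (not the actual insurance file): a flat histogram capped exactly at 50,000 versus the skewed shape real mileage would be expected to have.

```python
# Synthetic illustration of the tell: fabricated miles drawn uniformly on
# [0, 50_000] versus a more realistic right-skewed distribution with a long tail.
import numpy as np

rng = np.random.default_rng(3)
fabricated = rng.uniform(0, 50_000, 10_000)
realistic = rng.gamma(shape=2.5, scale=5_000, size=10_000)  # assumed shape; mean ~12,500 miles

for name, miles in [("fabricated", fabricated), ("more realistic", realistic)]:
    counts, _ = np.histogram(miles, bins=10, range=(0, 50_000))
    print(f"{name:15s} bin counts: {counts}, max miles: {int(miles.max()):,}")
# The fabricated column has roughly equal counts in every 5,000-mile bin and
# nobody at all above 50,000; realistic mileage bunches at moderate values and
# has a thin tail running past that cap.
```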
And so you probably presumed that Gino had cheated on that too, right?
Because it's her paper.
She was a co-author.
We did.
We did.
That would have been the first smoking gun on the Gino case.
That would have been.
But I said, if my memory doesn't fail me, I said, this feels too clumsy.
It's not like the Gino stuff is brilliantly done, but this feels worse than that.
It just doesn't feel like the other studies we were looking at.
Like I was getting a flavor for like the fingerprint, what her data looks like.
Something funny happens in the extremes of the distribution, but this uniform business is just different.
And so we said, well, let's see who created the file.
And we saw Dan Ariely's name there.
And that was the first time we really ever thought of Dan as possibly being involved in funny business.
So we contacted them and immediately Dan said, no, if anything went wrong, none of the authors would be responsible for it.
Only I will be responsible for it.
So he immediately took ownership of that.
We had a blog post on that that drew a lot of attention.
But then there were no other public data sets.
And so our view at the time, and I don't think I've talked to people about this before, was, okay, who can get to the bottom of it?
And so we thought that in the insurance data, only an investigative reporter could.
Somebody needed to go talk to the insurance company was our thought.
Because they're the ones who had provided the original version of the data, which was later altered to no longer look like the original data.
And you needed that comparison.
That's right.
And we thought only an investigative journalist would get that.
And that was actually true.
A reporter for the New Yorker spent a considerable time and he was able to get the data and he was able to find, in my mind, irrefutable proof that the data were altered after they were sent to Dan.
What I think is really strange and troubling is that the investigations of potentially crooked researchers falls on the institutions that they work for.
And those institutions have such strong incentives not to find them guilty.
In stark contrast to the Gino case, where she's lost her tenure and Harvard's been very public about it, Duke has taken a very different stance.
An investigation was done.
It was done extremely secretively.
Duke hasn't talked about it at all, which is interesting to me because it let Dan Ariely himself be the voice of describing what the outcome of the investigation is.
And he's no longer a named professor, which is some form of punishment, but obviously a much less severe punishment than losing tenure.
But I don't know.
It seems to me like a failure of the institutions to police themselves.
Yeah, so a couple of thoughts on that.
So one is, the most common outcome is a secret resolution.
An agreement where the university says, if you leave, we'll give you this much money, you stay quiet, and we're all happy.
And they just say, we don't comment on labor issues or on the terms of employment decisions at the university.
But it's worth keeping in mind, comparing Dan to Francesca, that it's unlikely that Duke was able to get data of the caliber that Harvard was able to get, just because of the nature of Dan's experiment versus Francesca's.
So I'm convinced, there's no room for doubt that the insurance data is fraudulent.
And I don't know of an alternative explanation that's plausible other than Dan having done it, but it hasn't been proven.
That's just my belief based on all the evidence that's available.
So it's not just because it's a man or a woman or more famous, less famous, or Duke versus Harvard.
It's not matched on the strength of evidence of wrongdoing.
Do you think that these high-profile fraud cases will have or have had a big deterrent effect, scaring off others from cheating?
If I were a cheater, I would be very afraid of you.
But on the other hand, when punishments that get handed out are so uneven, then it really says, well, look, I might get caught, it might be embarrassing, but might not end my career.
So I can do it.
What's your feeling about the deterrent effect that what you're doing is having?
I don't know the facts, but I can tell you, like, rationally, you shouldn't be less likely to commit fraud after this experience, because what you will learn is there's no real punishment.
Because if the worst thing that can happen to you is that you're fired, but without fraud you would have been fired anyway, it's still a win-win.
For somebody like that to commit fraud, there's no real disincentive.
Because the worst that can happen is that they don't do it anymore.
And so that's why I think the rest of us have to take action to prevent it, like to not be complicit in making it so easy for them.
You mentioned that you'd receive funding for a platform that prevents fraud.
Could you tell me more about that?
So it's called AsCollected.
It's a spin-off, so to speak, of our website for pre-registration, which is called AsPredicted.
And it's some version of a data receipt.
So if you go to a conference and you buy lunch somewhere and you want to be reimbursed, in most cases, you don't just tell the university, I spent $7.
You need to show them a receipt.
But then if you tell them I collected data, they don't ask you for a receipt.
And so the idea is to provide a written record of where the results come from.
And that's a combination of how the data were obtained and how the data were analyzed.
So the first question would be, is your data public or private?
You would say it's private.
And it would ask, can you name the company that gave you the data?
You say yes or no.
If you say no, it asks you, do you have a contract that prevents you from disclosing who they are?
And if you say yes, it asks you who in your institution signed the contract.
And then it asks you, how did you receive the data?
And you say something like, I received an email on such-and-such date with the spreadsheet that I analyzed.
You indicate who received the data, who cleaned it, and who analyzed it.
And the final output is two tables, a table with the when, what, and how, and a table with the who.
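To make the idea of a data receipt concrete, here is a hypothetical sketch of what such a record could contain, using only the fields Uri lists in this conversation; the platform's real schema isn't shown in the episode, and the names and date below are illustrative.

```python
# Hypothetical "data receipt" record, with fields taken from Uri's description.
# The real platform's schema is not given in the episode; this is just a sketch.
data_receipt = {
    "when_what_how": {
        "data_public": False,
        "provider_named": False,
        "nda_in_place": True,
        "nda_signed_by": "Office of Research Contracts (institutional signer)",  # made up
        "received_via": "email attachment (spreadsheet), 2024-03-15",            # made up
    },
    "who": {
        "received_data": "author A",
        "cleaned_data": "research assistant B",
        "analyzed_data": "author A",
    },
}
# The deliverable is a unique URL pointing to these two tables, which a journal,
# dean, or funder can simply require at submission.
print(data_receipt)
```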
We have experienced about 15 different cases of fraud.
All of these cases would have been so much harder to pull off if you had to answer the simple questions, because now you have nowhere to hide.
So the deliverable is a URL.
You have a unique URL that has those two tables.
And the idea is that journals, hopefully, will just require you at submission to include that URL.
We think our customer here is deans, journals, granting agencies.
They want to take care of fraud, but they want somebody else to take care of fraud.
And so we're telling them, look, all you have to do is ask for the URL, and you've done your part.
So the forces that are pushing people to cheat, either in small ways or in really outright ways, are related to the really strong incentives that exist within academics and the high hurdle for tenure.
Do you think that the academic tenure system is broken, or do you think it's just a pretty good system that has strong incentives?
And strong incentives have a lot of benefits, which is that they make people work hard and try to be creative, and costs, which are that in extreme cases, people respond to incentives too strongly and in the wrong ways.
Many people blame the incentives for fraud and for p-hacking and for all these processes people take that lead to bad-quality outcomes.
I don't so much blame that.
I think it makes sense for us to prefer researchers who publish over those who don't, those who come out with interesting results over those who don't.
That's a bit of a minority view.
I'm okay with rewarding the person who's successful over the one who's not.
But the part of the incentives that I think is broken is the lack of a penalty for low-quality inputs.
And part of the reason for that is that it's so hard for the reviewers to really evaluate the work.
One way to think about the whole movement towards transparency is to make it easier to align the incentives.
So given that we reward good outcomes, it's very important to make sure the inputs are legit.
And the only way for people who are just doing this voluntarily to do that is that it needs to be easy for them.
It needs to be easy for them to know if there was p-hacking.
It needs to be easy for them to know if you made a mistake.
And that requires transparency.
Despite losing tenure from Harvard, Francesca Gino maintains her innocence.
After Duke conducted its investigation into Dan Ariely, Ariely wrote a response that Duke approved.
In it, he said that, quote, Duke's administration looked thoroughly at my work and found no evidence to support claims that I falsified the data or knowingly used falsified data in general.
I've been a longtime advocate for making data analysis a core topic in K-12 education.
My goal isn't to turn everyone into a data scientist.
It's to equip the next generation to be thoughtful consumers of data analysis.
Uri Simonsohn is providing an incredibly valuable service, debunking individual studies and developing strategies and tools for rooting out bad data analysis.
But there's only one Uri.
We would need thousands of Uris to keep up with the flow of misleading studies.
Everyone needs to be able to protect themselves by knowing enough about data analysis to be an informed consumer and citizen.
Meanwhile, if you'd like to hear more about the problem of fraud in academic research and the steps that some people are taking to fight it, check out episodes 572 and 573 of Freakonomics Radio, which you can find in your podcast app.
This is the point in the show where I welcome on my producer, Morgan, to tackle a listener question.
Hi, Steve.
So in our last new episode, we had an interview with climate scientist Kate Marvel.
And at the end of the episode, you polled our listeners.
You wanted to know whether people were optimistic or pessimistic about our future climate 50 years in the future.
So 50 years from today.
You wanted to know if A, they were optimistic or pessimistic, B, their age, and C, their country.
And you have tallied up the responses.
I have.
So as usual, we got a very enthusiastic response from our listeners.
So let me start with the most basic question, Morgan.
What share of respondents would you say were optimistic?
I'm going to go against my gut, which is never a good idea, but I'm going to say 65% were optimistic.
All right.
So the answer was 42.6%,
which when you read the responses, you just realize what a terrible question it was because nobody really knew how to answer it.
And there were a fair amount of wafflers.
I left those out of that calculation.
So about 15% of the people clearly waffled.
They didn't want to take a stance.
But what was interesting, and it really was what prompted the question in the first place, is that the kinds of logic and arguments and data that people sent were pretty similar of the optimistic and the pessimistic.
It's just a really hard forecasting problem.
And I think for a lot of people, to try to make it this black and white comparison between pessimistic and optimistic was just a really hard challenge.
So do you mean that people who are optimistic or pessimistic were pointing to the same information and then just coming away with different opinions about it?
Yeah, I would say the responses were remarkably thoughtful.
And the people who were pessimistic gave really good arguments about why they should be pessimistic.
And I think they were the kind of arguments that optimists wouldn't disagree with.
And the same with the optimists.
At some basic level, it's probably just not that clear whether you should be optimistic or pessimistic.
Okay, so that's the one piece of data that we collected that is really legitimate.
Now, what we're going to do next is we're going to do data analysis the way psychologists did it 20 years ago.
It's exactly Yuri Simonson's point in that paper about "When I'm Sixty-Four."
Now, I built in lots of degrees of freedom in my survey because I know how old people are, I know what country they're from, I can deduce their gender based on their name.
But then there are also a lot of subtle dimensions like did they respond within the first 24 or 48 hours?
Did they respond in the morning or the nighttime?
So in this kind of setting, you have almost infinite possibilities to try to create something interesting when there's nothing interesting.
And I really want to highlight that because if you're a passive listener, even after this episode that just emphasized how people with data can try to trick you, I think there's a good chance I could have tricked you by talking about what we're going to do next, like it's science, when really there's nothing scientific about it at all.
It's just a way to try to have fun with data.
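To make that point concrete, here is a minimal Python sketch of what "almost infinite possibilities" does to pure noise. Everything in it is an assumption for illustration: the sample size, the covariates, and the proportions are made up rather than taken from the show's actual survey data.

```python
# Hypothetical illustration: scan several arbitrary subgroup splits of pure
# noise and see how easily a nominally "significant" result shows up.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 500                                    # made-up number of respondents
optimistic = rng.random(n) < 0.43          # pure noise, no real structure

# Arbitrary ways to slice the sample (all invented for this sketch).
cuts = {
    "younger_than_median": rng.random(n) < 0.5,
    "responded_first_24h": rng.random(n) < 0.6,
    "responded_morning":   rng.random(n) < 0.5,
    "inferred_male":       rng.random(n) < 0.84,
    "from_us":             rng.random(n) < 0.49,
}

smallest_p = 1.0
for name, in_group in cuts.items():
    table = np.array([
        [np.sum(optimistic & in_group),  np.sum(optimistic & ~in_group)],
        [np.sum(~optimistic & in_group), np.sum(~optimistic & ~in_group)],
    ])
    _, p, _, _ = chi2_contingency(table)
    print(f"{name:22s} p = {p:.3f}")
    smallest_p = min(smallest_p, p)

# With 5 roughly independent cuts, the chance that the smallest p-value dips
# below 0.05 is already about 1 - 0.95**5, roughly 23%, even though nothing
# is going on in the data.
print("smallest p found:", round(smallest_p, 3))
```

Run it with different seeds and the "interesting" cut changes from run to run, which is exactly the problem with pulling every lever after the fact.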
Okay, so what was the first lever?
The first lever is age.
And actually, as I suspected when I did the survey, the data about the demographics to me turned out to be more interesting than the answers about optimism and pessimism.
So what do you think the median age was among our respondents?
46.
Not bad, 42.
I was expecting younger.
Okay, so then let's tackle the question.
If you divide our sample into the people who are younger than the median age of 42 versus older than 42, which group do you think came back as more optimistic about the future of the climate?
I think the younger people were more optimistic and older were more pessimistic.
So, that is true.
47% of the younger people were optimistic versus 39% of the older people.
Now, that is not statistically significant.
None of the things I'm about to tell you related to pessimism or optimism turns out to be significant.
This was actually one of those rare cases where even when I tried to cut the data in a bunch of different ways, I could not find a single one that was statistically significant.
So there was really a whole lot of nothing going on in this data.
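A quick back-of-envelope version of that age comparison, with the caveat that the group sizes below are assumptions, since the actual counts weren't given on air:

```python
# Hypothetical check of why 47% vs. 39% optimism is easy to get by chance
# at survey-sized samples. Group sizes are assumed, not the show's real counts.
from scipy.stats import fisher_exact

n_young, n_old = 150, 150        # assumed group sizes
young_optimists = 70             # roughly 47% of 150
old_optimists = 58               # roughly 39% of 150

table = [
    [young_optimists, n_young - young_optimists],
    [old_optimists, n_old - old_optimists],
]
_, p_value = fisher_exact(table)
print(f"p = {p_value:.2f}")      # comfortably above 0.05 at this sample size
```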
So I couldn't even do gender because, as usual, we have this incredible gender skew in the data.
So this time, 84% of the respondents were men, based on my analysis of their names.
Given that, it turns out men were slightly more optimistic, but again, not statistically significant at all.
Okay, so geography is the last one I want to talk about.
What share of respondents do you think are from the United States?
75%.
Yeah, so I would have expected two-thirds because two-thirds of our downloads are in the United States, but only 49% of the respondents were American.
In particular, the thing that was completely and totally crazy is this.
Canada represents about 7.5% of our downloads, and over 20% of our responses were from Canada, which is just really interesting.
Just to put it in perspective, 40% of the women who responded were Canadian.
Wow.
If it weren't for the Canadian women, we would have hardly had any women at all.
Now, the Canadians didn't break as either particularly pessimistic or optimistic.
It was just really interesting that they were engaged.
The Australians were the same thing.
The Australians were about three times as likely to respond as their share of downloads would predict.
That's not surprising.
We have a very active Australian listener base.
Yeah, that's absolutely true.
So the only things for which I found statistical significance in the entire data set were that the Canadians and the Australians were very fervent responders.
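For contrast, here is a sketch of the kind of test where the Canadian over-response would clear the significance bar. The total number of responses below is an assumption, not the show's actual figure; the 20% and 7.5% shares come from the discussion above.

```python
# Hypothetical test: is a >20% Canadian share of responses surprising if
# Canadians are only ~7.5% of downloads? (Requires SciPy >= 1.7 for binomtest.)
from scipy.stats import binomtest

n_responses = 300                 # assumed total number of respondents
canadian_responses = 60           # ~20% of the assumed total

result = binomtest(canadian_responses, n_responses, p=0.075, alternative="greater")
print(f"p = {result.pvalue:.1e}") # vanishingly small, unlike the optimism splits
```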
Was there another lever you pulled?
Well, so I did what Yuri talked about, which is I looked at all of the crosstabs.
Okay, what about foreign women or old Americans?
And none of those showed anything at all.
Honestly, I kind of ran out of steam after a while trying to look at all of the different levers because the stakes were low.
If I had actually done this as an experiment, invested lots of time, and been a psychologist 20 years ago, I probably would have put a lot more effort into cutting the data in all the different ways, because an important publication would have been on the line.
Whereas here, I'm just trying to fill a couple minutes on the podcast.
Listeners, if you have a question for us, if you have a problem that could use an economic solution, if you have a guest idea for us, our email is pima at freakonomics.com.
That's P-I-M-A at freakonomics.com.
We read every email that's sent, and we look forward to reading yours.
In two weeks, we're back with a brand new episode featuring Nobel Prize-winning astrophysicist Adam Riess.
His research is challenging the most basic ideas we have about the universe.
As always, thanks for listening and we'll see you back soon.
People I Mostly Admire is part of the Freakonomics Radio Network, which also includes Freakonomics Radio and the Economics of Everyday Things.
All our shows are produced by Stitcher and Renbud Radio.
This episode was produced by Morgan Levy and mixed by Jasmine Klinger.
We had research assistance from Daniel Moritz-Rabson.
Our theme music was composed by Luis Guerra.
We can be reached at pima at freakonomics.com.
That's P-I-M-A at freakonomics.com.
Thanks for listening.
It's funny, I forget how old I am.
The Freakonomics Radio Network, the hidden side of everything.
Stitcher.
Moms and dads, do you wish you could know where your kids' shoes are at all times?
Now you can with Skechers' newest Apple AirTag-compatible sneakers, Find My Skechers.
There's a clever hidden AirTag compartment under the shoe's insole.
It's sleek, secure, and your child can't feel or see it.
Then you can check where your kids' shoes are on the Find My app.
Plus, they're available for boys and girls.
Get Find My Skechers at skechers.com, a Skechers store near you, or wherever kids' shoes are sold.
Apple AirTags sold separately.
Comcast is committed to bringing access to the internet to all Americans, including rural communities across the country, like Sussex County, Delaware.
We were being left behind.
Everybody around us seemed to have internet, but we did not.
High-speed internet is one of those good things that we needed to help us move our farming, our small businesses, our recreation forward.
Learn more about how we're bringing our next-generation network to more people across the country at ComcastCorporation.com slash investment in America.
You know dinner dread?
It's that anxious feeling you get trying to decide what to eat or make for dinner.
Cooking a big meal sounds overwhelming, but you're too hungry to wait for delivery.
Remember, when the clock strikes dinner, think Stouffer's.
From lasagna to mac and cheese to crispy air fryer entrees, Stouffer's has your favorite meal ready when you need it.
And each frozen meal is made with real ingredients and packed with flavor.
Solve dinner dread with Stouffer's.
Shop now at your nearest retailer today.