Google Is Exposing Peoples’ ChatGPT Secrets

46m
We start this week with Joseph’s story about nearly 100,000 ChatGPT conversations being indexed by Google. There’s some sensitive stuff in there. After the break, Emanuel tells us about Wikipedia’s new way of dealing with AI slop. In the subscribers-only section, Sam explains how we got to where we are with Steam and Itch.io; that history goes way back.

YouTube version: https://youtu.be/mQJvOTHu61I

Nearly 100,000 ChatGPT Conversations Were Searchable on Google

Wikipedia Editors Adopt ‘Speedy Deletion’ Policy for AI Slop Articles

The Anti-Porn Crusade That Censored Steam and Itch.io Started 30 Years Ago

Subscribe at 404media.co for bonus content.

Learn more about your ad choices. Visit megaphone.fm/adchoices

Listen and follow along

Transcript

YubiKeys are the original passkeys: small, sturdy, and easy-to-use physical security keys that prevent phishing attacks and account takeovers.

YubiKeys are manufactured by Yubico, a company with headquarters and manufacturing centers in both Sweden and the United States.

Unlike basic multi-factor authentication methods such as SMS, one-time passcodes, or mobile authenticator apps, YubiKeys provide modern MFA and are a proven security solution that cannot be hacked or bypassed by malicious actors, stopping AI-powered cyber attacks, online identity scams, fraud, and account takeovers.

YubiKeys help businesses of all sizes, from large banks and tech companies to critical manufacturers, energy concerns, and government agencies, stay ahead of evolving cyber threats and regulatory requirements.

They also protect individuals and everyday users by securing email, banking, and social media accounts, password managers, productivity tools, developer tools, and more.

For more information on how YubiKeys secure applications, services, and accounts for both individuals and businesses, visit yubico.com slash 404media.

And for a limited time, get $10 off your order of exactly two keys from the YubiKey 5 series or security key series using the code 404media10 at checkout.

That's yubico.com slash 404media, and use code 404media10 at checkout.


Hello, and welcome to the 404 Media podcast, where we bring you unparalleled access to hidden worlds, both online and IRL.

404 Media is a journalist-founded company and needs your support. To subscribe, go to 404media.co, where you'll also get bonus content every single week.

Subscribers also get access to additional episodes where we respond to their best comments.

Gain access to that content at 404media.co.

I'm your host, Joseph, and with me are 404 Media co-founders Sam Cole.

Hello.

And Emanuel Maiberg.

Hello.

So first of all, we had our party in Los Angeles,

party slash live podcast recording, the audio and the video, I guess, because we did it on YouTube as well.

We will put that into your feeds soon.

Sam, what did you make of the party?

Did you have a good time?

Yeah, it was awesome.

So many people came.

We packed the house.

It was a little bit crazy at times how many people were in there.

There were concerns about

code and capacity, but it worked out.

It was great.

Yeah, we did a little test live stream.

We'd never done live streams before.

So Jason and his friend Raul

came through and

helped us set that up.

There was a really good, if you go to the YouTube and you go to the live section, there's a really great conversation that happens just organically while we're setting up the live stream with the folks at Rip Space, which is the hackerspace where we had the event.

They just like sat down with Dexter Thomas, who does the Kill Switch podcast and was also helping us do this.

They just sat down with him on the couch and started streaming and did like an impromptu panel.

It was awesome.

It was very sick.

So that's the first like hour of the

YouTube live.

And then

you'll see us come back on

and

do our panel, where you and Jason talked about your reporting on ICE and surveillance and Flock and all that good stuff.

I was in bed watching the live stream, like the Wolverine meme.

Yeah, I assumed you were trolling us in the chat.

You were probably a non in the chat.

But yeah, it was, it was really good.

What did you think, Joe?

You were there too.

Yeah, I really, really enjoyed it.

And mentioning the live stream, it was just a very, very low stakes.

Hey, we're testing out this tech.

We've never done it before.

I think I literally posted that to Bluesky, like, we're testing out this technology.

Join if you want.

And a fair few people jumped into the YouTube live chat and were really, really supportive.

So it shows, oh, actually, we can do this sort of thing.

And we will do that

in the future.

As Sam said, if you want to see the panel discussion, you go to the YouTube link and all that sort of stuff.

It's just on our YouTube channel.

If you don't want to fuss with any of that, as I said, don't worry.

We will

get just the stuff

from the second half, like the live podcast, from the video and the audio, and we'll put that into the feeds as normal.

So just keep an eye out for that at some point if you don't want to faff with all of the timestamps and everything.

But speaking of events as well, we did that.

We did that live podcast recording.

Now we might be doing another one.

Well, we're definitely doing another event.

We'll see exactly what we do there.

But Sam, do you just want to tell people about this New York party that we have on the horizon?

You just want to tell them about that.

Yeah, I got back from LA on Monday and then realized that it's August and our second anniversary is on August 22nd.

So we're having a

second annual anniversary party in Brooklyn on the 21st.

It's going to be at Farm 1, which is a vertical farm slash microbrewery in Brooklyn.

We're working on getting the ticket sales page and all that good stuff set up.

We're hoping to maybe do some kind of similar panel and

maybe even a live stream and we can figure it out.

But TBD on the details, just watch for

that ticket information and get yours when it comes out.

Last year's anniversary party was also a huge hit.

So, and those tickets sold out really fast.

So, yeah, not to leave people on a cliffhanger on that, but

it's coming soon.

Yeah, keep an eye out in the weekly newsletter.

We definitely announce stuff there.

And of course, the next episode of this podcast, there'll be details of where and how to get tickets for subscribers and non-subscribers.

Okay, with that housekeeping out of the way, Sam, do you want to take the lead on this story I wrote?

And you can grill me about it because you edited it.

Yeah, I thought the story was super interesting.

And as with many ChatGPT leak slash exposure type stories, it is horrifying.

And it blows my mind that people put private information into ChatGPT every day.

So the headline is, nearly 100,000 ChatGPT conversations were searchable on Google.

So this

saga begins before today.

Do you want to kind of walk us through where the story started?

Yeah, it begins while we were doing the LA party or something around there.

So I kind of missed this when it broke.

But there was a Fast Company story last week,

July 30th, I think.

And Fast Company said it found that people were, it seems inadvertently, exposing

the contents of some of their conversations with ChatGPT.

And the way that works is that ordinarily, when you're speaking to ChatGPT, those conversations are basically private.

I say basically because, of course, OpenAI can go in and probably review stuff with that caveat, but it's not like you can just stumble across the contents of someone's conversation.

You're logged into the service, right?

And

to view any of those conversations, you'll also need to be logged in.

But there's this interesting feature built into ChatGPT that allows people to share their conversations.

So they scroll down on the page or whatever, they select, I would like to share this, and ChatGPT gives them a little warning saying that, well, you know, this is going to be accessible by more people, probably.

Of course, that's what people want to do because then they share a link to the ChatGPT conversation

and they provide that to somebody.

Now they don't need to log in and they can just read the content.

The problem is Google sees that because of course it is indexing the public web.

And what this feature has done is essentially create a public web page of that conversation.

So the result that Fast Company found was that, oh, a bunch of people, it seems, may not realize they are basically exposing their ChatGPT conversations, not just to the person they shared the link with, but to the wider internet, because OpenAI hasn't configured it in a way where, hey, Google, please don't scrape this web page.
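(For reference: the mechanism being alluded to here is a standard robots directive. A site can ask crawlers not to index a page with a meta tag in the page's HTML, or an equivalent HTTP response header. This is an illustrative sketch only, not OpenAI's actual markup.)

```html
<!-- Illustrative sketch: a page can opt out of search indexing
     by including this tag in its <head> -->
<meta name="robots" content="noindex">

<!-- Or, equivalently, the server can send the HTTP response header:
     X-Robots-Tag: noindex -->
```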

Yeah.

I mean, ChatGPT being

private by default is more than like Venmo used to do for

memos, or even what Meta used to do with the chatbots that we found were just blasting conversations out into the open.

So there's that, but like

it being so easy to find these cache links was really interesting to me.

So people being

people on the internet, what did they do with this revelation from the Fast Company article?

Yeah.

So people obviously started going to find these indexed web pages, these ChatGPT web pages, on Google.

One example was the OSINT trainer and practitioner Henk van Ess,

they ironically used Claude, they say, to generate some Google Dorks

to figure out, well, how could I best search this for sensitive information?

So they do that.

And then the Google Dork returns, and sorry, when I say Google Dork, that just means a very specific Google search.

Like a really basic one would be file type, colon, PDF, then your search term.

and that would only return things which are inside a PDF for your, um, whatever you're looking for, your keywords.

So, when he did this, he was doing searches on ChatGPT pages for phrases like, write my essay, or plagiarism, or my assignment.

Another one is, without getting caught, avoid detection.

Then some like corporate stuff, such as my company, strategy, or revenue, or acquisition, all of that sort of thing.

And then even my SSN.
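(Written out, those searches look something like the sketch below. The `build_dork` helper and the `chatgpt.com/share` path are illustrative assumptions, not details from the episode.)

```python
from urllib.parse import quote_plus

def build_dork(site: str, phrase: str) -> str:
    # Restrict results to one site and require an exact phrase.
    # build_dork is a hypothetical helper, for illustration only.
    return f'site:{site} "{phrase}"'

def search_url(query: str) -> str:
    # Turn the dork into a Google search URL.
    return "https://www.google.com/search?q=" + quote_plus(query)

# Assumed URL pattern for shared conversations, for illustration.
dork = build_dork("chatgpt.com/share", "my SSN")
print(dork)             # site:chatgpt.com/share "my SSN"
print(search_url(dork))
```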

And you can see where he and others are going with this.

And he says that he found pretty sensitive stuff in there, such as confidential financial data about an upcoming settlement, non-public revenue projections, intelligence about companies that may be merging together, and some NDA stuff as well.

I will say I haven't seen any of these search results, so I can't verify or vouch for those specifically.

But yeah, when Fast Company first reported this,

researchers then jump on it and they start finding all of their own stuff as well.

Yeah, so you mentioned some of what was in the chats,

like the plagiarism stuff.

What else is in there?

Was there anything like super sensitive?

Like, what are we talking about?

We're talking about

what was specifically in the chats.

Well, that's the thing.

Henk van Ess doesn't really provide a ton of specifics, but they say that there's sensitive corporate stuff in there. They didn't want to quote it directly, and I think that makes sense, because of course, at the time this was going on, again, we were away, so I'm kind of coming to the story a few days later with our own story, with new information. But this is all out there.

And if you were being too specific, you can inadvertently direct readers to go dig this up themselves.

And if there's anything truly sensitive, you don't really want to do that.

And I could say this later, but I might as well say it now.

OpenAI is dealing with it.

They are

removing certain indexes from Google.

And

OpenAI has removed this opt-in share feature as well because it appears they realize that, oh, people are doing this and they don't fully understand the consequences of what's going on here.

Yeah, because they think that this stuff is private.

So they're just going for it.

I thought that

this is so low stakes out of everything else that could be in ChatGPT messages, but

asking, you said people were asking ChatGPT to write their LinkedIn posts, which I would guess

is not only a lot of LinkedIn at this point, but also a lot of what ChatGPT gets asked to do, on top of term papers and stuff like that.

I don't know.

It's just depressing to me.

Well, yeah, and that leads to our story, which is that when

people were digging through it, they were dealing with maybe sort of hundreds of queries, that sort of thing, and just seeing what they could pull out of the indexed

Google pages.

I then get this tip that a researcher who I granted anonymity to had found many, many more.

They had found nearly 100,000 ChatGPT conversations, whereas earlier people had just probed, I think, 500, something like that.

They

scraped all of those pages.

They also scraped...

So not just the Google results, but the actual content of the ChatGPT messages themselves as well, and then gave me access to it.

And yes, that's when

I log in and I start probing around to see what is interesting and sensitive, and that could come up.

And it really goes from the sensitive and the delicate to the really benign, as you say, with like the LinkedIn stuff.

I think to find that, I literally typed in, write my LinkedIn, my LinkedIn, or something like that.

And so many results come up, like, write me a LinkedIn post in this vibe to touch on these points or whatever.

There was one in there that, again, I didn't, I didn't want to quote particularly directly, but the vibe is that clearly somebody, it appears a man, is thinking about their

ex-girlfriend and he's asking ChatGPT,

why is she not looking at my stories?

And like, clearly, having some emotional distress over this relationship.

And ChatGPT is walking them through about how you shouldn't break up, you should talk to your current girlfriend about these feelings, like stuff you really

don't want on the internet, you know?

I would also probably say you shouldn't talk to ChatGPT about this stuff either, ideally. I don't know how good the advice is, but if people get closure or benefit out of that, hey, sure, go wild. But you don't want that being on Google, right? You don't want people to be able to come across it and then dig through it or anything else. And

it looks like many of the chat logs were generated by people anonymously.

Like it said, it doesn't have a username for all of them.

That being said, I mean, there are clearly names in there.

There was one I saw where someone had obviously built a rapport with this chatbot and sort of their version of ChatGPT, because of course you build a history and a dialogue with these tools.

And I can't remember the person's name, but he said something like, yo, yo, I'm back.

It's uncle, name.

And then ChatGPT greeted them once again.

Maybe I should have quoted that one a bit more directly, because it didn't seem to be particularly sensitive. But yeah, there's stuff in there that you do not want to be online. And I also saw corporate stuff. I saw someone uploading what they said was a copy of OpenAI's own non-disclosure agreement for visitors to the company's headquarters. I emailed OpenAI asking, is this really your NDA? And they didn't get back to me. Somebody thought it was, though, and they were pulling it into ChatGPT. And now, ironically, that has been posted online because of ChatGPT's privacy settings.

It's so crazy.

It's so, I mean, it's like, I think that's a big benefit a lot of people find with ChatGPT is that you can be totally cringe and like open with this thing in a way that you might not want to be publicly.

So it's not surprising that this stuff is contained in these chats, but

oh man, it is.

It's, I don't know, it's scary, it's funny, and that I think is usually our wheelhouse.

It's like this intersection of like, I'm laughing because I'm terrified of the future.

So

knowing all of this,

knowing that this was kind of almost like a built-in feature that ended up leaking these chats, what do you take away as far as like the privacy lessons?

Like, are there any,

how can people avoid something like this happening in the future?

What should companies do?

Like, what do you kind of see as the lesson here?

Yeah,

it feels kind of similar to the early days of app development on smartphones and all of that sort of thing, which is not to say that ChatGPT is like a super basic app or anything like that.

I'm sure they have pretty damn good security over there, or I would hope so, considering how much money they have.

And I know some of the people they have hired are very, very competent.

It's much more on a user level, in the same sort of way where you grant location data permissions to an app, and you may not fully understand what's going on there because it's very, very opaque to you.

And then, lo and behold, your location data is now being sold to a company that then sells it to DHS or something, which is not going to be a concern for a lot of people, but it definitely could for some others as well.

It's similar to that, where as a consumer or a user of this tool, there are these second or third-degree events that you may not necessarily understand.

And it even just reminds me of people sharing like a Google Drive or a Google Doc link without fully understanding anybody who clicks on this link is going to be able to read

your terrible article draft.

I mean, that would be in my case, or anything else, right?

Or your calendar can sometimes be accessible.

And when people are using these tools,

they may be very focused on, well, I'm not going to put sensitive information into the chat dialogue.

I'm not going to give my banking information, although it seems that some people did that as well.

And they think that's sort of the be-all and end-all of privacy when it comes to these tools.

But there's like an app development sort of issue as well, where the user can really mess up if they do that without fully understanding.

But it's definitely not all

on the individual consumer because OpenAI, it seems, didn't communicate this fully and didn't take steps to stop Google scraping it.

You know, a Google Drive link is not typically going to appear in Google search results unless it's been pasted on the web page or something like that.

So kind of everybody's at fault here.

And I think as more people use AI for more and more sensitive stuff, you've just got to be really, really careful about how that might trickle out if you're not entirely sure what you're doing, you know?

Yeah, at this point, I treat everything, almost everything that I type onto a screen as

potentially leakable, or I think about it a lot more than I used to.

And I used to think about it a ton.

So I'm thinking about it all the time now.

But like, if you're typing something out, it could go anywhere.

If you're typing it into a platform that you don't own,

that doesn't have like deleting messages or something like that, it could go somewhere that you don't want.

Screenshots exist.

You know, it's

people are just, I don't know, people are putting wild things into these tools that they ultimately have no control over, which is wild to me.

Yeah, if you don't have any closing thoughts,

I will play us to the next one.

What do you usually say?

I'll play us out in the next story.

Usually, what I do is,

I mean, I will do it now, but I do a little preview of the next one and then we go to it.

But I will say, as a closing thought,

there are more exposures here.

I don't really want to say right now because we're pretty busy.

So maybe we don't get this story out in time.

But

there's other stuff going on here.

So

we probably won't talk about it in the next episode because we try not to repeat things too much.

But there might be another article coming.

I'll say that.

And we'll leave that there.

And we'll leave that there.

That's what I was trying to think of.

Sorry, I didn't write that down.

I should put that in the Google Doc.

And we'll leave that there.

And when we come back, we're going to talk about one of Emanuel's stories about a policy change on Wikipedia that might, you know, protect the site and the platform from AI slop.

We'll be right back after

this.

Why drop a fortune on basics when you don't have to?

Quince has the good stuff.

High quality fabrics, classic fits, and lightweight layers for warm weather, all at prices that make sense.

I'm always on the lookout for new basics companies, and everything I've ordered from Quince has been nothing but solid.

Quince has closet staples you'll want to reach for over and over, like cozy cashmere and cotton sweaters from just $50,

breathable flow-knit polos, and comfortable lightweight pants that somehow work for both weekend hangs and dressed up dinners.

The best part?

Everything with Quince is half the cost of similar brands.

By working directly with top artisans and cutting out the middlemen, Quince gives you luxury pieces without the markup.

And Quince only works with factories that use safe, ethical, and responsible manufacturing practices and premium fabrics and finishes.

In the last few weeks, I picked up a 100% European linen shirt that has entered my regular rotation.

That's this right here.

And I've also got a few 100% cotton tees, which have gotten compliments for how well they fit.

And they come in a lot of varieties, which is good because I really like a sturdy, thick look, and Quince has those as well.

We also picked up a basket weave quilt that has quickly become our favorite blanket in the house.

Keep it classic and cool with long-lasting staples from Quince.

Go to quince.com slash 404 Media for free shipping on your order and 365 day returns.

That's quince.com slash 404media to get free shipping and 365-day returns.

Quince.com slash 404 media.

One of the scariest parts about building 404 media was figuring out the logistics of, well, how to do business.

There's a handful of tools that make 404 media run, but it's been a real pleasure to use Shopify, which has given our company a footprint in the real world with our merch store.

Without Shopify, I don't know how we would have done it.

So if you're thinking of starting a business, start with Shopify.

Shopify is the commerce platform behind millions of businesses around the world and 10% of all e-commerce in the U.S.

From household names like Mattel and Gymshark to brands that want to be household names like 404 Media.

Shopify has got you from the get-go with beautiful, ready-to-go templates to match your brand style.

Their easy-to-use backend helps you manage your store's inventory and makes creating an attractive shop for your customers really easy.

They also help you find new customers with easy-to-run email and social media campaigns.

And if you get stuck, Shopify is always around to share advice with their award-winning 24-7 customer support.

So turn those dreams into reality and give them the best shot at success with Shopify.

Sign up for your $1 per month trial and start selling today at shopify.com slash media.

Go to shopify.com slash media.

Shopify.com slash media.

This is an ad by BetterHelp.

These days, it feels like there's all kinds of ways to treat your mental health.

Cold plunges, gratitude journals, screen detoxes, but it's hard to know which one of these will work for you and what is just noise or fads on the internet.

One thing that's long tested and long trusted is talking to live therapists to help get you personalized recommendations and help you break through the noise.

As a therapist gets to know you, they can provide personalized suggestions for positive coping skills, stress reduction techniques, and strategies that will help you become the best version of yourself.

BetterHelp is easy to use and easy to plan around.

It has more than 30,000 therapists and has served more than 5 million people globally, meaning you can fit therapy into your busy life.

Join a session with the click of a button and switch therapists at any time.

As the largest online therapy provider in the world, BetterHelp can provide access to mental health professionals with a diverse variety of expertise.

Talk it out with BetterHelp.

Our listeners get 10% off their first month at betterhelp.com slash 404media.

That's BetterHelp, H-E-L-P, dot com slash 404media.

All right, and we are back with Emmanuel's story.

The headline is, Wikipedia Editors Adopt 'Speedy Deletion' Policy for AI Slop Articles.

First of all, Emanuel, what is the AI Wikipedia problem? Does it have a big AI slop problem?

Does it have a big AI slot problem?

We've spoken about it a little bit before, but like, what's the problem there for Wikipedia?

I would say Wikipedia has

a bigger problem than you or any other

average user of the site realizes.

and that is because of the incredible effort

that

the Wikipedia editors, the contributors, the volunteers, the people who maintain Wikipedia

put into the site to kind of protect you from that problem.

So it exists.

Some of it is visible in the sense that sometimes AI generated articles that are wrong, that are filled with hallucinations and fabricated information, do make it to like the live version of the site.

But behind the scenes, the editors who approve articles and discuss articles, they're dealing with a huge flood of AI-generated articles in the same way that

all the platforms that we talk about here every week are dealing with that on Facebook, on Instagram, on Twitter.

YouTube, just people are flooding Wikipedia with AI-generated articles.

You're You're not seeing it because the editors are filtering that stuff out.

Yeah.

Maybe we don't know this, although I feel it came up in conversations before.

What does some of that slop look like?

Obviously, Wikipedia, it is articles about specific subjects.

Is it people trying to fuck with Wikipedia?

Is it people who think they're really smart and they've discovered something through ChatGPT and they're like, I have to now tell the world this on Wikipedia?

Like, do we know what the slop is exactly?

I presume it's varied, but.

Yeah, it's varied.

It's funny you mention it.

I talked to you about this months ago.

I doubt that you remember, but after I wrote this article about a group of Wikipedians that have this initiative to protect the platform from AI-generated content, they put together this

document

showing examples of AI-generated content on Wikipedia.

And one of them is an article about Darul Uloom Deoband,

an Islamic seminary in India.

And it's like,

there's

an old painting showing some of the people in that article.

And it's just AI generated.

And, you know, they have six toes and stuff like that.

There was another article about

like it was a castle in Turkey, I believe.

Long article, thousands of words, completely made up, like just a total fabrication by ChatGPT.

And I got in touch with the guy who made that article.

And I was like, what's up?

Why did you do that?

And I got a very confusing answer.

It sounds like

maybe he's Armenian and there's some sort of attempt to... no, actually, he's Turkish, and he takes issue with the articles on Wikipedia about the Armenian genocide.

And he was kind of trying to show that anyone could get anything up on Wikipedia.

So it was like some sort of attempt to undermine the validity of Wikipedia as a reliable source of information was his reasoning.

But it was kind of nonsensical.

It was somebody fucking with the platform to make some kind of point.

It's like the AI generated article that I saw.

Right.

And it does remind me, now you mention it, of that American who ended up writing half of Scots Wikipedia, like obviously a specific

language, right?

And this teenager did not speak the language, but somehow blagged their way

through filling up massive swaths of Wikipedia, which I don't know.

Obviously, that's a massive time commitment because that didn't involve AI.

And maybe ironically, if you would try to do something like that,

you know, to be a bit of a dick or for your own ego or whatever, like if you used AI, it might actually be harder because maybe you'd be detected by Wikipedia more, you know what I mean?

Rather than the artisanal, hand-crafted trolling.

Yeah, I mean, it's interesting you bring that up because

one of the editors I talked to for this article

brought up this point.

The AI-generated articles are a bigger problem for them

for the same reason it's a problem for moderation on other platforms, which is that the AI content is being generated at the rate of a machine, whereas the verification of the articles happens at the rate of a team of human beings reading every word of the article.

So there's like an inherent imbalance there.

So even though this stuff can be very easy to detect sometimes, there's just so much of it.

that they had to change their policy about how they review articles, which is kind of like the policy at the heart of this article.

Yeah, so what is this new policy exactly?

Because it sounds like they're already pretty well equipped.

What's the new policy?

Yeah, so Wikipedia is run by a community.

Everything is done by open discussion and consensus or voting.

And articles get deleted all the time for a variety of reasons.

Sometimes articles are just plainly planted advertisements for companies or people promoting something.

Sometimes

they're just pure gibberish.

Sometimes

people

will argue about whether something deserves its own article or whether it should be a section of another, bigger article.

And most of those

deletions happen after a seven-day period of discussion.

It's like somebody says, hey, I think this should be deleted because this article is not notable, meaning it should be like either deleted or integrated into another existing article.

And then editors have seven days to kind of discuss this and hopefully reach a consensus or at least like a vote about whether that is true or not.

For some of the more obvious cases, like these gibberish articles that I've mentioned, they have a speedy deletion policy, which means one person flags the article, an editor or an administrator sees it, they confirm it's just like it's nonsense, it's not even English, and then they can just delete it without this like week-long discussion.

And the proposal that was eventually adopted yesterday is a speedy deletion process for certain types of AI articles because there are so many of them now.

So, changing the policy to include another type of article for speedy deletion is a big deal on Wikipedia.

They can't do it for

the majority of AI-generated articles because the majority of AI-generated articles, there's some doubt.

People can argue, like,

oh, there's bullet points and bolded chapter heads and em dashes,

all these signs, right?

I don't think you get those on Wikipedia necessarily, but that sort of thing.

Yeah, like all these signs that are clearly the style of ChatGPT, but people can argue: like, A, is this really AI-generated?

And B, like, I don't know, maybe there's a value to it, you know.

Um, so

those still have to be debated.

But there's another category of AI-generated articles where there's one of two conditions that they have to meet in order to be eligible for one of these speedy deletions that don't require a discussion.

One is if they include what is clearly language directed at the user.

So we have used this method to identify AI generated, LLM generated content many times.

And that's when you're talking to ChatGPT, you say, ChatGPT, tell me about

the history of aviation or something, right?

And it will say something, as per my last knowledge update,

which means like, according to like all the information I've ingested as an LLM last time in like, you know, January 2024, this is the history of aviation.

And that language is clearly an LLM talking to a user that prompted it.

And if it includes that, A,

it means the article is clearly AI generated.

And B, more importantly, it means that the person who submitted the article didn't even bother to read it, right?

Because if the person had read it, they would have noticed that and removed it before submitting.

And that's the worst bit.

That's just rude.

It is.

No, that's actually the policy.

They're like, if the person who submitted the article didn't even read it,

we're not going to sit here and debate it for seven days whether it's worth deleting.

That's just an indication that nobody cares about this article.

It's just white noise.

Let's get rid of it.

The other

condition is if the article includes references that don't actually exist.

So I think Sam reported on this a few times.

A bunch of lawyers have been busted doing this, where they'll submit a complaint and they'll cite, you know, cases like Cox versus Cole, 1999, and you look it up and it never happened. It's not a case that exists.

Classic.

What are you talking about?

Yeah.

Precedent setting case.

So, like, when that happens,

Wikipedia takes sourcing very, very seriously.

I would say it's like one of the primary uses of the site and why it exists.

It's like not just about reading the summary, but being able to follow the sources to justify why stuff is included in the article.

And when you include a fake source, you really violate a basic principle of Wikipedia.

So if those citations are clearly fake, if they don't resolve, or they lead to a non-existent page, a 404 page on a scientific journal or something like that, that is also a reason to just remove the article without discussion.
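To make the dead-citation condition concrete, here is a minimal sketch of that kind of check: given the URLs an article cites, report which ones fail to resolve. This is not Wikipedia's actual tooling; `check_url` and `dead_citations` are hypothetical helpers, and real tooling would also need retries, rate limiting, and archive lookups before declaring a source fake.

```python
import urllib.request
import urllib.error


def check_url(url: str) -> bool:
    """Return True if the URL resolves with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status < 400
    except (urllib.error.URLError, ValueError):
        # URLError covers DNS failures and HTTP errors (including 404);
        # ValueError covers malformed URLs.
        return False


def dead_citations(urls: list[str], resolver=check_url) -> list[str]:
    """Return the cited URLs that do not resolve."""
    return [u for u in urls if not resolver(u)]
```

The `resolver` parameter is there so the logic can be exercised without live network access; in practice you would pass nothing and let `check_url` do real lookups.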

Yeah, that makes sense.

I guess the last thing, well, no, I'll ask two more things.

You touched on this and maybe it's a little bit unclear, but those examples are very obvious.

Going back a little bit to the harder ones where they have to have this longer discussion, any idea how they're going to determine it?

Or is it more case by case and you figure it out?

Because I mean, it's getting harder and harder.

Some are very obvious, but it is getting trickier to figure out if something's AI.

Have they given any specifics on how they might do that?

Or maybe they find out now this policy is out there, you know?

So, for the story, I talked to Ilyas Lebleu.

I hope I'm pronouncing that correctly.

They are one of the founding editors of WikiProject AI Cleanup, which is like this

group of editors that are actively trying to protect Wikipedia from AI-generated content.

And the way they put it to me is that

this is

definitely an improvement and it puts Wikipedia at a better place than it was, but like the AI issue is definitely not resolved.

Like the issue will persist and it's still a big issue.

It is helpful in two ways.

One, there's a category of AI-generated content now which can more easily be deleted.

And that's a big deal.

That just reduces their workload, which is like the main purpose of this policy.

It frees them up to

review all this other content.

The other reason that it is useful is

AI-generated content and

how it is being

used on Wikipedia has been pretty controversial.

I wrote a story a couple of months ago about the Wikimedia Foundation, which is like the nonprofit that

owns

Wikipedia.

They were piloting a feature that would put an AI-generated summary at the top of the article.

And Wikipedia editors basically revolted and got so mad that Wikimedia decided to pull it back.

And this is kind of the first

editor-led policy change that takes a clear anti-AI stance.

And Lebleu, this editor, thinks that it's useful in that sense because it's a black-and-white anti-AI policy, which will signal to Wikimedia and will signal to other editors and will signal to users that they're very vigilant.

They're taking this issue seriously.

They're not going to roll over like some other platforms have for

AI generated content.

Speaking of other platforms, the last thing I wanted to ask is: what can other platforms learn from this?

Wikipedia is obviously a pretty unique thing, but everybody is dealing with

AI slop.

What do you think other platforms dealing with that problem can learn from Wikipedia here?

So I think 10 years ago, five years ago,

we wouldn't cover like policy changes, obscure policy changes

that Wikipedia editors are putting in place.

The reason that I think this is newsworthy, and the reason I've been paying very close attention to how Wikipedia is responding to AI is because, again, it's a community-led

project.

And when the community is at the wheel, we're seeing them actually take a position

that I think, you know, is anti-AI

when compared to, like, Instagram or Meta generally, which is like, we love AI, we make AI products, flood the zone with all the AI-generated content you can possibly imagine.

Like, Lebleu was very measured.

They were like,

it's possible that in the future, generative AI tools will make Wikipedia better.

We already use some AI tools to do certain things.

But to speculate on how it might be useful in the future or not is blinding us to the problems that exist currently.

And the problems that exist currently are that we're getting hallucinated facts and false citations in articles.

So we need to deal with that now.

And I think that's just a good model. It reminds me a lot of our approach to technology reporting.

It's just like, let's talk about what's happening now, what's happening to users.

And

yeah, I just, I think it shows that there's a different way.

You don't have to roll over and just allow it to flood your platform because it's the newest, coolest thing.

You can respect your users, respect their time, and set some sort of boundary and have some like really practical policy.

Like the thing I discussed about

the language that's clearly an LLM talking to a user.

We see that language in digital libraries and scientific journals and

social media posts, all over the place.

It's like you can imagine having a filter in place at all those places that at least flags that content for review or something.
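That imagined filter is simple to sketch: scan submitted text for phrases that only make sense as an LLM addressing the user who prompted it, and flag matches for human review. This is a hypothetical illustration, not any platform's real moderation code, and the phrase list here is an assumption; a production system would use a much larger, regularly updated list.

```python
# Telltale phrases that indicate leftover LLM-to-user language.
# This list is illustrative, not exhaustive.
LLM_TELLTALES = [
    "as of my last knowledge update",
    "as per my last knowledge update",
    "as an ai language model",
    "i hope this helps",
]


def flag_llm_boilerplate(text: str) -> list[str]:
    """Return the telltale phrases found in the text, if any."""
    lowered = text.lower()
    return [phrase for phrase in LLM_TELLTALES if phrase in lowered]


article = ("The history of aviation spans centuries. "
           "As per my last knowledge update in January 2024, ...")
print(flag_llm_boilerplate(article))  # ['as per my last knowledge update']
```

A non-empty result wouldn't have to mean automatic deletion; as the Wikipedia policy shows, it can simply route the content into a fast-track review queue.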

And nobody really does that yet.

So I think to see Wikipedia do it is

encouraging, and hopefully, other people take note and adopt a

similar policy.

Yeah, I think it's really interesting that a relatively small policy change could actually be an indication of how platforms, people, companies, organizations

can

resist

AI if they wish to do so.

I think it's a really interesting model and policy update for that.

Okay.

We will leave that there.

If you're listening to the free version of the podcast, I'll now play us out.

But if you are a paying 404 Media subscriber, we're going to go into a really deep dive into one of Sam's stories about how we got here with Steam and Itch.io and gamer censorship and all of this stuff. It absolutely did not come out of nowhere, so if you want to understand how we got to this point, you can subscribe and gain access to that content at 404media.co.

As a reminder, 404 Media is journalist-founded and supported by subscribers.

If you do wish to subscribe to 404 Media and directly support our work, please go to 404media.co.

You'll get unlimited access to our articles and an ad-free version of this podcast.

You'll also get to listen to the subscribers only section, where we talk about a bonus story each week.

This podcast is made in partnership with Kaleidoscope.

Another way to support us is by leaving a five-star rating and review for the podcast.

That stuff really, really does help us out.

This has been 404 Media.

We'll see you again next week.