Airlines Sold Your Flight Data to DHS—And Covered It Up

This week we start with Joseph's article about the U.S.'s major airlines selling customers' flight information to Customs and Border Protection and then telling the agency not to reveal where the data came from. After the break, Emanuel tells us how AI scraping bots are breaking open libraries, archives, and museums. In the subscribers-only section, Jason explains the casual surveillance relationship between ICE and local cops, according to emails he got.

YouTube version: https://youtu.be/Auc7NPD2ig4

Our New FOIA Forum! 6/18, 1PM ET

Airlines Don't Want You to Know They Sold Your Flight Data to DHS

AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums

Emails Reveal the Casual Surveillance Alliance Between ICE and Local Police

Subscribe at 404media.co for bonus content
Learn more about your ad choices. Visit megaphone.fm/adchoices


Transcript

So my 404 media colleagues probably remember when I got doxxed, which was a nightmare for everyone involved, mostly me.

My name, address, phone number, social security number, and a bunch of other information was leaked online, which led to all these spam calls, harassment, threats, etc.

Even if you're not a journalist, a sophisticated network of data brokers is making your personal information available to the highest bidder.

I fixed my problem with DeleteMe.

which is a service that basically looks you up on all these people search websites and data broker websites and formally gets you removed from them.

The subscription service removes your personal info from the largest search databases on the web, helping prevent potential ID theft, doxing, and phishing scams.

I'm a real DeleteMe customer.

I've been using it for more than five years.

Signup is so easy.

You just go to their website and then they send you personalized privacy reports showing you what info they found, where they found it, and how they got it removed.

Take control of your data and keep your private life private by signing up for DeleteMe, now with a special discount for our listeners.

Today, get 20% off your DeleteMe plan when you go to joindeleteme.com/404media and use promo code 404media at checkout.

The only way to get 20% off is to go to joindeleteme.com/404media and enter code 404media at checkout.

That's joindeleteme.com/404media, code 404media.

Hello, and welcome to the 404 Media Podcast, where we bring you unparalleled access to hidden worlds, both online and IRL.

404 Media is a journalist-founded company and needs your support. To subscribe, go to 404media.co, where you'll also get bonus content every single week.

Subscribers also get access to additional episodes where we respond to their best comments.

Gain access to that content at 404media.co.

I'm your host, Joseph Cox, and with me are 404 Media co-founders, Sam Cole.

Hey, Emanuel Maiberg, hello, and Jason Koebler.

Hello, hello.

So, real quick,

hopefully, you're hearing this podcast in time.

You should be because we publish this to subscribers Tuesday evening, and then Wednesday morning, free subscribers get it.

But on Wednesday, the 18th, that's tomorrow or today, depending on where you're listening, at 1 p.m.

EST, we're going to be having our latest FOIA Forum. This is a live-streamed event of an hour, realistically two hours we usually go over, where we're going to explain to you how to pry records from the government using Freedom of Information requests and public records requests. Specifically, we're going to be talking about a story that Emmanuel and Jason did a while back about a company called Massive Blue, which was making these AI personas for cops that pose as college protesters. Really, really wild stuff.

So, if you want to learn how we did that and how you can replicate those requests,

please become a paid subscriber.

Or if you already are one, keep an eye out for an email with a link to a live stream.

We've tried to plug it in a lot of places. You know, I'll put a link into the show notes here as well, and we also tried to put it at the top of the emails as well.

And beyond that, Jason, I think you wanted to talk about merch as well.

Yeah, we have merch back in stock.

Our 404 code tank tops were incredibly, incredibly popular.

So thank you all for ordering them.

I ordered a bunch more.

We have them in every size.

So if you want one of those, you can go to 404media.co and then click merch and you'll see them there.

And also, if you pre-ordered them, that means that your pre-order is going out very, very soon.

So thank you for the patience there.

Should we get into it?

Because I believe I'm asking you some questions, Joe.

Sounds good.

Yeah.

So, the first story we're talking about this week is: airlines don't want you to know they sold your flight data to DHS.

This is a really wild story.

I didn't know about this at all, that this was happening.

Where did you first find the story?

So, on May 1st, I noticed that Immigration and Customs Enforcement, ICE, had a new contract in these government procurement databases.

Basically, what I do is I have a shortcut on my desktop, and I'll click it, and every so often I would just check to see the latest contracts ICE has with the government.

Or, you know, I've done it for Customs and Border Protection and other agencies as well.

It's just sort of what we're covering at the moment.

But on May 1st, I saw that ICE entered some sort of contract with Airlines Reporting Corporation.

And I'm like, well, that sounds interesting.

What the hell is that?

And I filed a FOIA.

And then I looked for other agencies that had deals with Airlines Reporting Corporation.

And we'll get into what those were as well.

But the main one that this story is about is Customs and Border Protection, CBP.

I filed those FOIAs.

Then ICE actually released some more documents about this purchase of data. And The Lever actually reported that about a week or so after.

And now, what we have are these documents that I got from CBP, and they lay out in much more detail the sort of data the DHS is buying, the use cases for it.

And I'm sure we'll get into the most important thing, which you highlighted when you were editing the article: the fact that the airlines were basically trying to cover it up, right?

I feel like that stood out to you.

It really did stand out to me because

I don't know exactly the language that they used.

I should have the story up, which I don't.

My computer exploded while we were recording this podcast.

So I lost my tabs, but they are back up now.

But you have it up, so why don't you read it?

Yeah, so one part of the document, and this is the contract between Airlines Reporting Corporation, ARC, and CBP, tells the agency to, quote, not publicly identify vendor or its employees individually or collectively as the source of the reports unless the customer is compelled to do so by a valid court order or subpoena and gives ARC immediate notice of same.

In other words, you aren't allowed to reveal where this airline data came from that you were using to generate internal reports or whatever else you're going to make with the data.

Right.

So this is part of a travel intelligence program, as you said, through ARC. And as I understand it, this is like a company slash entity that was spun up by most major airlines in the United States for the purposes of selling customer data. More or less, it is a data broker that is owned by American airlines, and by American airlines I mean United States-based airlines, including American Airlines.

And American Airlines.

Literally, yes.

Yeah.

So they make this data broker.

And the way it works is that

when you book a flight with a travel agent, like maybe that's online, or maybe you go to a physical one, there has to be some sort of conduit between the travel agent and you and the airline.

And ARC sits in the middle of that transaction and it's able to get this data.

I mean it provides a legitimate service there.

It routes this information.

It allows these bookings to take place.

But on the side, what ARC does is it develops products based on that data.

So maybe they can see, oh, wow, the number of flights went up after COVID or something.

That's just hypothetical, but there's all of these sorts of trends and that sort of thing.

But what they're also doing, you know, according to these documents we got and the ones published by ICE, is that ARC has a side hustle, basically,

of selling this data to the government as well.

And you mentioned some of the airlines.

I mean, there's ones on the board.

I'll just double-check them.

Yes, they have representatives from Delta, Southwest, United, American Airlines, Alaska Airlines, JetBlue, and then you have Lufthansa and Air France from Europe as well, and Canada's Air Canada.

And there's a little bit of discrepancy in the documents we got.

It says eight major US airlines own it, and then another one says nine. I think one probably joined over time.

We just frame it as at least eight airlines own this data broker.

Right.

And when you say travel agents, I mean, obviously, you were like, if you go to a travel agent, that would be a conduit.

But you're talking about sites like Expedia, for example.

Yeah.

You know, like really widely used websites.

I just want to stress that it's like this is not affecting only people who are going to a specific travel agent.

It's like it's third-party booking services of which there are many.

Yeah, that probably would have actually been a better way to phrase it, but like third-party booking services.

Yeah, it's not just obscure brick-and-mortar travel agents in your neighborhood or something like that.

It's massively popular sites like

Expedia where this data is being essentially harvested from.

In I think it was the ICE documents, or maybe it was the customs and border one that we got.

Interestingly, DHS says ARC does not contain data if somebody books a flight directly with an airline,

which is kind of interesting because you go, well, it's with the airlines, won't they just sell it?

No, because ARC is not in the middle of that transaction.

You're going straight to the airline, you're booking with JetBlue or United or whatever.

That doesn't end up in ARC's really big data set of billions and billions of records.

And I guess I should say that's passenger names,

the credit card used, which I found really interesting.

You can search by credit card.

And then of course, the flight itineraries.

So you know where someone has been,

where they maybe are going to fly that day. Or, something I found really interesting: you kind of know what they're going to do in the future, which isn't really the case with a lot of data we cover, like location or whatever.

It is predicting and showing where someone is going to be at a later date, which is pretty novel.

Yeah, yeah.

I guess I'm curious, like, do we know what law enforcement does with this type of data?

Because

some of the responses I saw to this article, and I don't think they're good responses, but some of the responses I saw were like, well, you have to show your ID when you get to an airport.

And therefore, like DHS will know that you're going to be there.

I also assume there's like some sort of roster or something.

I actually don't understand exactly how this works.

And I'd be curious to either read more about it if someone's already reported this or to do more reporting on it.

Like, how does DHS know which people are going to be at an airport on any given day?

And I would imagine that this is one of the ways, right?

It's, as you said, they can then predict

who is going to be where and at what times, because they have this sort of like future data?

Yeah, and I mean, I think an important thing to remember is that DHS is not a monolith, right?

Like, TSA is going to know who is in an airport at that time because you're showing your ID to the TSA agent and you're literally right in front of them.

Like, you're announcing yourself, basically, right?

And they're going to have access to other data along the way there.

Other parts of DHS can get this data, and potentially in other ways as well.

But again, it's not like a one-size-fits-all solution.

The reason that Customs and Border Protection says it's buying this data is that it's for the Office of Professional Responsibility, OPR, which is basically like its internal watchdog, its internal affairs. If somebody in Customs and Border Protection is doing something corrupt or criminal or whatever, this internal affairs unit can and is supposed to investigate them.

And when I got a statement finally from Customs and Border Protection about this, they said it's just used for that.

It is just used for that division or unit to investigate those sorts of people.

And that's all well and good.

Some people may even say that that's a legitimate and a good use case.

But we couldn't have that conversation until now, because we published it and because we found out, and the airlines were trying to cover it up in the first place.

You know, like it's really about the sale rather than the use.

Well, there's that, which I think we should talk a little bit more about, but then

DHS is not the only agency that has bought this sort of data.

Like, ARC has deals with other agencies as well, right?

Yeah.

So, again, when I first saw the ICE deal, then I did a bunch of FOIAs, and we're still waiting for the vast majority.

But beyond Customs and Border Protection, there's the Secret Service, the SEC, DEA, Air Force, U.S. Marshals Service, TSA, funnily enough, and ATF, the Bureau of Alcohol, Tobacco and Firearms.

Now,

I don't know, maybe SEC is using it for a very different reason to DEA.

You would imagine so, because those agencies have completely different mandates, but we don't know specifically what they're using it for yet.

And that's why we have all of these freedom of information requests out.

And again, maybe it comes back and they're using it for fairly innocuous purposes.

Maybe some are using it for much more interesting ones.

But the sale is happening in the first place.

And, you know, because the data is being sold, there isn't really a legal mechanism there.

They're just buying access to it.

Right.

And I mean,

what really stood out to me again is that it's happening through this third party. It's like happening through this umbrella corporation, you know, ARC. Again, Airlines Reporting Corporation, which no one has ever, ever heard of because they have an extremely low profile.

And then again, no one has heard of it because in its contract, it says, don't say where the data came from.

And that's like one of my favorite things to FOIA.

And that's like a really great thing to FOIA if people are listening to this and are interested in it.

A lot of times,

when companies sign contracts with the government, the company will try to put a non-disclosure agreement into the contract, but that non-disclosure agreement itself is subject to FOIA just because of the way that FOIA works.

And that is a public record.

It's taxpayer money that's being used to purchase this, and therefore it should be available.

And so this wasn't a specific non-disclosure agreement, but it was a section of the contract that said, hey, don't say that we signed this contract.

Don't say that the airlines were the source of the data.

And it's an example of these companies, these airlines, sort of like double dipping. It's just them finding other ways to monetize, other than just selling you access to a flight. They're figuring out, like, okay, well, now we have this huge information database about who is flying, where they're flying, what credit cards they're using, that sort of thing.

How can we further monetize this?

And I think that is a conversation worth having.

Yeah, and I think that's why so many people were pissed off at this.

You're already paying for a flight where you're going to be crammed into some economy seat with no legroom.

You're going to have to pay extra for a bag.

You have to pay for Wi-Fi or something.

And then, on top of all of that, we're also going to sell your flight data to the government.

And I don't think people are particularly happy about that.

You mentioned the non-disclosure agreements, and it reminds me of when we covered a lot of location data being sold to the government.

That is, ordinary apps installed on your phone sending location data off to a company, and then they sell it directly, or it gets sold to somebody else who then sells it to the U.S. government, including Customs and Border Protection.

Funnily enough, you go through the sort of contracts related to that.

And there was one for a tool called Locate X made by Babel Street.

I think.

And there was a sort of amendment in there where it said, you cannot use this information in court, and, like, you can't reveal this information.

It's supposed to just be used for like leads and tips and intelligence.

And it reminded me of that basically, where you have these government agencies buying data and then

there may be no transparency or accountability about where that data came from or how it's being used, by design.

And I guess that also leads to, I feel it's obvious, but almost to stress it: this isn't being done with a warrant. I don't think you necessarily need a warrant to get flight data ordinarily, but this isn't just talking about one or two flights. It's talking about Customs and Border Protection and potentially these other agencies buying bulk access to billions of people's flight records that they can then search through basically at their own whim.

You know, I didn't see anything in the contract that says you can only use this for national security.

You can only use this for combating terrorism or something like that.

I didn't see any disclaimers like that in the contract.

So, at least theoretically, until we get more information, it's kind of up to the customer to do what they want with this information.

And we see that when people, when law enforcement agencies buy data, because that's exactly why they're buying it, they want to be able to do what they want without the legal processes in place.

Right.

This came up in the context of some of our Flock reporting, which I'll just like quickly run through, but a few commenters on our website were saying, well, why don't the cops need to get a warrant to search

for license plate data or whatever?

And the argument that one would make is like, you don't have an expectation of privacy when you're in public.

There is nothing stopping a cop from standing on a corner and writing down the license plate of everyone that drives by.

But what our laws were like not really written for was the automation of these sort of things and the privatization of it and also the fact that it's done at scale and in like a historic way.

And so there is a really interesting

lawsuit in Virginia about Flock about whether

the automation of this type of technology does change the calculus as to whether cops need a warrant or not.

And we're going to be following that.

But basically, it's like, you can stand on the corner and look at a license plate, but can you stand on every corner of every street at the same time with an automated camera, take a picture, log that into a database, you know, make a historic record of where a specific car has gone, and

do that all over the entire country all at once.

And that's a little bit like what we're talking about here, where private companies get more and more into surveillance, they are deploying technologies and they're doing things that are allowing for like really big, like large-scale mass surveillance.

And then because the cops are buying access to these databases, the cops feel like they don't need a warrant because the police themselves are not the ones who are doing the surveilling.

They are like buying access to a commercial product.

And then the commercial company is the one that's actually like doing the surveillance.

And I think that's a little bit like what's happening here and like what we've seen over and over again with social media monitoring companies, with

data brokers in general.

And

I do think it's like a big flaw in our privacy laws and something that we need to talk more about.

Yeah, absolutely.

I guess last thing on this is

future reporting, future FOIAs on this.

Like, what are you looking into next here?

Yeah, it's really just waiting to get back those contracts from those other government agencies.

Also looking into whether local police have access.

One part, when I originally wrote this story, it was focused on the fact that the contract says Customs and Border Protection is using this data in part to support state and local police, which is obviously very interesting.

We were right, when you edited, to bring basically the cover-up higher up into the story, but I find that very, very interesting.

Do local police have access to this?

I mean, I think that would be crazy, but I've seen some pretty wild things over the last few years.

So there's that.

There may be more emails about it and that sort of thing.

And yeah, just who has access to this data on a wider scale, really.

All right.

Should we leave that there?

Yeah, let's leave that there.

When we come back, I beat you.

After the break, we will talk about AI bots that are scraping museum websites, open libraries, archives, et cetera.

It's a story by Emmanuel.

We'll be right back after this.

You know what doesn't belong in your epic summer plans?

Getting burned by your old wireless bill.

While you're planning beach trips, barbecues, and three-day weekends, your wireless bill should be the last thing holding you back.

That's why I made the switch to Mint Mobile.

With plans starting at 15 bucks a month, Mint Mobile gives you premium wireless service on the nation's largest 5G network.

It's the coverage and speed you're used to, but for way less money.

So while your friends are sweating over data overages and surprise charges, you'll be chilling, literally and financially.

Say bye-bye to your overpriced wireless plans, jaw-dropping monthly bills, and unexpected overages.

Mint Mobile is here to rescue you.

All plans come with high-speed data and unlimited talk and text, delivered on the nation's largest 5G network.

Use your own phone with any Mint Mobile plan and bring your phone number along with all your existing contacts.

Ditch overpriced wireless and get three months of premium wireless service from Mint Mobile for 15 bucks a month.

I realized that by sticking with the expensive guys, I was literally throwing money away.

Get cell service that works great for much less with Mint Mobile.

This year, skip breaking a sweat and breaking the bank.

Get your summer savings and shop premium wireless plans at mintmobile.com/404media.

That's mintmobile.com/404media.

Upfront payment of $45 for three-month 5 gigabyte plan required, equivalent to $15 a month.

New customer offer for the first three months only, then full price plan options available.

Taxes and fees extra.

See Mint Mobile for details.

America is starting to talk more about mental health, but for lots of men, it still remains a taboo.

Just know that it's okay to struggle and that life is full of ups and downs.

Whether you're going through a rough period or want to make sure things keep going well, therapy can help you make sure you're at your best for yourself and everyone in your life.

There's no shame in therapy, and you're not alone.

Therapy is not just for people who have experienced major trauma.

Millions of people use BetterHelp to learn coping strategies, work through their depression or anxiety, and learn how to positively deal with the pressures of everyday life.

With over 35,000 therapists, BetterHelp is the world's largest online therapy platform, making it really accessible and flexible.

You'll definitely find a therapist that works for you and fits into your busy schedule.

If you need to switch therapists at any time, cancel or reschedule an appointment, or get in touch with your therapist, you can do it with the click of a button.

As the largest online therapy provider in the world, BetterHelp can provide access to mental health professionals with a diverse variety of expertise.

Talk it out with BetterHelp.

Our listeners get 10% off their first month at betterhelp.com/404media.

That's betterhelp.com/404media.

All right, and we are back.

As Jason said, this is one written by Emmanuel, and the headline is: AI scraping bots are breaking open libraries, archives, and museums.

Emmanuel, this is based on a survey. Just to lay the groundwork:

Who made this survey and what did it find at a high level?

And then we'll get into some of the really interesting specifics.

So, this was written by Michael Weinberg, who works at NYU at something called the GLAM-E Lab. And that is something that NYU and the University of Exeter work on together.

And it is basically an organization that helps small libraries, galleries, archives, and museums take their collections, digitize them in some way, and make them available for everyone online for free.

So

we don't know what organizations were surveyed exactly, right?

It's like an anonymous survey, is that right?

Yeah, so

we have heard anecdotally, and it's something that we've reported on before,

that AI scrapers, which are these bots that kind of trawl across the internet, look for valuable training data, and then hoover it up so they can train AI models, are flooding all these open resources

with too much traffic, more than they can handle, and taking them offline in some cases.

And you hear about that happening at this library or that museum or this collection, but this is the first attempt by someone to like quantify the problem and see how widespread it is.

And the bottom line is that it is very widespread, but there are some limitations that

Weinberg acknowledges in the study.

First of all, he invited as many organizations as possible to participate.

Only 43 of them participated, and it's possible that they are self-selecting in some fashion.

But

what do you mean by that?

It's possible that some museum or some library saw the request and was befuddled by it, and were like, what are you talking about? We don't have this problem, and didn't respond. And obviously, if the library did experience something, they, you know, chose to participate, right?

Um, so those 43 respondents, which we could talk about in some more detail, they're anonymous. A, so they can speak more freely about what they're seeing and share some more private data and analytics about like how much traffic they're getting and what is knocking them offline.

And I would say most importantly, and I think this would probably be familiar to you from security reporting,

they don't want to be too specific about who they are and what they're doing to stop the AI scrapers.

So the scrapers don't learn about the countermeasures and then can better circumvent them.

So yeah, that's kind of who's involved in this and why they're anonymous.

Yeah, do we have any idea what sort of scrapers or specific scrapers we're talking about?

Like, are we talking about ChatGPT or anything like that?

Or do the libraries not know?

And that's part of the problem.

You know what I mean?

There's also an attribution problem, right?

It's hard to say for sure.

Some of them, so the report doesn't name the specific scrapers, but we do know from experience

that Anthropic, sorry, not Anthropic, Perplexity, for example, in the past has been caught ignoring robots.txt,

which is this file that site owners can put on their website to tell bots not to scrape it.

And in the past, this was sort of like an accepted norm that was respected, but increasingly, as this training data is becoming more valuable,

it's ignored.

And like Perplexity is one that has repeatedly ignored robots.txt.
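As an aside for readers who haven't seen one: here is a minimal sketch of the robots.txt convention being described, using Python's standard urllib.robotparser. The rules, URL, and bot name "ExampleAIBot" are made up for illustration; a compliant crawler checks the file before fetching, while the scrapers discussed here simply skip that step.

```python
# Minimal sketch of the robots.txt convention (illustrative only; the rules,
# URL, and bot name below are hypothetical, not from any real archive).
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# A typical opt-out rule a library or museum might publish at /robots.txt:
rp.parse([
    "User-agent: ExampleAIBot",   # hypothetical AI crawler
    "Disallow: /",                # asked not to fetch anything
    "User-agent: *",
    "Allow: /",                   # everyone else may crawl
])

url = "https://archive.example.org/collections/item/123"
print(rp.can_fetch("ExampleAIBot", url))     # False: a well-behaved bot stops here
print(rp.can_fetch("FriendlyIndexer", url))  # True
```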

Others self-identify, whether they ignore the robots.txt file or not.

They self-identify what the bot is.

And other times

the organizations can make a guess based on the IP ranges that are hitting them.

They're like, oh, these IP ranges are clearly coming from Alibaba.

So it's safe to assume that Alibaba is scraping this website for AI training data based on the behavior, but it's hard to say for sure.

It's possible that somebody is using Alibaba infrastructure, but it's actually a different company.
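To make that guesswork concrete, here is a minimal, hypothetical sketch of the kind of IP-range attribution being described. The CIDR range and addresses are placeholder documentation values, not ranges actually published by any provider.

```python
# Hypothetical sketch: attributing scraper traffic by matching request IPs
# against address ranges published by a cloud provider. The range and the
# log entries below are placeholder values for illustration.
import ipaddress

SUSPECT_RANGES = [ipaddress.ip_network("203.0.113.0/24")]  # placeholder CIDR

def likely_scraper(ip: str) -> bool:
    """True if this source address falls inside a suspect provider range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in SUSPECT_RANGES)

# e.g. scanning source addresses pulled from an access log
for source in ["203.0.113.45", "198.51.100.7"]:
    label = "suspected scraper" if likely_scraper(source) else "unattributed"
    print(source, "->", label)
```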

Yeah.

So what's some of the impact?

You say that like some get knocked offline or maybe take themselves offline.

Somebody here has used a quote of a DDoS attack comparing it to that.

Like what's some of the concrete impact that these scrapers are having on like these open databases and archives and kind of ruining it for everybody?

Yeah, so one interesting thing about the report is that in the vast majority of cases, the only reason that an organization knows this is even happening is because their services are degraded to some noticeable degree.

The site slows down, it's not accessible at all.

This was just a coincidence, but last week, I think on Friday, University of North Carolina Chapel Hill, which is a big university, a research university, has a very robust kind of online library full of books and papers, and it's something that students use, teachers use, just the public can use.

And

they found out that this was happening to them because nobody could access it,

which is very disruptive to the organization and the student and the teachers and all of that.

And they have a big IT department and they solved it by deploying some new kind of firewall that, again, they don't want to talk about in too much detail.

So, people don't learn how to circumvent it.

But that's sort of like a typical example of how

people know that it's a problem.

The impact is that these resources exist for the public, and the whole goal of these organizations and of the GLAM-E Lab, where Weinberg works, is to make this cultural heritage, as he calls it, available to as many people as possible.

That's the mission of the organization.

It's like, oh, there's a little museum in France that has like a bunch of manuscripts that you can go see if you visit it.

But wouldn't it be great if they just digitized everything and made it available online?

And it's like, yes, that would be great.

But then that opens them up to these scrapers, and all that data is very valuable now.

So the impact is that the public no longer has access because it's being hoovered up so aggressively by all these different AI companies.

Yeah, like people want this data to be accessed by the public for the reasons you just laid out.

But it seems that the trade-off is you make it publicly accessible, you get swarmed by all of these bots, which are going to degrade the archive and, you know, potentially knock it offline or whatever.

Is there basically nothing to be done with that trade-off?

Like, is it sort of, I mean, this is a bad way to put it, but like the cost of doing business because it's not a business, but you see what I'm getting at?

Like, is there just nothing to be done or what?

So, there are things that people can do.

The response to the story

has been very interesting because I feel like I've heard from a bunch of other institutions, which I don't know if they're included in this survey because it's anonymous, but judging by their response, I think they were not.

So the problem is, like, again, demonstrably widespread.

And people have kind of been telling me

interesting things about what they're doing and what their solutions are.

And I hope to have a story in the next couple of weeks about some interesting solutions.

I want Jason actually to talk about some solutions that he reported on.

Cloudflare has a thing.

There's kind of like these funny solutions to trick the scrapers.

But I'd also like to talk about this tension. I'm going to make a tortured analogy, but it's something that I talked to Weinberg about. But in 2023, this book came out called The Art Thief.

Has anyone, have you guys heard of this at all?

Really great book, nonfiction.

It's about one of the most prolific art thieves in history.

He worked in the early 2000s in Germany and France, and he stole more than 200 pieces.

And the way that he did this is he didn't steal like gigantic, famous pieces.

He wasn't going after the Mona Lisa.

He just went to these small regional museums in the countryside and like stole tons of small pieces.

Altogether, they were worth like, I don't know, $2 billion or something like that.

And it's a really fascinating story about why he did it, how he did it, what happened when he got caught, and all of that.

But one of the lessons of the story is

that

the author talked to the owners of these original museums, and they explained that when something gets stolen from one of these museums, the damage isn't only that the piece is gone, it's that it breaks the social contract of how these museums operate, right?

It's like

these museums might have a security guard, they might have security cameras, but it's not like Ocean's 11, right?

There aren't like lasers and heat sensors keeping the pieces safe.

The social contract is

this art, this history, these texts are part of our collective cultural heritage.

And we're putting in the work to make it available to the public because it belongs to everybody.

And

the public, in return, kind of like agrees to be respectful and not fuck with it.

And when somebody steals something, they break the social contract, and that forces the museums to lock everything down and make it less accessible.

And this is kind of what is happening online as well.

So one thing that people can do, right, the people who manage these collections, they can have people log in, they can have CAPTCHAs, they can have all kinds of like friction that would make it hard, if not impossible, for an AI scraper to get all the information, but would require a little bit more from human users as well.

And the maintainers of these collections are very reluctant to do that because the entire point of doing this, like the entire point of digitization and the GLAM-E Lab and all this stuff, is to make it as accessible as possible, right?

It's like a very benevolent mission that these people have.

So there's that issue.

Like they're reluctant to do it because they want to make it so available.

And then the other thing is that,

and this is something that Weinberg really emphasized,

and he focuses on small and medium-sized organizations, but he says, even in like a big organization, once they digitize something, there's maybe one person who is responsible for keeping that stuff online and functional.

It might be someone's job on top of a totally different job that they already do.

It might be a volunteer.

And any change or update that you force them to introduce is very, very difficult to implement if not impossible, right?

It's like if you go to one of these organizations, you go to one of these small museums and they're like, hey, we need to implement a CAPTCHA.

We need to implement a login.

We need to implement something like that.

They're like, well, we're just going to take this offline because we can't do this at all, right? It's just impossible for us to put in the work on top of what we already did to digitize it, so we're just not going to have it at all. Um, Jason, do you remember the Cloudflare, like...

Yeah, yeah. So, I mean, this is, it's very grim for the reasons that you just said, because a lot of these organizations are probably, like, barely financially solvent, depending on who they are and what they are.

And it's expensive to keep this sort of thing up.

And then that's to say nothing about the status of like the actual things that are being scraped.

I assume some of them are in the public domain by now.

If they're like really old, a lot of them probably are not.

But we found that, you know,

AI companies don't really care.

I did write a long time ago about different types of mazes that have been deployed.

There was one that was like a DIY open source one

by a specific programmer.

We probably talked about it maybe six months ago or maybe a little longer than that.

He called it an AI tar pit, and it was just like an infinitely generating website that a human being would click off of pretty much immediately, but that an AI scraper would scrape over and over and over again, kind of indefinitely.

For something like a museum to spin this up, it's like, part of the point of an AI tar pit is to waste a scraper's time by creating an infinite number of pages, which doesn't really do anything to help the museum, because that uses a lot of their own bandwidth, because they are allowing the scraper to hit it.

They're just hitting like nonsense over and over and over again.
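For a sense of how simple the idea is, here is a toy sketch of a tar-pit-style endpoint, not the specific open source tool mentioned above, written with Flask. Every response invents fresh links that lead nowhere, so a crawler that blindly follows links keeps going, at the cost of the host's own bandwidth, exactly the trade-off described here.

```python
# Toy tar-pit sketch (illustrative, not the tool discussed above): every page
# links to ten more pages that are generated on the fly, so an indiscriminate
# crawler can loop here indefinitely. Serving it still costs bandwidth.
import random
from flask import Flask

app = Flask(__name__)

@app.route("/maze/", defaults={"slug": ""})
@app.route("/maze/<path:slug>")
def maze(slug):
    links = "".join(
        f'<li><a href="/maze/{random.getrandbits(32):08x}">deeper</a></li>'
        for _ in range(10)
    )
    return f"<html><body><ul>{links}</ul></body></html>"

# Run with: flask --app tarpit run
# A human clicks away immediately; a bot that ignores robots.txt may not.
```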

But Cloudflare, the gigantic internet infrastructure company, released something very similar to this.

It's like a similar design, and that is something you can put in front of it now.

You can also, as you said, you know, put a login wall

sort of depending on the scraper, like they may or may not want to try to get past that.

And, you know, something that we did to preserve the cultural works here at 404 Media was put them behind a login wall sometimes.

And I think that that's helped.

I think I'll probably talk about this more later, but I went to a journalism business conference two weeks ago for 404 media to talk about this.

And a lot of big news outlets were there talking about how they are trying to protect their own sites from AI scrapers. And I believe it was the Daily Mail that was there, and they gave a presentation about the fact that you sort of need to

stop these scrapers very early on and

also catch them in the act more or less so that you can then go to the company and say like we know that you are scraping this when you should not be.

And

for the Daily Mail, it was for the purposes of trying to strike a deal with OpenAI or with these different companies, like, say, hey, I know you're trying to steal this stuff.

We have stopped you with our login wall or with our, you know, robots.txt or the various things that they're doing.

Like, let's strike a deal here.

But one of the points that their business person was making was like, once this stuff is scraped, you kind of like lose a lot of your leverage unless you're willing to sue them.

And that can be really expensive.

We don't even know if it's going to be successful.

There's tons of lawsuits out there right now that are still ongoing and that, you know, we have been following and will continue to follow.

But for something like a museum,

it's interesting because I bet their collections don't change all that often.

They are getting hit by these scrapers.

The value has already been like extracted, probably in many cases.

But the way that these scrapers work, they're probably coming back over and over and over again and hitting them over and over and over again, even though they've already gotten what they want, which is like really frustrating.

Just to illustrate that, like the UNC thing that happened last week, the IT people were explaining that the information is easy to get, but they have a search engine.

And what the bots were doing was just like spamming it with different search terms.

So it's like an incredibly inefficient way of extracting the data.

You know, it's like, if it was just an agreed upon, hey, can we please have your data? We'll pay this much for it. The organization could maybe benefit from it. And then also you won't have to like DDoS the library in order to get it.

I mean, or if it's like, okay, scrape us like once a year or once every six months. Don't scrape us like constantly.

I mean, ideally, scrape us not at all, but it's like, please don't scrape us daily.

And then the other thing is just like, there's constantly new

bots that are doing this associated with new companies.

Companies that already exist are creating new bots to scrape for different purposes.

And so it's not like you can just protect against, you know, the OpenAI scraper.

You need to protect against all the different types of scrapers that different companies might be running, remember what the names are, keep up to date with what they're called, so on and so forth, and figure out how to block all that traffic. And it's like extremely not trivial. There's a couple of different products that have been released to try to automate this, but it is still, like, it's a permission structure that's really messed up, because it's opt-out, not opt-in, and you don't even know what you're opting out of, because there are constantly new ones that you have to think of, and there's like new strategies that the AI companies are using to circumvent robots.txt, because they don't care for the most part.

Yeah.

And opting out is, as you say, not straightforward.

You have to like fight for that through technical or potentially legal means.

All right, that was fascinating.

I'm definitely interested to hear what else we find about that.

If you are listening to the free version of the podcast, I'll now play us out, but if you are a paying 404 Media subscriber, we're going to talk about the, frankly, casual surveillance relationship between ICE and local cops, and that's according to internal emails that Jason got.

You can subscribe and gain access to that content at 404media.co.

As a reminder, 404 Media is journalist-founded and supported by subscribers.

If you do wish to subscribe to 404 Media and directly support our work, please go to 404media.co.

You'll get unlimited access to our articles and an ad-free version of this podcast.

You'll also get to listen to the subscribers-only section where we talk about a bonus story each week.

This podcast is made in partnership with Kaleidoscope.

Another way to support us is by leaving a five-star rating and review for the podcast.

That stuff really helps us out.

Here is one of those from Spellchecker.

The 404 team does a fantastic job at everything they do.

Independent journalism for the win.

I feel like I've never read that one before.

I'm sorry if I did.

This has been 404 Media.

We'll see you again next week.