AI is killing the internet
This episode was made in partnership with Vox’s Future Perfect team. It was produced by Gabrielle Berbey, edited by Amina Al-Sadi, fact-checked by Rebeca Ibarra, engineered by Patrick Boyd and Andrea Kristinsdottir, and hosted by Sean Rameswaram.
Listen to Today, Explained ad-free by becoming a Vox Member: vox.com/members. Transcript at vox.com/today-explained-podcast.
Noted fan of the internet Al Gore with his boss at the time, President Bill Clinton. (Photo by Sharon Farmer/White House/Consolidated News Pictures/Getty Images)
Learn more about your ad choices. Visit podcastchoices.com/adchoices
Transcript
Artificial intelligence is scraping the internet.
It's gorging all the websites to give you what you want.
It's actually kind of gorging everything to give you what you want, and the makers of everything are not very happy about it.
Sarah Silverman is suing, Sony is suing, Dow Jones is suing, the New York Times is suing, authors are suing.
But in one author lawsuit, AI kind of won.
Specifically, Anthropic's AI, which goes by Claude.
Well, Claude's not cool, but Claude's uncool the same way I'm uncool, see?
So Claude's win in court is scaring the makers of everything, and we're going to talk about why on Today, Explained.
Thumbtack presents Project Paralysis.
I was cornered.
Sweat gathered above my furrowed brow, and my mind was racing.
I wondered who would be left standing when the droplets fell, me or the clogged sink.
Drain cleaner and pipe snake clenched in my weary fist.
I stepped toward the sink and then.
Wait, why am I stressing?
I have Thumbtack.
I can easily search for a top-rated plumber in the Bay Area, read reviews, and compare prices, all on the app.
Thumbtack knows homes.
Download the app today.
With a Spark Cash Plus card from Capital One, you earn unlimited 2% cash back on every purchase.
And you get big purchasing power.
So your business can spend more and earn more.
Capital One, what's in your wallet?
Find out more at capitalone.com/sparkcashplus.
Terms apply.
Today, Explained from Vox.
I'm Sean Rameswaram, here with Jason Koebler, tech reporter and co-founder of 404 Media.
I am a journalist who covers AI, but I'm also a business owner because we have our own small publication.
And so I'm very interested in what is going to happen with all of these
AI companies getting sued on copyright grounds.
There's dozens of lawsuits at this point, and I'm concerned about it both as a journalist who has had my work scraped, but also as someone who has like a direct financial interest in it.
And so about a month ago, there was this decision in a case against Anthropic, which makes the AI tool called Claude.
And it's not necessarily that this is the biggest AI copyright case, but is the first real major decision where we get a judge sort of pointing at how he is thinking about these issues of massive AI companies scraping authors' work, scraping artists' work, scraping musicians' work.
And who sued Anthropic?
Yeah, so it's three authors.
Their names are Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson.
Three authors claim Anthropic built a multi-billion dollar business by misusing copyrighted works and pirated writings without permission and without paying the authors for their work.
This lawsuit is really just the latest as many other authors, journalists, record labels, artists, creators, they try to wrestle back control of their work.
To be totally honest, I didn't know them before this lawsuit.
To be totally honest, I still don't.
They sued them because they learned that their books were included in this data set called Books3, which is, at this point, a really controversial data set that contains a few hundred thousand books. The Atlantic at one point got a copy of Books3 and published a search tool that allowed authors to see: is your book in this data set?
Author Drew Hayden Taylor had no idea that nine of his works were part of Books3, a massive data set used by tech companies to train artificial intelligence.
Well, it's a combination of being flattered and being concerned.
We're all just like little ants who don't mean anything to the big billionaires.
They don't want to pay us for our words.
They'd rather just take it.
I'm so mad.
If your book is on here, I'm so sorry.
I'm just like so sad for so many authors today.
These authors learned that their books were in Books3, Anthropic trained on Books3, and therefore Anthropic trained on their copyrighted works.
And so that formed like the basis of this lawsuit.
So the really interesting thing is that in the early days of this debate, and it's one of the hottest debates at the moment between artists, journalists, and authors on one side and the AI boosters, companies, and maximalists on the other, the question was: is it fair use to scrape this stuff en masse, turn it into a huge data set, and then use large language model technology to create these tools?
And at first, the AI companies were very skittish about saying that they had trained on copyrighted work at all.
AI should be allowed to read the internet and learn.
Shouldn't be regurgitating.
Shouldn't be violating any copyright laws.
But on individuals' private work, yeah, we try not to train on that stuff.
We really don't want to be here upsetting people.
But as these cases started going to court, and as they entered discovery, and as it became clear that every major AI company was training on copyrighted work, their argument went from "well, we can't say what we trained on because this is proprietary" to "of course we trained on copyrighted work. We had to, and it's legal."
And it's legal because our use of it is transformative and therefore it's protected by the fair use tenet of copyright law.
Section 107 of the Copyright Act reads, transformative uses are more likely to be considered fair.
Transformative uses are those that add something new with a further purpose or different character and do not substitute for the original use of the work.
That's what they argued, and that's what the judge ultimately decided.
What he decided in this case was the scraping of these three authors' books was considered fair use under copyright law.
But there is a huge caveat here where he decided that the way that Anthropic went about acquiring the books in the first place was piracy.
Okay, so the judge essentially hands down a split decision saying that, yes, this is fair use to use these authors' work this way, but also it wasn't totally fair how you got this stuff because it was pirated.
So I don't know, what does that mean?
Does everyone go home unhappy?
Or was this like a huge win for Anthropic?
Doesn't feel like a huge win for the authors.
Yeah, I mean, I don't think it's a huge win for anyone yet.
And I think the people who are saying this is a slam dunk for Anthropic, and many people in the AI world are saying it's a huge win for Anthropic, are wrong.
And the reason that I think they're wrong is because
the judge determined essentially that it was not copyright infringement to train
Claude on copyrighted material that was legally obtained.
But then they also downloaded books from this website called LibGen, which is a piracy website that has millions of books on it.
And then also from a website called Pirate Library Mirror, which is another piracy site that has millions of books on it.
And the judge said that obtaining the books in this way was pretty much like cut and dry copyright infringement.
And I think the really important thing to note is that every major AI company has trained on copyrighted works that they obtained in a similar fashion.
We have done reporting at 404 Media where
entire YouTube channels were scraped,
Netflix, like the entirety of Netflix was scraped.
And so
the specifics about how these companies obtained these works is potentially going to be really important.
And a lot of that scraping has already been done.
A lot of that piracy has already been done.
These companies are literally some of the richest companies on Earth, affiliated with some of the richest people on Earth.
Did they really just steal all these books?
Could they not have just gone to Amazon and bought like some books?
Or is that just too much work for them?
Well, so the super interesting thing about this lawsuit, the thing where I was like, holy shit, how did they do this, why did this happen, is that in the beginning, Anthropic pirated all these books.
They downloaded huge amounts of torrents.
They scraped these piracy websites.
And they did that specifically because they didn't want to slow down.
Like, there's an email that is part of this lawsuit where the CEO, Dario Amodei, says, you know, we don't want to get into what he calls the "legal/practice/business slog."
And so they were basically like, let's do all of this.
Let's pirate all the books.
Let's put it into our model.
And then let's go buy copies of a lot of other books.
And so what Anthropic did was they had a whole team of people dedicated to buying used books from used bookstores that were going out of business, from eBay, from these online marketplaces.
And they bought a huge, huge number of books, like physical books.
They tore the covers off of them, and they had this giant scanning operation where they would scan the books, create a digital copy, and then feed that into their model.
And the judge said that all of those books that were bought from used bookstores, no problem.
And I think that goes to show that
these AI companies are grabbing data from wherever they can find it.
It's a huge arms race to see who can get the most data from the greatest number of places.
And so they're doing the low-hanging fruit, which is downloading everything.
Yeah.
But then they're also scouring the planet, looking for bookstores that are going out of business. I've heard of AI companies looking for huge physical archives of VHS movies and things like that and then digitizing those. So really, they're just trying to find data wherever they can, and it seems like when they're able to get it legally by purchasing a copy, they're willing to do so, but they're also willing to take it for free when they can.
Did we learn anything from this lawsuit that might implicate those other ones?
Yeah, I mean, I think that the piracy aspect of this is really important.
And we've seen in the past, like if you are a 13-year-old kid who's pirating Metallica songs on Napster, you can be liable for hundreds of thousands of dollars' worth of damages.
Lars will find you.
For just like a few songs.
And like in this case, you have 7 million books.
And so it will be very interesting to see whether a judge levies a huge financial penalty here or whether it's more of a slap on the wrist. And I tend to think it will probably be more of a slap on the wrist, because all of Silicon Valley, all of America's largest companies, sort of have a huge amount of investment riding on the widespread adoption of AI.
And AI is now a huge part of the American economy.
It's become part of geopolitics as well, where you have the Trump administration, and really the Biden administration was saying the same thing.
Come on, man.
Saying that the United States can't fall behind China in the quest to innovate in AI and to have like widespread AI adoption.
I'll be very curious to see whether there are actual, serious punishments for these companies that have scraped all of this data, or whether they, you know, wiggle out of it with a slap on the wrist, or get out of it with a series of settlements, or what have you.
But I tend to think that there's probably no stopping this industry from a legal perspective.
I think that it feels too big to fail to me at this point.
404media.co is where you can find and support Jason Koebler's work instead of, you know, just stealing it.
AI companies aren't just stealing everyone's intellectual property.
They're also kind of killing the internet as we know it right before our eyes.
We're going to talk about that when we're back on Today, Explained.
Support for this show comes from Robinhood.
Wouldn't it be great to manage your portfolio on one platform?
With Robinhood, not only can you trade individual stocks and ETFs, you can also seamlessly buy and sell crypto at low costs.
Trade all in one place.
Get started now on Robinhood.
Trading crypto involves significant risk.
Crypto trading is offered through an account with Robinhood Crypto LLC.
Robinhood Crypto is licensed to engage in virtual currency business activity by the New York State Department of Financial Services.
Crypto held through Robinhood Crypto is not FDIC insured or SIPC protected.
Investing involves risk, including loss of principal.
Securities trading is offered through an account with Robinhood Financial LLC, member SIPC, a registered broker-dealer.
Avoiding your unfinished home projects because you're not sure where to start?
Thumbtack knows homes, so you don't have to.
Don't know the difference between matte paint finish and satin, or what that clunking sound from your dryer is?
With Thumbtack, you don't have to be a home pro, you just have to hire one.
You can hire top-rated pros, see price estimates, and read reviews all on the app.
Download today.
This episode is brought to you by Marshalls, where you never have to compromise between quality and price.
The buyers of Marshalls hustle hard, working to bring you great deals on brand name and designer pieces.
Because Marshalls believes everyone deserves access to the good stuff.
Visit a Marshalls store near you or shop online at marshalls.com.
Today, Explained is back with John Herrman now.
He's a tech columnist at New York Magazine.
John, in the first half of the show, we were talking about how this Anthropic case and judgment, you know, may or may not change the extent to which these big AI models can scrape the internet.
But I want to talk to you about how all this scraping has already in some ways broken the internet as we know it and how we use it.
You wrote about how AI has broken maybe like, you know, the front page of the internet for a lot of people.
Google.com.
Tell us how.
Google could not be closer to the center of like this recent AI boom.
On one hand, they are a company that has really deep roots in that space.
They published like the foundational research for what then became generative AI as we know it.
They've put it in all their products.
If you use any Google thing, you are seeing like chatbots everywhere.
Take notes with Gemini.
Summarize this file, summarize a folder, refine this document, find inspiration, easy, fresh ideas, elevate your writing, get clear, constructive, improve sentence flow, word choice.
They are all in on AI.
Google search in particular has AI overviews at the top.
There's a new AI search mode that works like a chatbot instead of a search engine.
Google making a rare change to its homepage, the most visited website in the world, pushing its AI mode tool directly into the hands of its billions of users.
With this latest move, it is changing what billions of people see when they open their browsers, still the on-ramp for the entire internet.
Meet AI mode.
Ask detailed questions for better responses.
AI on Google search can provide information.
While that was all happening, AI was also sort of accelerating this feeling of decline in the Google product, which, over the years, through this long-running back-and-forth battle between the company and search engine optimizers and companies trying to get an edge on Google, had become a little spammy, a little overloaded with ads.
Have you noticed that Google sucks lately?
I'm talking about their search.
It sucks.
Why is it so hard to find anything on Google search?
Google search is terrible.
It's bought and it's sold five or six links up top, all paid for.
It's just garbage, pure, unadulterated garbage.
But I think a lot of people would agree that using Google in, say, 2023 was kind of a degraded experience compared to 10 years prior.
It was kind of cluttered.
There was more just junk in it.
There were more ads all over the interface, but also the stuff you were getting in search was a lot of low-quality, cheaply made, aggregated content, stuff that was taken from somewhere else in an effort to sell a product or just serve up some ads.
The arrival of generative AI tools, which enable like the creation of basically infinite passable content almost for free, really accelerated that issue.
So, on one side, you have the big ecosystem that Google guides people to that is in a sort of collapse because of this massive shock of new AI-generated content.
On the other side, you have Google, the product, becoming more and more AI-centric.
And in the middle, you have kind of a complicated story and honestly, for search users and regular people, kind of a strange experience.
Do they have a plan to make money off of this?
Obviously, they want to make money.
Has anyone asked what their long-term plan is?
So there are obvious risks to throwing away this like cluttered but lucrative product and replacing it with a totally clean chat bot or whatever.
That's not what they're doing.
They are incorporating
AI answers into the main search page, which they say people like quite a bit.
So this last quarter has been really good for them.
It also arrived in the context of lots of really strong data suggesting that the way people use Google search now, with these AI tools, means that they don't really leave it anymore.
They don't really click out and go to anything.
An AI overview might summarize three articles, an archival resource, some expert opinions, but the number of people who actually then click through to those opinions or to those articles is minuscule.
So Google's relationship to the web around it is pretty, pretty dramatically different.
If Google's like eating up the rest of the internet, if Gemini is eating up the rest of the internet right now,
and companies like ours, let's say,
are no longer, you know,
meeting their traffic goals, are no longer getting any traffic from Google at all?
Like, does Gemini have like nothing to eat?
You know what I mean?
Because everything dies?
Who's going to be feeding Gemini all the right answers in like 10 years?
We're sort of like glorifying the web a bit in this conversation.
No matter how great and incredible it is as this big resource, it really doesn't go that deep.
And the idea that it is now being sort of trawled and overfished and just sort of consumed like a resource by these AI companies really does, I think, raise the specter of collapse.
I do think that they could find that their products are being made worse by this dynamic and by their relationship with the web.
I do think that's a real problem.
And you can see this in some of the deals that these companies make with publishers,
including our parent company, which has a deal with OpenAI, for example.
Remind people out there or me why companies like ours make deals with companies like ChatGPT.
The context is: every media company is struggling for visitors.
Even before the Google traffic really started to collapse, it was sort of unstable.
And so in addition to like a weak advertising market, every media company is looking for any sort of additional source of revenue.
And if you're a media executive, OpenAI showing up and saying, here is this many millions of dollars for this many years, it looks like free money.
Of course, if you're like producing the content, or if you're even just thinking longer term about how
a media company or website fits into this AI picture, you recognize that you're sort of, you know, giving access away to something that these companies are explicitly trying to automate.
You know, you're sort of like,
in an institutional sense, training a replacement.
You're listening to AI Explains today.
But it is a deal made not quite under duress, but something close to that.
For people who miss that old version of the internet, who miss going to Google, typing in a query, getting a bunch of results, clicking on a few of them, getting answers that felt credible,
where do they go for that experience now?
I think there's like a funny polarized answer to this.
I just did a story on Reddit, which is having a huge moment right now.
It's been around for 20 years.
It's growing hugely.
And part of it is just a response to, you know, social media fatigue, the sense that other communities on the web don't really exist anymore, that everything else on the web is too commercial and whatever.
Also, a huge part of that growth is just traffic from Google.
They're having the fastest growth they've had in almost their entire existence because Google is just shoveling so many people into Reddit, because everything else is not really working.
So you have that.
You have a community of communities.
You have something that feels kind of like it's of the old web.
It seems like eventually we're going to get to the point where it's like you either want to talk to one of these large language models or you just go back to like calling up your friend.
I don't even know where it gets.
You just walk into the street and yell: does anyone know of a good farmer?
Yeah, I mean, the mutual suspicion about who's using AI is really pervasive, especially online, but also in person.
But yeah, I do think that the AI training paradigm, some of the stuff that you were talking about with Anthropic, but also just the way that Google incorporates all this stuff, really does kind of break the deal with the whole idea of the public web.
Like, all right, we'll all just do this stuff in public.
We'll talk to each other.
People will build all these businesses around this to sort of connect everything and it'll all sort of work together and whatever.
When you have like these massive sort of predatory companies just consuming all of that, harvesting all of that and saying, all right, we are no longer part of this arrangement.
We are doing something else.
More people are on Discord.
More people are in group chats.
More people are either just purely consuming on social networks and not posting or just talking privately with their friends.
And I do think that this fits quite well with that trend and probably accelerates it.
John Herrman, you can read and subscribe to New York Magazine at nymag.com.
Gabrielle Berbey produced, Amina Al-Sadi edited, Rebeca Ibarra fact-checked, and Patrick Boyd and Andrea Kristinsdottir mixed.
And by the way, Vox's Future Perfect is funded in part by the BEMC Foundation, whose major funder was also an early investor in Anthropic, and none of them have any editorial input into the stuff we make here at Vox.
Speaking of stuff, we hope you enjoyed the 1700th episode.
If you did, you can say something nice about us most anywhere you listen.
And if you didn't, well,
there's always episode 1701 tomorrow.
With a Spark Cash Plus card from Capital One, you earn unlimited 2% cash back on every purchase.
And you get big purchasing power so your business can spend more and earn more.
Capital One, what's in your wallet?
Find out more at capitalone.com/sparkcashplus.
Terms apply.