This is high enough that there should be a market to compensate the end users who created this content.
I'm astonished that a picture turns out to be worth a thousand words.
Users of these sites have had license agreements and privacy policies for a long time, and freely gave away their content just because free web hosting was worth it. Why would they be entitled to anything more now that this content has found new value?
What does the long game look like for raw training data? How will AIs maintain the quality of their diet?
To compare, web search started — in the early days of Google — as a huge win because so much valuable information that was scattered around became findable. But over time it has become whac-a-mole with spam and AI copypasta, and now it's a struggle to keep returning good results, for any search engine.
These AI systems are being built on top of the collective effort and resulting knowledge of all of humanity. We can pretend they are just another private enterprise, or we can acknowledge that they are something more than that.
And it's not just the productivity we could achieve by democratizing these systems. There's another danger. When big companies buy up all this intellectual property, what better choice do they have than to lock it up? At least until recently you could argue that IP rights owners were, as entities, incentivized to proliferate this knowledge; now the opposite is happening.
Or another twist: pay people to submit ten years of emails (upload the backup file), or just pay small amounts for works they’ve made. College essays, journals, etc.
I don't think college essays, etc., would contain anything novel. Future techniques could interpolate more smoothly, creating ever-new wordmud.
1) Based on how pre-verbal children learn, one nitpick is that I strongly suspect we need to give AI touch and a sense of space in order to truly understand quantity, causality, object permanence, etc.
2) Something that is not a nitpick: even a superhuman multimodal AI wouldn't have direct access to human emotions, sexuality, ideas of natural beauty, etc. I don't think humans have run out of interesting things to say about these ideas.
(In particular, I don't think a superhuman AI is capable of understanding music unless it is directly emulating the biological processes by which humans understand music. The issue is not "logical" - melodies don't actually make sense analytically.)
That's quite a proposition.
They're very clear it's going into an AI-generated article on the topic, but you better believe that is also now core training data.
Gödel probably consumed a minuscule fraction of what these systems have seen. And look what he came up with!
We already see that if you want to focus on a narrow skillset you can use a much smaller model and training set. But right now it is a race because everyone wants to be the one true generalized intelligence model.
E.g. how often does a baby go out and experience something novel? The majority of its time is spent getting the same stimulus over and over again, as anyone listening to children's television can attest.
Humans learn in fundamentally different ways from our current systems, and information poverty is not a problem for us.
In general AI researchers have done a very bad job exploring how a system might be "near-human" according to some fancy linguistic benchmark, yet dramatically dumber than a pigeon in terms of general reasoning abilities.
However, the first complex nervous systems came about in the Cambrian explosion, only about half a billion years ago. And we also don’t train LLMs by random mutation and selection, it’s a much more teleological process.
But to extend the analogy, we should be able to train a model continuously, and not have to start training from scratch for each new model. Although, maybe, that would require random mutations, and thus much more time?
GenAI is just doing the same thing on a larger scale.
Same thing with emulators and ROMs. Somebody dumped the cartridges (copyrighted software) into ROM files to be played on emulators (running copyrighted BIOSes), but they were "archiving", and if you owned the original copy you could download them. I still vividly remember seeing a disclaimer on a warez website: "DMCA SAFE HARBOUR NOTICE: YOU MUST OWN THE ORIGINAL GAME OTHERWISE ITS ILLEGAL BUT YES, YOU CAN DOWNLOAD EVERY SINGLE GAME MADE ON THAT CONSOLE FOR FREE"
I feel like the same outcome awaits LLMs trained on copyrighted material. It will be "training". The net benefit is too great to fret over "training".
tldr: "indexing" ---> "archiving" ---> "training"
Disclaimer: Long on $DGX
>Photobucket declined to identify its prospective buyers, citing commercial confidentiality.
>tech companies are also quietly paying for content locked behind paywalls and login screens, giving rise to a hidden trade in everything from chat logs to long forgotten personal photos from faded social media apps
In this market, ethics seem to exist when it comes to corporate clients, but not when it comes to end-users.
It's self-evident that no end user in 2007 consented to photos of their 2007-era teenage self being used to train an AI how to identify an emo kid.
What's morally bankrupt about that? It costs money to host your photos and they're a business that can decide to charge their customers any rate they think the market will accept.
I can think of worse things than that which might be hidden away for public scraping.
A lot of people just were not paying attention to the game being played, and so now they're getting played themselves.
Unfortunately, they did actually. It's more accurate to say that they were presented a EULA and Terms of Service that no reasonable teenager would have had any hope of understanding. But since they're over 13, they're held to the terms of those agreements in any case.
These companies are slimy. Make no mistake, this will get worse in the future.
Would it be attractive for a company like Twilio or Aircall to offer free phone calls and sell anonymized recordings?
Remember a decade or so ago, you could call a 1-800 number and look up phone numbers using your voice? It was backed by Google and once Google was done collecting the data, they shut it down.
But if it was part of the terms of the new free service, and all the parties involved got a reminder message on the call… you might still not like it, but it doesn’t seem like it would be a violation of privacy
While true, it's META who won that arms race long ago in my view; hell, a lawsuit just disclosed that they gave Netflix private access to DMs [0].
If you don't think they are training their own models on this data across all their platforms, you have to be a complete idiot: Facebook, Instagram, WhatsApp.
That is a much larger treasure trove given the sheer scale of people on those platforms. Google is limited mainly to Android users and those who use its suite on PC (relatively small compared to social media users), which excludes most Mac users.
The thing they don't tell you about this dark underbelly of AI: just like the (meta)data that is for sale to 3rd parties, it has a tiered price structure wherein Mac users are often the premium tier due to their more 'affluent' status and likelihood of impulsive in-app purchases.
This is why I think META already won the AI race: they open-source Llama and have a massive treasure trove of data to refine and train on when they see what the OSS community creates that is of actual value. ChatGPT/DALL-E runs at a loss for MS/OpenAI. But if anyone can monetize this gold rush, it will be META.
And perhaps more critically from an infrastructure POV, Llama now runs better on CPU [1] than on GPU, which means they won't be constrained or price-pinched on GPUs like Microsoft, Google, and Amazon likely will be due to demand constraints from Nvidia (see the ETH mining craze during COVID). They can focus on optimizing their data centers with more free cash flow, which means they can have a bigger footprint for when they finally figure out how to properly monetize this AI bubble, because it is a bubble, from now until then.
I think Zuck learned from Libra that staying out of the limelight during a bubble is critical if he wants to undo the Metaverse money-pit/losses.
0: https://www.movieguide.org/news-articles/facebook-allowed-ne...
https://www.appmysite.com/blog/android-vs-ios-mobile-operati...
Random link. Can't vouch for it. But US and RoW have quite different patterns.
https://www.usnews.com/news/top-news/articles/2024-04-05/ins...
Because no one will sell them an exclusive license to the data.
The companies selling this data are slimy. They're borderline crimelords. Picture a pirate captain with a hostage that he is ransoming. Now imagine he gets his ransom, but before he releases the hostage he makes a copy of her. Then ransoms the copy to another interested party. But before he releases the copy, he makes another copy and... you get the idea.
It's pirate thinking.
"If one hostage is good? Then two are better! And three? Well, that's just good business!!!" -Hondo Ohnaka
The future is having your own personal AI assistant, completely free of charge, which is suspiciously eager to recommend shopping at Temu and eating at McDonalds.
Even if a future service doesn't have an obvious charge or subscription, just because you don't recognize how you're being exploited doesn't mean it's truly "free."
There's a reason advertising exists as an industry at all, let alone a global trillion-dollar one. Today's "free" is actually paid for by exploiting user attention and attempting to hack your brain--sometimes in ways that are culturally accepted due to long tradition of use, sometimes in new disturbing ones.
That said, AI tech is or is quickly becoming freely accessible; unless they have a USP, free / homemade versions will end up competing with the paid services, and it's hard to compete with free.