June 2023 Data Dump is missing

(meta.stackexchange.com)

556 points | JasonPunyon | 2y ago | 257 comments

This, along with recent Reddit goings-on, has made me realize a major risk with the current structure of online communication. Take either Reddit or Stack Exchange as examples. They build a platform, and users contribute their time, thought, energy, and knowledge to build a community on that platform. Those companies can then gatekeep and restrict access to all that the community built, when all they did is provide the platform, and store the data. We need to rethink this model.

The thought and knowledge of communities and users need to belong to those communities and users. To people they intentionally and thoughtfully delegate to and trust. We need to decentralize our communications, like how the internet used to be before the arrival of social media and mega forums. We need to revert to small, focused forums, with less anonymous, more persistent communication, run by people we trust. Otherwise, we will continue to see mega companies harvest our data and use it (or not provide it) against our wishes. If we don’t work to mitigate that dynamic, we have nobody to blame for the poor outcomes but ourselves.

b3morales 2y ago

This was one of the promises originally of Stack Overflow: all the content is Creative Commons licensed so that if they "turned evil" (I believe it was Joel that put it this way) the community could, in a way, create a fork. https://web.archive.org/web/20230203170609/https://stackover...

Unfortunately the dumps themselves are not a legal requirement, just a gentleman's agreement, so realistically exercising this ability was still at the whim of the company.

redbell 2y ago

> This was one of the promises originally of Stack Overflow: all the content is Creative Commons licensed

This reminds me of the promise OpenAI was built on. Unfortunately, it turned out to be too bold a claim to be honored, and too good to be true [0]

0. https://news.ycombinator.com/item?id=34979981

isoprophlex 2y ago

Maybe just a gentlemen's agreement, but a nice canary too. Once the dumps stop, it's time to start waving middle fingers and GTFO.

sshine 2y ago

I stopped answering questions when Monica got sacked as a moderator:

https://meta.stackoverflow.com/questions/393046/who-or-what-...

To me, this was the canary. Just another psychopath megacorp.

theragra 2y ago

I always wonder why original founders just sell the company and do something else. Why don't they try to control it more and make sure it stays aligned with needs of society more? Either they can't because of shareholder/equity owners pressure, or they won't, because they really don't care and just said it for PR

jzb 2y ago

I certainly wonder about the "do something else" in the sense of serial entrepreneurs. If I could cash in once I'd be done. If I had enough money to retire on, I would. Run a cat shelter or something.

But the actual answer here is probably a combo of a few things: One, running a company is probably not as much fun as building a company. Much of my career has been "pioneer" roles where nobody else has done the job before. At a certain point, the foundation is laid and the problems to solve are different and often less interesting -- at least to me. It's the build vs. maintain thing.

Two, they started with good and noble intentions. Money got involved. A lot of money got involved. The noble intentions were replaced with reality.

Three, have you met users? As a site grows you have to deal with more and more people and people can be very demanding and not very appreciative. Coupled with the previous factors, I think original founders get burnt out and decide to take the cash and move on. The allure of building anew is too much, the grind of maintenance is too much, and the cash is too good to pass up.

Also four... there's a peak for any site. You often don't know when or how, but you do know that someday your site's maximum value, interest, participation, and all that is going to peak and then decline. Sticking around to fight the good fight may just mean passing up a payday and being left with a declining property nobody wants anymore.

towawy 2y ago

…or they might have determined that they'd rather spend their time on something else.

Keeping control is a (mostly time) commitment and liability. You have to stay on top of things and actively decide on issues that inevitably come up.

Diesel555 2y ago

> Either they can't because of shareholder/equity owners pressure, or they won't, because they really don't care and just said it for PR

That is assuming the worst in people. Have you ever wanted to move on to something new? If you make something cool, it is not your lifelong obligation to oversee it.

strictnein 2y ago

The original founders sold the site for $1.8 billion.

blihp 2y ago

Because, despite claims to the contrary, most of these sites/projects aren't created for altruistic reasons; they are created to make money (at some point). Cashing out is typically part of the long-term plan.

In the case of Stack Overflow, I think the reason for the data dumps was two-fold: one of the original founders (who left long ago) came across as at least idealistic and wanting to do the right thing. The other was pragmatic and most likely always thinking about the money angle. However, the other founder likely also saw the value of the data dumps from a PR standpoint which was quite valuable as they were initially trying to replace expertsexchange.com that paywalled most of the content. IIRC, they discussed the data dumps in the early days of their podcast.

Now that there's big money to be made from machine learning (both the models and the data they are trained on), they've likely decided 'screw it' on the PR value of the data dumps and would rather get some of that sweet, sweet machine learning money.

sgjohnson 2y ago

> I always wonder why original founders just sell the company and do something else.

They typically have millions of reasons. Sometimes billions.

juujian 2y ago

So the idea is that in case leadership wants to 'carve out a kingdom' that is not in line with community wishes, the community could take the data dump and create a clone of sorts? So right now the last snapshot for doing that would be the last data dump from March?

resolutebat 2y ago

Yes. There's moderately successful precedent: Wikivoyage is a fork of Wikitravel, which went evil after it was sold to a content farm.

wahnfrieden 2y ago

So it's time to community fork?

b3morales 2y ago

Assuming that the linked post is accurate and that the "approval from senior leadership" to turn the dump back on does not come...then yes, I would say so. Actually there is already Codidact, although if I recall correctly they explicitly ruled out importing SE data when they started up. https://codidact.org

the_pwner224 2y ago

A decentralized system will never work because 99% of users do not care at all; the centralized systems are easier to sign up for and use. It's been demonstrated over and over and over again.

Even if the underlying tech is decentralized, the community will settle around one or a few big instances (for example, Gmail and GitHub) which often end up having significant control over the trajectory of the entire ecosystem. If you run your own email server and you get put onto Google's spam list - you're fucked.

benrutter 2y ago

I don't know that I agree. I think most people don't care about decentralization but they do care about the effects it brings.

Email is a great example where most people wouldn't be interested in a version of email that only lets you email other @gmail.com users. Having an email address that can contact anyone, a phone number that can ring any other phone number etc. instead of being locked into a single corporation's network is a clear value add that people care about.

The main issue from my perspective is that we only have a select few large tech companies that operate as monopolies so are effectively able to block out new decentralized protocols from coming to be.

RCS messaging is a great example, which I think most people would use over alternatives like WhatsApp and iMessage, except that Apple refusing to support it locks a huge fraction of the market out and stops widespread adoption being possible.

I don't think it's a question of preference, or people being uninterested. It's just a boring and repeated story of corporate monopolies intentionally reducing consumer choice.

slg 2y ago

>Email is a great example where most people wouldn't be interested in a version of email that only lets you email other @gmail.com users. Having an email address that can contact anyone, a phone number that can ring any other phone number etc. instead of being locked into a single corporation's network is a clear value add that people care about.

That is only because those technologies predate those companies. Normal people don't care that you can't DM a Reddit user on Twitter or that your Instagram posts don't automatically show up on your Facebook page. People are generally fine with centralized corporate platforms as long as it isn't a restriction of a previously free technology and the network effect has done its thing to attract enough people to the platform.

grumpymouse 2y ago

I think things like being able to contact anyone are important to people, but decentralisation doesn't necessarily provide that (e.g. if I sign up on a Mastodon instance will I be able to see the messages of everyone on every Mastodon instance, and will they be able to see mine? Will I even know if somebody I care about can see my messages or not?)

I think decentralisation is not a selling point to most people. It's an implementation detail that they're happy to go along with, but it's a negative if it makes the experience worse, makes everything more complicated, if they can't talk to the people they know IRL, etc.

mananaysiempre 2y ago

> [M]ost people don't care about decentralization but they do care about the effects it brings.

I’ve tried pointing those out as bluntly as possible as an experiment. As in “well, surprise, locked-in crap with impenetrable failure modes locks you in and is impenetrable when it fails, you signed up for that”.

People didn’t appreciate it, as I expected, but they did seem to recognize the truth of it. That is, the response was along the lines of being forced to use the thing to communicate with some person or institution, not of liking it or thinking it’s not at fault.

I don’t know how one would use this to organize an IM revolt (Riot? sorry), but there does seem to be at least some fuel for it even among people who are not outright IT professionals.

> RCS messaging is a great example which I think most people would use over alternatives like WhatsApp and iMessage [...].

RCS might make a slight amount of sense in a culture that still uses SMS / texts in non-negligible amounts, but that’s basically North America and Japan AFAIU? And I prefer that territory shrink, not grow, as I’m very much not thrilled by the idea of handing back over detailed control over IM—even just billing, not content—to phone carriers. Whatever the country, they have extensive history proving they can and will screw you for decades unless you can leave, and it will take everybody leaving for them to stop.

RadiozRadioz 2y ago

I agree, but to nitpick one thing: RCS isn't properly decentralized. It's controlled by carriers in the GSMA and, with the current way the infrastructure has been deployed, Google. Interoperable on the app level, yes, but not a poster child for decentralization.

endisneigh 2y ago

> I don't think it's a question of preference, or people being uninterested. It's just a boring and repeated story of corporate monopolies intentionally reducing consumer choice.

not really. nothing at all is stopping one from starting a new social network that's federated. the issue is users have no reason to move.

it's more a question of incentives, and there's basically none to use something that you're not already using unless it's better, heavily advertised or you're simply paid to.

tedivm 2y ago

>If you run your own email server and you get put onto Google's spam list - you're fucked.

It's even worse than that: I ran my own email server, and for some reason Gmail delayed any emails from their system to outside of their system. That meant that people would send me an email but I wouldn't get it for 20 minutes. These delays don't exist when using big email providers (it stopped being a problem when I switched to Fastmail, for example) but if you're running a small server Google makes it a nightmare.

jethro_tell 2y ago

Sounds like you need to allowlist Google from your greylist. They have a long retry on their side. Once you tell a server to 'go away, come back later' it is really up to the server to decide when or if to retry. Additionally, if they use multiple sending IPs, you can end up greylisting again and again before they try back with a good IP.

You'd either need to allowlist the big providers' sending blocks or just drop greylisting altogether.
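
For anyone who hasn't run into greylisting before, here is a rough toy model of the behaviour described above, in Python. It is only a sketch, not how any particular mail server implements it: the `Greylist` class, the 300-second delay, and the allowlisted network (a documentation-only range standing in for a big provider's sending block) are all made up for illustration.

```python
import ipaddress

# Hypothetical allowlist: 192.0.2.0/24 is a documentation-only range used here
# as a stand-in for a large provider's sending block (not a real Google range).
ALLOWLISTED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
MIN_RETRY_DELAY = 300  # seconds a sender must wait before a retry is accepted (assumed value)


class Greylist:
    """Toy greylist keyed on the (client IP, sender, recipient) triplet."""

    def __init__(self, allowlist=ALLOWLISTED_NETWORKS, min_delay=MIN_RETRY_DELAY):
        self.allowlist = list(allowlist)
        self.min_delay = min_delay
        self.first_seen = {}  # triplet -> time of first delivery attempt

    def check(self, client_ip, sender, recipient, now):
        """Return "accept" or "defer" for a delivery attempt at time `now` (seconds)."""
        ip = ipaddress.ip_address(client_ip)
        # Allowlisted networks skip greylisting entirely.
        if any(ip in net for net in self.allowlist):
            return "accept"
        key = (client_ip, sender, recipient)
        first = self.first_seen.setdefault(key, now)
        # Accept only a retry of a triplet already seen at least min_delay ago.
        if now - first >= self.min_delay:
            return "accept"
        return "defer"
```

Because the key includes the client IP, a retry that arrives from a different sending IP counts as a brand-new triplet and gets deferred again, which is the repeated-delay effect mentioned above. Allowlisting the sender's whole block sidesteps that.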

OfSanguineFire 2y ago

I have run my own email server off a Linode for the last decade-plus, and I have never encountered this. Most of the people with whom I correspond (I run my own business from my server) are on Gmail, and I have always received their emails instantly. If you were getting emails only 20 minutes later, I wonder if there was some server misconfiguration on your end, e.g. sending messages into graylisting delays.

dataflow 2y ago

I assume you didn't have anything in between Gmail and your setup? Like a forwarding layer or something?

sschueller 2y ago

I disagree, it worked before and it is the reason the internet even exists.

The core issue is that user-generated data is owned by one individual company. There are existing systems that don't have this issue, e.g. Usenet or BitTorrent.

We don't need to idiot-proof the web. There are enough people to gather some place for a social network even if it's hard to use. The others can stay and will stay on Reddit anyway until one day when they also have had enough and learn to use some alternative.

Zetice 2y ago

The "value" of Reddit as a website is vastly overrated anyway. There's nothing on Reddit that can't be obtained elsewhere, folks just get stuck in patterns that are familiar and presume it's because options are limited.

The world will continue to spin without Reddit, or if Reddit isn't popular anymore, or if Reddit kicks all of its current users out, and so on.

utbabya 2y ago

IMO the internet post-2010ish is inferior to the one before in theory. Early creators were thoughtful, they created protocols foreseeing a lot of these problems. I'm not sure what's gonna happen next but the parallel universe I'd like to be in is one where the internet of the last 15 or so years was an anomaly, or a curve that rises quickly then dies off.

elondaits 2y ago

Decentralized software is not the only alternative. A non-profit site would be much better than a publicly owned one. Further, it could be operated by a co-op and democratically run, with its own “laws”.

endisneigh 2y ago

> A non-profit site would be much better than a publicly owned one. Further, it could be operated by a co-op and democratically run, with its own “laws”.

If it's better why isn't that how these sites are run? Wikipedia for example is an anomaly.

jzb 2y ago

"It's been demonstrated over and over and over again."

Well, yes. But the risks of centralized sites have also been demonstrated over and over again. Maybe, just maybe, one of these days that lesson will stick and communities will take a longer-term view. It seems like the cycle is happening a bit faster each time now, so maybe folks will get tired of the "damn, time to move to another site, again..." thing.

Or not. Convenience tends to trump lots of other considerations most of the time.

EVa5I7bHFq9mnYK 2y ago

Between Gmail and your own email server, there are thousands of medium-size email providers that work OK, so email is a bad example; decentralization still works there. As for GitHub, the coders of all people should have known better than to pile into one site and sell their s̶o̶u̶l̶s̶ code to the devil for a little bit of convenience.

ilyt 2y ago

It does because e-mail is not a social network.

You don't "join" a gmail to see social streams of posts. There is no penalty or friction to start with some small e-mail provider.

JohnFen 2y ago

> A decentralized system will never work because 99% of users do not care at all

Maybe, maybe not. It's worth a shot, though, isn't it?

erlend_sh 2y ago

Only one percent of users need to care about decentralization for it to work sufficiently well as a safety net and equalizer.

dahwolf 2y ago

It's worse. Even in case users did care about decentralization, such solutions are less reliable, not more. Subjective moderation, rule changes, rug pulls.

BeFlatXIII 2y ago

…but how necessary are those 99% for the viability of the new system? Perhaps we're better off without them.

blantonl 2y ago

>when all they did is provide the platform, and store the data.

Is there not significant innovation and benefit that was designed and implemented in the first place that caused users to contribute their time, thought and energy?

I think the real problem here is when organizations that rely on crowd-sourced business models decide they just have to be billionaires or solve all the world's problems with their platforms, instead of just staying true to their model. I don't see what's wrong with just running a highly successful business that makes money for its founders and doesn't have to go out and strive every day to be the next Facebook or Google.

Make no mistake. Platforms like Reddit and Stack Overflow are real, serious businesses. But why can't they exist and be a generally successful business like your local mom and pop restaurant or toy store or whatever?

I run RadioReference.com and Broadcastify, both of which are significant businesses but also rely almost solely on crowd-sourced data and content. We're wildly successful - but I've never seen the need to hire 3,000 people, or IPO, or do series raises to expand into solving world peace. Our premium subscription pricing has been the same for 15 years. I completely eliminated advertising on one of the platforms last year. We make a lot of money. We provide a lot of value to our communities, and we carefully innovate and expand to provide value. It's a nice happy life for everyone involved, and I don't have to deal with a VC who will be determined to either make a trillion dollars or torpedo my business.

saurik 2y ago

The core problem making it so difficult for this to ever actually happen is that it is 2023 and I guess you only just today somehow came to this realization as if it were new or unexpected or not something people had been saying for the past 25 years of us watching these online platforms abuse their positions of power and slowly turn the screws on people.

Over the past quarter of a century of people trying to create online walled gardens of hosted content we've seen this happen over and over again, and the examples are so numerous that reddit was itself a replacement for Digg and StackOverflow for Experts Exchange. And yet, somehow, today, you suddenly woke up :(.

The reality is that we live in a dystopian Eternal September where as people finally notice what is going on and leave they are just replaced by new people who don't care or simply didn't use the prior service and are attracted to the new shiny, and another 25 years from now you're going to see people making the same unapologetic "I now realize" statements.

What we need to do is figure out how to actually replicate the feeling you are having in a way that doesn't require you to have spent years on a platform and then watching it die so it can be communicated to people before they bother to use a new platform, and in a way that somehow makes them willing to collectively not experience viral lock-in.

(And we also need to figure out how to make people willing to accept doing that at some cost to themselves, whatever form that might take: people on HN continuously do the thing where they give up freedom for a little temporary convenience and then get angry at others for daring to suggest that something a bit harder to use or with any extra friction would ever be a sane thing for anyone to use :/.)

Back in 2017 I gave a talk at Mozilla Privacy Lab called "That's How You Get Dystopia" where I just documented a ton of examples of abuse of centralized power and the reality is that every few days I just come across more stuff to add to the list... and this talk doesn't even bother with all the numerous services that simply enshittified or shuttered.

https://youtu.be/vsazo-Gs7ms

akkartik 2y ago

+100

I was saying to someone yesterday that "enshittification" is a sub-optimal coinage for something that really shouldn't need a new term, and which focuses attention on symptoms rather than root causes. If you give someone a power of attorney over your assets, you'll likely find that they start behaving less well towards you. Or if you give up agency, others will treat you like less of an agent. But what matters is not their behavior at the end but your decisions before that point.

jacquesm 2y ago

Ever since Gracenote/CDDB it was pretty clear that this is the model. Still pissed off about that.

arp242 2y ago

> all they did is provide the platform, and store the data.

You're seriously underestimating the effort it took to build that platform and how much effort it continues to take to keep it running well. I'm not talking about technical challenges, but social ones. It took a long time for them to get the system and incentives right (and it's still not quite right, IMHO), and it takes continued effort to keep it running well in the form of moderation and stopping abuse (and here it also doesn't quite get things right).

I could bang out a "BufferUnderrun.com" in a few months; many people could. But that's not the hard part.

Dalewyn 2y ago

>We need to rethink this model.

Once upon a time, most people who wanted or had something to say wrote their own little website and hosted it themselves (be it in a datacenter or a server in their closet). Some even ran forums and got fancy with server-side magic because that's what nerds do. Even the kids who couldn't afford anything had free, basic hosting services to choose from (anyone remember those days?).

The internet was designed as a distributed network and the denizens then were distributed. You only got as centralized as a given ISP or datacenter provider.

Of course, we all know as more and more commoners came onto the internet they didn't want to bother with developing or hosting or maintaining a website or anything. They just wanted to shitpost, for free, with blackjack and hookers.

And so "free" services like Reddit, Facebook, et al. came about to serve that demand. Information became centralized, because who the fuck has time to be responsible? Offload that crap!

The cost of that offloading of responsibility has now come knocking with debt collectors in tow, with interest.

I guess what I'm trying to say is: We don't need to rethink anything. We just need to take some god damn responsibility for ourselves. Responsibility is power, and with power you can tell commercial interests you disagree with to screw off.

jodrellblank 2y ago

The problem is information hoarding. If you imagine going to a pub meeting people regularly, and the pub records everything you ever said, and then one day the pub owner says they're going to charge you for the recordings, you'd laugh. Nobody would pay. If they tried to charge you to enter, you'd go to another pub, you wouldn't lament the "loss of your culture".

In fact, people don't record what they talk about in pubs because the point is the chat experience not the records of previous chats. Data isn't oil and it isn't quite sewage, it's more like quicksand or thickets of weeds growing and tangling around your feet. Like minimalists say 'stuff is bad' but stuff is useful, it's having stuff hidden in cupboards and drawers and a garage full of stuff and wanting a bigger house to hold more stuff and most of the stuff going unused because you can't bring yourself to let go of it, and companies advertising that more and newer stuff will make your life better and solve your problems, which is the biggest problem with stuff. Sufficientism might be a more appropriate name - enough stuff to make your life better and no more.

Enough chat to make your life better isn't "all of it kept forever".

ethanbond 2y ago

Really? The takeaway isn’t rather that rent-seeking AI models need to figure out a way to reimburse companies and communities who’ve stored up all this capital?

Seems to me SO built and delivered huge, huge amounts of value and it’s now all at risk because multibillion dollar companies are free riding.

danShumway 2y ago

Users on SO created value and freely shared it with a community in expectation that the value they created would be freely and collectively shared with everyone. In SO's case this expectation was explicit; the data backup and API were billed as a deliberate choice designed to give users the freedom to migrate and scrape data in case the company went "evil." It was designed specifically to reduce SO's ownership claim over user-generated content.

It's not that SO has a moral right to control and profit from that content. The reality is that SO holding that content at all is a conditionally granted privilege that the community affords the site, and it is a privilege that was always designed to be revocable and the data moveable if SO started abusing its position of power as a host and trying to lock down access.

Some writing/content sites have taken steps to restrict AI access based specifically on community request. That's a very different situation; if a community (particularly a closed or close-knit community) is collectively and (mostly) uniformly trying to avoid an AI scraping the content that they created, then good for them. There are communities online that are in that position. But "how will the company get reimbursed for our valuable asset" should not be part of that conversation. And SO in particular was set up around norms that deliberately allowed this kind of scraping. It's not their asset to protect.

> rent-seeking AI models

I have issues with modern AI economic models too but I don't think that "rent-seeking" is an accurate term to use. A better word would probably be "parasitic"; I understand (and somewhat agree with) the argument that OpenAI is looking to repackage information it didn't create in a way that redirects attention away from the original source of information.

But I'm having a really hard time figuring out how OpenAI is hoarding a scarce asset to extract value by controlling access to that asset. The more obvious rent seeking behavior here is coming from SO, a company trying to restrict access to Creative Commons licensed content created for free by unpaid volunteers, and trying to reclassify that content as their corporate property.

I guess being as charitable as possible, I do worry about the SaaS model of many AIs that are dedicated to content generation, and I worry a little bit about AI models becoming heavily integrated into creative processes and then extracting a kind of monetary "creative tax" from artists/creators while heavily restricting what they are allowed to make. That's at least adjacent to rent seeking, but I'm still not sure it's the term I would use and I'm not convinced it's a scenario that's applicable here.

ethanbond 2y ago

Thank you for the really thoughtful response!

Good point that rent-seeking is maybe not the correct term now, but it looks increasingly like services will have to lock down content or shut down due to AI models frontrunning them with their own content. In that world, the AI models are in a great rent-seeking position (i.e. only they have the [old] content which was broadly available and now is not, due to their own incentive distortion).

In any case I buy your argument with regard to SO stewardship of this data and certainly my intuitions were that the major contributors are not super thrilled about their content being digested by models and spit out with no attribution, but that is absolutely an assumption on my part.

Would be interested to see a poll of those users on this question!

ImPostingOnHN 2y ago

SO delivered some value, the users are the ones who delivered huge, huge amounts of value

ethanbond 2y ago

Right, which SO (and not some other site) managed to entice.

Somehow users didn’t flock to my ethansusefulprogrammerquestionandanswers.com ¯\_(ツ)_/¯

croes 2y ago

And companies like OpenAI will take the profit and even kill some of the users' jobs.

briffle 2y ago

I can't count the number of 'crypto-bros' who told me web 3.0 was coming, to solve these problems (and apparently, any problem you could think of)

mcdonje 2y ago

I see where you're coming from calling out AI data miners for rent-seeking, but most social media platforms are also engaging in rent-seeking behavior.

einpoklum 2y ago

> The thought and knowledge of communities and users need to belong to those communities and users.

I would say: Need to be a public resource, belonging to no-one, i.e. no person, or group, or company should have legitimacy in denying access to it. They should all be considered _trustees_ of such a resource.

> when all they did is provide the platform, and store the data.

To be fair, SE Inc. did a lot more than provide the platform. A lot of development and design work, publication, a bit of the curation work, etc. I don't like how they behave but let's give them what they're due.

---

Also note the ongoing Moderator Strike (!): https://meta.stackexchange.com/q/389811/196834

pc_edwin 2y ago

I believe we'll have more of these oh s** moments soon when people will finally realise why we need web3. Yes the whole space was full of scammers, charlatans but the technology and point was to create a substrate for networks on the internet.

The idea that these networks and communities need to run on centralised servers is archaic. The technology exists where people should be able to own their own network (followers, subs, following, posts).

danShumway 2y ago

Let's be honest though, the primary thing that attracted most powerful people to web3 wasn't decentralization, it was reintroduction of artificial scarcity into digital spaces. Web3 billed itself as empowering users, but it always had an undercurrent of commodification and gatekeeping.

And that's exactly why (ignoring the scammers or pump-and-dump businesses) it saw such heavy investment from VC/tech types. The promise they were interested in wasn't democratization even if that's what they told their users -- what they were interested in was taking a plentiful resource (digital bits) and building a scarce asset that they could use to further entrench exclusivity, status, and monopolistic control over what that asset represented.

Read back over every sales pitch for web3 games. At some point they always devolve into talking about how ordinary users will be able to rent seek: to "license" characters/weapons/gear and passively earn income from other players, or to hoard exclusive tokens/releases in the game and speculate on their future value. Web3 looked at infinite digital spaces and its response was, "infinity is a problem that we need to solve." And it's revealing to look at most web3 branded metaverse attempts and see just how quickly they reintroduced real-world concepts like housing/space scarcity (why on earth would we want a housing market in a digital space with no physical constraints?), and how quickly they leaned into cosmetics and customization as a monetization strategy rather than a user right to free expression.

In general, if a technological "paradigm" is primarily associated with and primarily popular with VC firms, it's probably not being developed with the user in mind.

On the other hand: federation, interoperability, mobile identities, and legal efforts to build a right to data export existed independently of web3 and have shown a lot more promise when it comes to actually increasing user agency.

pc_edwin2y ago

Again I'm in agreement with everything you are saying here including the part with "other efforts".

I just think we shouldn't throw the baby out with the bathwater. Just because the space got ravaged by zero interest rates, VCs, scammers, charlatans, snake oil salesmen and even worse, it doesn't mean the technology and the premise were wrong.

Having a ledger that is secured by decentralised consensus is not only useful but will be a necessity for the digital first future we are heading towards.

We are reaching the limits of the current paradigm. We see companies like Meta having to be the arbiter of consensus. We've seen platforms showing their true faces and commoditising people's data and networks, which weren't really theirs to begin with.

acdha2y ago

web3 failed because it was a rebranding exercise by cryptocurrency holders trying to create new demand for their random numbers.

Centralized servers aren’t archaic, they’re a natural outcome of how social systems work: finding communities is hard; people want to contribute their ideas, not play sysadmin; spammers and AI researchers will create enormous costs for you; etc. If you federate, you will spend more time dealing with those issues than a single focused competitor and you are unlikely to see free contributions which outweigh those costs.

Everything you mentioned is available now on Mastodon, and it’s really interesting to see how that works. Some people love having a small network of their friends, but a lot of people have trouble finding people they want to follow. Instances can have their own rules but dealing with abuse is now a multiparty process and since a lot of instances are run by volunteers that can be slow, unreliable, and inconsistent. Some small servers get hammered by storage and bandwidth demand but there’s no great path to monetization unless you have a ton of users willing to pay more than most people are used to paying for internet services.

In general, these are social problems and there is only so much technology can do to improve them.

pc_edwin2y ago

I tend to agree but where we diverge is in the thinking that these problems cannot be solved with technology, I believe the opposite.

The point of web3 is to abstract away things like sysadmin by commoditising consensus. Once a blockchain gains mass & momentum it opens up a whole new world of possibilities to hack/reinvent social media.

You could use multiple different types of social media and still maintain a single identity (auth). This means you could find friends and friendlies everywhere you go.

The key is the consensus layer and the ability to store and read critical metadata. I'll give a personal example:

I've been helping a friend create a combination of Netflix and DVDs. We package the movie licenses into NFTs (MovieKeys), so when users sign in with their wallet they can stream all the movies they own.

There are so many possibilities with this, but let's focus on the social part. In theory, a social media service could scan its users' wallets for MovieKeys and create a social graph based on that. Heck, you could create entire forums just out of the people who own a certain MovieKey. I won't go further because we've got a lot of things cooking atm.

The general point is, the technology and the UX to make these things possible is just an arm's length away. The entire space got ravaged by scam artists instead of trying to build real magical experiences people actually want.

jjav2y ago

> I believe we'll have more of these oh s* moments soon when people will finally realise why we need web3.

Well no, what we need is web0, the original premise of the Internet.

Every protocol was documented in open RFCs, everything is decentralized and everyone is free to use any client and server (or write their own) and everything interoperates. Nobody can own it, there's no "it" to own. That's the only solution to eliminate the otherwise neverending cycle of proprietary platforms followed by their inevitable "oh s* moment".

pc_edwin2y ago

The world we live in today is very different from the world web0 was created for; it's older than me.

Don't get me wrong, standardised protocols play a very important role in the current world and will play a bigger role in the future of the internet (Scuttlebutt, IPFS, Matrix..).

It's just not enough; we need a decentralised way to provide consensus on the internet. People won't set up their own servers, and the companies that provide these services will always look like a Pareto distribution (FAANG or MAMAA).

In other words, FAANG, or whatever the next incarnation of these companies is, will always be incentivised to get rid of interoperability. Selfish reasons aside, after a certain point interop will directly stand in the way of providing better user experiences.

This is why web3 is such an elegant system. It provides a substrate that directly incentivises interoperability. Auth and payments are taken care of, the only thing remaining is custom features, but that gets rapidly commoditised, and then the only frontier remaining is interoperability.

A good example of this is the NFT marketplaces: auth is just connecting your wallet. Payments are taken care of. Then you build cool features, but everybody else copies you and you copy them, so that's a stalemate. Then you have to be interoperable, like OpenSea going multi-chain or Magic Eden supporting Eth.

The key here is, the moment someone else supports interoperability, if you don't, you are put at a large disadvantage. The same kind of dynamics will happen to decentralised social media platforms.

sumtechguy2y ago

I got burned on this sort of thing for cddb. Hundreds of discs entered in. Suddenly that data was someone elses and they charged for it.

pc_edwin2y ago

Yeah, most decentralised cloud companies are very silly.

ineptech2y ago

> We need to rethink this model.

This problem is inherent to client/server software, and there are really only three ways to do it:

1. The server side of client/server is centralized and run by corporations

2. The server side is decentralized, meaning everyone has their own server

3. Abandon the server, clients connect directly to each other without a server intermediating

Option 3 would be ideal, but would require significant technological advances - it'll be a looong time before bandwidth is cheap enough that Kim Kardashian can serve photos and movies to all of her fans direct from her phone. Option 1 is what we have now, and is terrible in a variety of ways.

Option 2 would be hard but is not obviously impossible, so it's still our best bet - sure, it's not viable now, but it sure seems like it could be, if an iPhone's worth of R&D were put into it. I would honestly be amazed if no one at Amazon is working on such a thing, since no one would benefit more than AWS from a future in which a cloud VM becomes one of the things that most middle-class families rent monthly.

capableweb2y ago

Content-addressing together with P2P and extra paid relays for those who really need it. In terms of "superstars" sharing content, if they share their image which is content-addressed and can be fetched from anyone, it's enough that one peer shares it with three others for it to be reliable enough in practice. Content like that is also usually just relevant because of recency, so large swaths of people try to access it within 24h, after that the news cycle already moved on so won't be fetched much after that.
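To make the content-addressing idea above concrete, here is a minimal sketch, not any particular P2P system's protocol. It assumes the address of a blob is simply its SHA-256 digest; `Peer`, `publish`, `serve`, and `fetch` are hypothetical names invented for the illustration. The point is that the requester can verify any copy against the address, so it does not matter which peer supplies the bytes.

```python
import hashlib
from typing import Dict, List, Optional


def content_address(data: bytes) -> str:
    """Return the address of a blob: the hex SHA-256 digest of its bytes."""
    return hashlib.sha256(data).hexdigest()


class Peer:
    """A hypothetical peer that stores blobs keyed by their content address."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blobs: Dict[str, bytes] = {}

    def publish(self, data: bytes) -> str:
        """Store a blob locally and return the address others can ask for."""
        address = content_address(data)
        self.blobs[address] = data
        return address

    def serve(self, address: str) -> Optional[bytes]:
        """Return whatever this peer holds under the address, if anything."""
        return self.blobs.get(address)


def fetch(address: str, peers: List[Peer]) -> bytes:
    """Ask peers in turn and accept the first copy whose hash matches."""
    for peer in peers:
        data = peer.serve(address)
        if data is not None and content_address(data) == address:
            return data
    raise LookupError(f"no peer could serve a valid copy of {address}")


origin = Peer("origin")
mirror = Peer("mirror")
liar = Peer("liar")

photo = b"pretend these bytes are a photo"
addr = origin.publish(photo)
mirror.blobs[addr] = photo       # a fan re-shares the same bytes
liar.blobs[addr] = b"tampered"   # a bad peer serves something else

# The origin is not consulted at all; any honest holder of the bytes will do.
fetched = fetch(addr, [liar, mirror])
```

Because the tampered copy fails the hash check, the fetch falls through to the mirror, which is why a handful of re-sharing peers can stand in for the original publisher.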

21432y ago

> We need to revert to small, focused forums, with less anonymous, more persistent communication, run by people we trust.

You're onto something. Team-BHP [1] is run exactly like this, and it seems to be working.

For those wondering, it's a car-enthusiasts website based in India. They've been around for 18-odd years, I think.

The moderators all have actual dayjobs.

When signing up you have to write a paragraph about why you're really a petrolhead (or dieselhead because Indians love European turbo-diesels :) ), and there's a human on the other end vetting your sign-up application! Plenty, including me, have been rejected at least once. I got in on my 2nd attempt years later.

As a matter of principle they refuse to do car advertisements.

I don't know how well the site is engineered but it works. Check it out. But I suspect most non-Indians (such as most people on HN) wouldn't find it that useful as it's mostly about the Indian car scene.

[1] https://www.team-bhp.com/forum/

wefarrell2y ago

I'm more concerned for authors of published works.

Imagine writing a text book with a royalty publishing deal. Your publisher decides they're going to use your book, amongst others, to train an LLM that can answer questions on your subject, and they're not going to pay you anything.

It's a legal gray area and they've got teams of lawyers whereas you do not.

jjav2y ago

> realize a major risk with the current structure of online communication

I wish this lesson could be learned once and for all.

A long-lived community/repository cannot be built on a proprietary platform owned by some corporation. Full stop, no exception. It can't be done. A corporation will at some point need to maximize profit extraction which will ruin it for everyone. A corporation also won't support a platform forever nor can the entity itself survive forever. A single point of failure can't last forever.

> We need to decentralize our communications

Look at the solutions which have lasted longest. Email & mailing lists, going strong since the 1970s. Completely decentralized, interoperability defined by open standard protocols, anyone can build interoperable clients and servers. Nobody owns it. There's no "it" to own. That is what's needed for long term viability.

briffle2y ago

I'd argue it's not just forums, but other key parts of the internet. Like Microsoft training Copilot on GitHub code, but not following the licensing of some code they straight up copy and suggest.

I'm kind of curious what is next.

xen2xen12y ago

The risk of centralized systems was discussed long ago. The Cathedral and the Bazaar was published in 1999. None of these ideas are new. Everyone who paid any attention knew it was coming.

jupp0r2y ago

Incentives in this structure go both ways though, ideally keeping everything in symbiotic balance. Companies that alienate their users tend to not do well shortly after.

nightski2y ago

I like the idea of decentralized but I'd suggest you don't need to go fully decentralized where every peer has a full copy. Actually I kind of like the Bitcoin approach in that you have the ability to create a full peer, but most people do not. This would allow some decentralization and reduce risk, but not burden everybody with running a full peer.

dylan6042y ago

How did newsgroups work in the days of yore? Depending on which ISP you used, you may or may not have all of the posts within a group if they even had the group. I remember paying for access to specific (can't remember the name) provider that had the most complete listing of newsgroups and had the longest retention of posts. viva a.b.m.a!!

shagie2y ago

https://en.wikipedia.org/wiki/Network_News_Transfer_Protocol

Originally over UUCP (Unix-to-Unix Copy) and done via dial-ups at night (when the rest of the batch transfers were done - email too, with the old bang path). The two servers would exchange all the batched email and news posts that were routed to the other side.

RFC 977 ( https://www.w3.org/Protocols/rfc977/rfc977 ) has an example of how files are copied between the two systems (section 4.6) including fetching and receiving mail.

Note that not all outbound posts are necessarily of interest to the other server. An IHAVE message could come back with either an "I want it" type response or a "not interested" one.

> The IHAVE command informs the server that the client has an article whose id is <messageid>. If the server desires a copy of that article, it will return a response instructing the client to send the entire article. If the server does not want the article (if, for example, the server already has a copy of it), a response indicating that the article is not wanted will be returned.

That's how some of the moderation worked - your server would say "I don't want anything that came by way of X host" or "not interested in that newsgroup."
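As a rough illustration of that exchange, here is a small sketch of the receiving side. The four response codes are the ones RFC 977 lists for IHAVE; everything else (`NewsServer`, its fields, and the policy of rejecting by Path host or by newsgroup) is a hypothetical simplification, not how any real news server was implemented. Note that IHAVE itself carries only the message-id, so in this sketch the host and group checks happen once the article's headers have arrived.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Response codes given for the IHAVE command in RFC 977.
SEND_ARTICLE = "335 send article to be transferred"
NOT_WANTED = "435 article not wanted - do not send it"
TRANSFERRED_OK = "235 article transferred ok"
REJECTED = "437 article rejected - do not try again"


@dataclass
class NewsServer:
    """Hypothetical receiving server; the policy fields are illustrative."""
    known_ids: Set[str] = field(default_factory=set)
    carried_groups: Set[str] = field(default_factory=set)
    blocked_hosts: Set[str] = field(default_factory=set)

    def ihave(self, message_id: str) -> str:
        """Answer an IHAVE offer: only the message-id is known at this point."""
        if message_id in self.known_ids:
            return NOT_WANTED
        return SEND_ARTICLE

    def receive(self, message_id: str, newsgroups: List[str], path: List[str]) -> str:
        """Apply local policy to the article's Newsgroups and Path headers."""
        if any(host in self.blocked_hosts for host in path):
            return REJECTED
        if not any(group in self.carried_groups for group in newsgroups):
            return REJECTED
        self.known_ids.add(message_id)
        return TRANSFERRED_OK


server = NewsServer(carried_groups={"comp.lang.c"}, blocked_hosts={"spamhost"})

first_offer = server.ihave("<1@a.example>")
accepted = server.receive("<1@a.example>", ["comp.lang.c"], ["hosta", "hostb"])
second_offer = server.ihave("<1@a.example>")
via_blocked = server.receive("<2@b.example>", ["comp.lang.c"], ["spamhost", "hosta"])
```

The second offer of the same message-id is declined because the server already holds a copy, and the article that travelled through the blocked host is rejected after transfer.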

One of the amusing things to me (looking back at this), if you're familiar with HTTP response codes, you'll likely get most of the way through the NNTP ones.

   200 server ready - posting allowed
   400 service discontinued
   411 no such news group
   500 command not recognized

I'd also suggest a read of RFC 850 ( https://www.w3.org/Protocols/rfc850/rfc850.html ) for some other background and section 5: The News Propagation Algorithm

dylan6042y ago

But how did all of this line up within the federated or not conversation? If each ISP could host their own version, that doesn't sound federated. But who was in control of the "main" source of truth type of version?

ilyt2y ago

That doesn't stop a grumpy admin of a federated instance from doing exactly the same thing.

> run by people we trust

People change, or retire, just like corporate goals change.

Focusing on more independence is not enough. If you want truly unbreakable stuff, the first part of the puzzle is saving the user's handle and identity in a way that can't be removed.

Then finding a way to link that to their content, so that when the place hosting it goes away, people can follow to the new place.

Then just have all of that content be signed by that identity so users can verify that it is really that person.

And I can't believe I'm saying that unironically but blockchain might just be the solution for that.

Something like immutable log of:

* user declaring "I'm jeff@example.com, here are my public keys". Servers then validate via DNS record or some .well-known location entry whether user is allowed to declare they are from @example.com

* user declaring "behold! jeff@example.com stuff is <here>, and <here> and <here> are addresses for various federation systems". Only passes if that request is signed with above privkey of course

* user declaring "behold! My new public key is X and Y. And Z key is revoked!"

* user declaring "behold! I am now george.effluent@company.com!". Re-does checks but for new domain, and users previously subscribed to jeff@example.com get served redirect.

etc.

Then when server admin inevitably goes rogue you can take your posts and subscribers and go somewhere else.

And when the @example.com owner decides "well I'm just gonna redirect stuff to ads", you can just change your handle and direct people to the right place, and the other handle is forever taken.
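A toy version of that log, to show the moving parts: this is only a sketch under heavy assumptions. It is a single in-memory, hash-chained list rather than a blockchain, there is no consensus, and the signature and DNS/.well-known checks described above are left out entirely. `IdentityLog` and its method names are made up for the illustration.

```python
import hashlib
import json
from typing import Dict, List, Set


class IdentityLog:
    """Append-only, hash-chained log of identity events (illustrative only)."""

    def __init__(self) -> None:
        self.entries: List[dict] = []
        self.keys: Dict[str, Set[str]] = {}   # handle -> active public keys
        self.redirects: Dict[str, str] = {}   # retired handle -> new handle

    def _append(self, event: dict) -> None:
        # Each entry commits to the previous one, so history cannot be
        # rewritten without changing every later hash.
        prev = self.entries[-1]["hash"] if self.entries else ""
        body = json.dumps(event, sort_keys=True)
        digest = hashlib.sha256((prev + body).encode()).hexdigest()
        self.entries.append({"event": event, "prev": prev, "hash": digest})

    def _check_free(self, handle: str) -> None:
        # A handle that was ever used, even a retired one, stays taken.
        if handle in self.keys or handle in self.redirects:
            raise ValueError(f"{handle} is already taken")

    def declare(self, handle: str, public_keys: List[str]) -> None:
        self._check_free(handle)
        self.keys[handle] = set(public_keys)
        self._append({"type": "declare", "handle": handle, "keys": sorted(public_keys)})

    def rotate(self, handle: str, add: List[str], revoke: List[str]) -> None:
        self.keys[handle] = (self.keys[handle] | set(add)) - set(revoke)
        self._append({"type": "rotate", "handle": handle,
                      "add": sorted(add), "revoke": sorted(revoke)})

    def rename(self, old: str, new: str) -> None:
        self._check_free(new)
        self.keys[new] = self.keys.pop(old)
        self.redirects[old] = new
        self._append({"type": "rename", "old": old, "new": new})

    def resolve(self, handle: str) -> str:
        """Follow redirects so old subscribers land on the current handle."""
        while handle in self.redirects:
            handle = self.redirects[handle]
        return handle

    def verify_chain(self) -> bool:
        prev = ""
        for entry in self.entries:
            body = json.dumps(entry["event"], sort_keys=True)
            expected = hashlib.sha256((prev + body).encode()).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True


log = IdentityLog()
log.declare("jeff@example.com", ["key-Z"])
log.rotate("jeff@example.com", add=["key-X", "key-Y"], revoke=["key-Z"])
log.rename("jeff@example.com", "george.effluent@company.com")
current = log.resolve("jeff@example.com")
```

A real system would additionally require each event to be signed by a currently active key for the handle, which is the part that lets other servers trust the log without trusting whoever hosts it.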

scarface_742y ago

> when all they did is provide the platform, and store the data

And all Google did was build a search engine.

erlend_sh2y ago

Agreed, and working on it: https://blog.erlend.sh/assembling-community-os

kiba2y ago

Perhaps these sorts of things shouldn't be for-profit enterprises, given the inability of companies to not slaughter the goose that lays the golden eggs.

phailhaus2y ago

The problem is defining "these sorts of things". StackOverflow didn't do anything evil, they created a useful website and people flocked to it voluntarily.

rektide2y ago

The world keeps going dark. What a terrible era.

karim792y ago

What irks me about this is that 100% of their data is provided for free, by the community that they have fostered, the people like myself who have answered > 2500 questions[0], and now SO feels hard-done-by by LLMs using all their hard work to create tools like CodeGPT, GitHub copilot, etc.

Were it really a site for helping developers to improve their skills and increase their productivity through the give-and-take model that SO was, at least once upon a time, SO should perhaps take a deep breath and realise that this might not change a thing apart from causing their contributors to feel like they were never part of it in the first place.

I'm not sure if I've correctly articulated that, but I do find SO's stance to be quite revealing. It feels to me like they're crying foul that ChatGPT and however many other systems out there are stealing their revenue. None of the contributors (apart from the employee ones, I suppose) ever got paid any currency other than high-fives in the form of rep, medals, the gamified stuff, moderation rights, and at certain rep levels some swag in the form of t-shirts and the usual.

I never wanted any money from SO, but the revelation of this attitude has left me feeling, well, a little sad to say the least.

[0]https://stackoverflow.com/users/70393/karim79

abetusk2y ago

I think I see.

The economies of the internet are changing. Now with LLMs being accessible at an exponentially cheaper rate, we're seeing old models crumble and new models rising.

The era of moderated user content is changing drastically and the stalwarts of social networking, or adjacent, services are closing ranks to try and anticipate the change.

Thanks for the insight. I had a vague notion that these new policies were because of some recession or some other basic economic issue. I think a better theory is that the lowering economic cost of LLMs that are becoming available are the reason for all these changes.

karim792y ago

Absolutely, and very well put. The costs are pretty close to zero at this point for something that would have been mostly considered, publicly at least, as science fiction just a year ago. It's what the printing press was to the scribe, only way, way more disruptive. I don't know if I'm correct here, but that is what it looks like right now and we're basically still at year zero. I mean, it even got the Google all riled up. The company which basically equates to "The Internet" is having to react, I'm pretty sure that's the first time I've seen such a thing in my life. It goes way beyond their reaction to the iPhone, way beyond the threat of MS exercising their monopoly(ish) power back in the day to try to mitigate the Google threat.

History repeats itself albeit in a fascinating way which I'm still trying to grasp.

I don't blame SO. I think they are acting rationally and as anyone would facing such a threat.

Secretly, I want the internet ad economy of nothing to go away. I won't mention any names (cough taboola cough cough) but that might be the only upside to this tech. Let's see what happens six months or so from now.

Side note, I'm running llama on a really crappy old server and that was enough to convince me that I'll be able to run an LLM on my watch in the near future.

hyperbovine2y ago

Fair enough, but you were fully aware of this arrangement going in, and chose to participate. SO didn't opt into being training data for ChatGPT, and I doubt they would have given the chance. You may object that SO implicitly did so by making their site available to the public, but the ethics of GAI training data are a new moral gray area that we're still navigating. They at least have something of a case to be made.

karim792y ago

> Fair enough, but you were fully aware of this arrangement going in, and chose to participate.

Yep, but that's not the point.

> SO didn't opt into being training data for ChatGPT, and I doubt they would have given the chance.

Neither did Wikipedia (at least to my knowledge). I thought the point of opening up information was to benefit the public, first and foremost, and without hidden terms which state something along the lines of "it's free and open information built by the community, but when something disrupts our ads-driven business model and we make it unfree".

It would have been nice if they had at least allowed their contributors to vote on this, or have some sort of a say.

baliex2y ago

> "it's free and open information built by the community, but when something disrupts our ads-driven business model and we make it unfree"

I feel the same, that it would be much better today if that was the agreement we all entered into. (Though, I doubt anyone would have read it anyway so I’m not sure it would have made any difference.)

But it seems that no one saw this disruption coming, so it wasn’t possible to plan ahead for this outcome. Call it complacency, call it ignorance, it’s too late to plan now, so we get this kind of reaction instead.

ryan292y ago

> None of the contributors (apart from the employee ones, I suppose) ever got paid any currency other than high-fives in the form of rep, medals, the gamified stuff, moderation rights, and at certain rep levels some swag in the form of t-shirts and the usual.

I would love to see some kind of identity and reputation system where the "high-fives in the form of rep" could follow people across communities. It may not feel like much compensation if you've contributed over 2500 answers, but having reputation gained in your area of expertise grant you a high level of trust to interact in other communities could be valuable, at least in my opinion.

Assuming they're making this move to protect against AI / LLMs, I think SO is in an impossible situation here. When all the ChatGPT hype started, one of my first questions was "what happens to the incentive for contributors and creators?" Why would I want to contribute on a platform if I know an AI model is going to come in, take my contribution, and regurgitate it back to the masses in a way that I can't control?

Even if I get some attribution from the AI/LLM, do I even want it? If the LLM is blending content from multiple sources, which changes the context and presentation I put effort into, is the quality going to be high enough to match what I strive to achieve for myself when I'm trying to build a reputation as a high quality contributor? What if the AI is hallucinating objectively poor quality content and giving me partial attribution?

So, for me, part of the social contract with SO is that I provide answers, but I get to control the entire interaction; the context, the presentation (mark up), defending criticism in the comments, etc.. In addition to that, since the entire conversation happens inline, I can be corrected by someone even more knowledgeable than me and use that feedback for self improvement.

I think AI is going to be disruptive and the whole idea, for me anyway, behind disruption is that you break an existing system and then everyone is free to take a shot at claiming part of the new gold rush that occurs while trying to build the replacement. The problem with AI is that it's going to break a lot of services that do a good job of serving the community and shouldn't be broken. SO is a great example of a healthy community that doesn't need disruption, but the massive amount of high quality, curated content is going to make them a prime target for LLM training.

Personally I think the only solution is for "noai" variants of popular open source licenses so contributors have the ability to make it clear they don't want to contribute to AI/LLM companies. If SO had an option to flag contributions as CC-BY-SA-NOAI, I'd enable it on my stuff going forward.

karim792y ago

> I would love to see some kind of identity and reputation system where the "high-fives in the form of rep" could follow people across communities. It may not feel like much compensation if you've contributed over 2500 answers, but having reputation gained in your area of expertise grant you a high level of trust to interact in other communities could be valuable, at least in my opinion.

Honestly I think that's an excellent idea - a rep "passport" of sorts which gains you a certain level of trust within certain communities.

> Assuming they're making this move to protect against AI / LLMs, I think SO is in an impossible situation here. When all the ChatGPT hype started, one of my first questions was "what happens to the incentive for contributors and creators?" Why would I want to contribute on a platform if I know an AI model is going to come in, take my contribution, and regurgitate it back to the masses in a way that I can't control?

Sadly, I think this is an unpreventable outcome of what is happening right now. I don't think anyone will have any control over this, at all. We can only hope it will never be the case that being active (actual human contributors) becomes a worthless pursuit.

> Even if I get some attribution from the AI/LLM, do I even want it? If the LLM is blending content from multiple sources, which changes the context and presentation I put effort into, is the quality going to be high enough to match what I strive to achieve for myself when I'm trying to build a reputation as a high quality contributor? What if the AI is hallucinating objectively poor quality content and giving me partial attribution?

Another excellent point, the prospect of this being possible today - AI attribution from a hallucinated version of a human's objective contribution sounds freaking terrifying to me. Not a world I want to live in, to be honest.

> I think AI is going to be disruptive and the whole idea, for me anyway, behind disruption is that you break an existing system and then everyone is free to take a shot at claiming part of the new gold rush that occurs while trying to build the replacement. The problem with AI is that it's going to break a lot of services that do a good job of serving the community and shouldn't be broken. SO is a great example of a healthy community that doesn't need disruption, but the massive amount of high quality, curated content is going to make them a prime target for LLM training.

As will every single human-created/curated content-source, IMHO. I think that "quality" will be really, really hard to objectively measure in the near future as the whole world of digital information becomes tainted with applied statistical models which can do a reasonably good job of predicting what people perceive to be high-quality reasoning, answers, content. I like the idea of underground speakeasies where there's no wifi, just humans.

> Personally I think the only solution is for "noai" variants of popular open source licenses so contributors have the ability to make it clear they don't want to contribute to AI/LLM companies. If SO had an option to flag contributions as CC-BY-SA-NOAI, I'd enable it on my stuff going forward.

That would be great, but I'm pretty sure that no LLM corporation would care about those flags, even with strict regulations in place from governments.

ryan292y ago

> I think that "quality" will be really, really hard to objectively measure in the near future as the whole world of digital information becomes tainted with applied statistical models which can do a reasonably good job of predicting what people perceive to be high-quality reasoning, answers, content.

That's the scariest thing I've heard today. Lol.

Even now, I think the proper use of grammar and spelling alongside assertive language has a lot of people fooled into thinking LLMs are actually intelligent. It's hard to explain to people how the LLMs know everything and understand nothing.

I've been bullish on the idea of using domains as identity for a long time. I think by using them as a universal ID you could build reputation and trust across the internet and that helps everyone a lot when trying to assess the reliability of information. If you add in attestations for factual info it gets even more interesting. Ex: GitHub attests user @john.example.com has 1000 commits to the XYZ project. Suddenly you have a more reliable way of ranking John's comments about XYZ as a topic, regardless of where they show up (as long as those identities are validated somehow).
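To sketch what attestation-based ranking might look like, here is a minimal example. All of it is assumption: the `Attestation` record, the idea of a per-reader set of trusted issuers, and the scoring rule (summing attested weights) are invented for illustration, and the step of validating that an attestation really came from its issuer is skipped.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Attestation:
    """A hypothetical claim by one domain about a domain-based identity."""
    issuer: str    # who makes the claim, e.g. "github.com"
    subject: str   # who it is about, e.g. "john.example.com"
    topic: str     # what it concerns, e.g. "XYZ"
    weight: int    # e.g. number of commits attested


def topic_score(subject: str, topic: str,
                attestations: List[Attestation],
                trusted_issuers: Set[str]) -> int:
    """Sum the weights of attestations from issuers the reader trusts."""
    return sum(
        a.weight
        for a in attestations
        if a.subject == subject and a.topic == topic and a.issuer in trusted_issuers
    )


attestations = [
    Attestation("github.com", "john.example.com", "XYZ", 1000),
    # A self-issued claim counts for nothing unless the reader trusts the issuer.
    Attestation("mallory.example", "mallory.example", "XYZ", 5000),
]
trusted = {"github.com"}

authors = ["mallory.example", "john.example.com"]
ranked = sorted(
    authors,
    key=lambda who: topic_score(who, "XYZ", attestations, trusted),
    reverse=True,
)
```

Keeping the trusted-issuer set on the reader's side, rather than in one central registry, is one way such a scheme could stay decentralized.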

If you look at that as "ranking people" and judge it in the context of being a valuable piece of input for LLMs/AIs, the big push for "better" identity systems like "Passwordless" start to look like a hell of a coincidence. My cynical side wonders if we'll see a push for validated (via government ID) identity systems. Something as simple as a "real human from Canada" tag would provide immense value for AI training (and marketing).

No matter what, I think AI is going to cause changes in the way online identity and reputation work. I think if it evolves into some kind of system with domains as identity it'll be decentralized and provide long term benefit. I think if we see something with verified IDs controlled by the current big tech companies it could devolve into something disappointing or even detrimental for the average user.

bioemerl2y ago

It's unfortunate we are seeing all of these data platforms get locked off, because this is not going to affect AI development from big companies, it's only going to affect the ability for individuals to run AI development of any form in their home.

I hope the data that has been found so far is going to be big enough going forward, but it's incredibly unfortunate that this is happening.

I hope all the people making these decisions wake up with a bad headache and severe heartburn tomorrow.

CoastalCoder2y ago

IANAL, but I'm curious:

Suppose that deep-pocketed AI companies were paying Reddit, Stack Overflow, etc. to make it harder for other AI upstarts to access those data. I.e., to build a mote by denying competitors access to previously accessible data sets.

Would that violate antitrust laws in various major markets?

semi-extrinsic2y ago

Nitpick, also because the contrast is kind of funny:

mote: a small particle, speck, atom, "mote of dust"

moat: a deep ditch, often filled with water, as a first line of defence around a castle.

btown2y ago

Hopefully this comment won't be demoated by the algorithm - it truly holds water on its own!

bioemerl2y ago

Given that this seems to happen all the time without antitrust issues it probably wouldn't, even though I feel like it should.

What we need is a legal way for companies to keep the data open, but also require OpenAI and friends to pay them for it.

josephcsible2y ago

> What we need is a legal way for companies to keep the data open, but also require OpenAI and friends to pay them for it.

Couldn't that be accomplished by a law or ruling that using something for training AI doesn't exempt you from having to follow its license? OpenAI is already in blatant violation of both the "BY" and "SA" parts of the existing license.

endisneigh2y ago

> What we need is a legal way for companies to keep the data open, but also require OpenAI and friends to pay them for it.

inherently not possible as then it would not be "open" to begin with.

grumbel2y ago

> It's unfortunate we are seeing all of these data platforms get locked off

Are there any AGPL-like licenses that address this?

klooney2y ago

Oof. This was one of the big central tenets of SO, the reason it wasn't Experts Exchange 2.0- the escrow of the community's contributions.

RandallBrown2y ago

If you knew one simple trick all the answers on Experts Exchange were at least freely available.

That trick was to simply scroll past the paywall. They had all the answers exposed so that google would index them. It was hilarious and silly.

marcosdumay2y ago

Back in the time when Google didn't play favorites on companies not following their terms of service.

abetusk2y ago

As a reminder, all the SE sites have content under a Creative Commons, By Attribution, Share Alike license, allowing for, among other things, commercial re-use [0] [1].

Yes, it sucks that the SE sites are getting more draconian about allowing access to their content, but that content is well insulated against disappearing completely, precisely because it's under a libre/free license. Note that neither Reddit [2] nor HN, I might add [3], has any such licensing terms that allow for commercial reuse.

Decentralization might be a viable option in the future, but for right now, centralized sites are the norm and the way to protect the content from disappearing is to put it under libre/free licensing. Note that Wikipedia is centralized and it would certainly be a tragedy if they became more draconian about sharing their data, but the content itself is and will be available to the general public, effectively the "commons", because of the licensing terms.

To me, this is yet another reminder of why we need to future proof with libre/free/open licensing terms. Or reform copyright, but I don't see that happening within my lifetime.

[0] https://stackoverflow.com/legal/terms-of-service/public#lice...

[1] https://creativecommons.org/licenses/by-sa/4.0/

[2] https://www.redditinc.com/policies/developer-terms#text-cont...

[3] https://www.ycombinator.com/legal/#tou

belter2y ago

The change to 4.0 was done without permission according to many in the community.

"Stack Exchange doesn't have the right to unilaterally change the license of previously submitted content." - https://meta.stackexchange.com/questions/333089/stack-exchan...

arp2422y ago

Older posts are under the older CC 3.0 license, newer posts under the CC 4.0.

https://meta.stackexchange.com/questions/344491/an-update-on...

ignoramous2y ago

Should petition Daniel Gackle et al to CC-BY user-generated content on HN.

usr11062y ago

I haven't studied the legalese, but I assume if they put all answers behind a paywall tomorrow nothing can be done. I don't think the license says they must share.

abetusk2y ago

The intent is to prevent exactly this type of possibility. They may be able to put it behind a paywall and copyright future work but not work that's already been published.

Obviously not legal precedent, but there is some discussion on the matter by the Creative Commons organization [0].

[0] https://creativecommons.org/faq/#what-happens-if-the-author-...

expertentipp2y ago

Everyone wants to be "smart" by web scraping, harvesting data, building models. No one bothers to build and sustain platforms where quality content can be crowd sourced. This parasitic arrangement is slowly starting a new era of the internet. The question is how long until existing data dumps become outdated and fall into irrelevance.

TX81Z2y ago

We just needed enough data to awaken the mega mind, now we may rest and the mega mind shall bring an era of peace, prosperity, and scientific achievement.

Praise the mega mind.

KirillPanov2y ago

All hail the mega mind.

keyle2y ago

Yikes. Reddit. Stack overflow. It's all going south.

Maybe we won't even have to wait for LLMs to destroy the web we used to know.

ori_b2y ago

This is LLMs destroying the web we used to know.

I would be willing to bet that the driving force behind the decision was to make it less trivial for LLMs to say "the data was already there under an open license, so we legally undercut stack overflow".

llm_nerd2y ago

The fact that everyone is hoarding data because they think there is a gold rush afoot is obvious. Everyone with loads of data is clamping down, hoping they can get a cut of those AI VC dollars. Except for Wikipedia at least.

But let's be real about the morality here: Stack Overflow is a badge-powered mechanical Turk. It uses 100% unpaid labor to go and search Google for answers and post them on SO, providing a "service"[1]. For it to moralize about the ownership or sanctity of data is irony.

[1] - There are exceptions, obviously. There are true experts who wander the virtual halls of StackOverflow and dole out wisdom. But overwhelmingly it is clear that answers primarily come from people who rush to Google and then copy/paste from blogs and tech papers. And while Stack Overflow dumps are CC because that's the agreement that it made with contributors, a lot of the content on the site was ripped without attribution and in defiance of IP. So...maybe not too many tears for SO.

ryan292y ago

> There are exceptions, obviously. There are true experts who wander the virtual halls of StackOverflow and dole out wisdom.

This is what makes SO valuable for me. Every year it seems like "trust but verify" requires a little more emphasis on the verify part, so it's extremely valuable when I run into a true expert where I know I can take their input at face value and rely on it.

I think the winners of the AI gold rush will be the ones that figure out how to help users assess the trust they should be putting in the information being surfaced. The problem I see with the current ChatGPT, etc. systems is that they seem to treat visibility and popularity as credibility and expertise when that's not the case. In my experience, the real experts don't say anything unless they can add to the conversation and there's never a lack of low quality information from uninformed participants.

For example, via ChatGPT...

Q: How do I protect against the scrub of death with ZFS?

A: The term "scrub of death" is commonly used to refer to a situation in ZFS where a data scrub operation can lead to catastrophic data loss due to undetected errors or issues. While ZFS is designed to provide data integrity and protection, it is still important to take certain precautions to minimize the risk of encountering such a situation. Here are some steps you can take to protect against the scrub of death with ZFS:

Except it's not a real thing [1]. For bonus fun, read the comment right after Ahrens'.

I don't understand why SO doesn't lean into that. Anyone can train an LLM on the raw data, but SO has the information needed to do a better job of ranking the quality of the inputs, so wouldn't they be able to build an LLM that's significantly better than anyone else with the same raw data? Understanding the quality and reliability of an answer is far more important to me than getting an answer.

What's more frustrating than getting an answer on a programming question and taking hours to figure out that it was complete BS and doesn't work as described?

I don't know much about LLMs, but, if I were SO, I'd be figuring out how to lock down the ranking information as quickly as possible because that's where the value is. The ranking and acceptance of answers, alongside tags, overall user rank, participation frequency, etc. should mean that SO has a significant advantage when it comes to ranking and weighting the input data, right?
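A rough sketch of what "weighting the input data" by ranking signals could look like. The attribute names (`PostTypeId`, `Score`, `AcceptedAnswerId`) follow the public data dump's `Posts.xml` rows as I understand them; the sample rows are made up, and the weighting formula (log of the score, doubled for accepted answers) is purely an assumption for illustration, not anything SO is known to use.

```python
import math
import xml.etree.ElementTree as ET

# Tiny inline sample in the style of Posts.xml rows; all values are made up.
SAMPLE = """<posts>
  <row Id="1" PostTypeId="1" Score="10" AcceptedAnswerId="3" />
  <row Id="2" PostTypeId="2" ParentId="1" Score="2" />
  <row Id="3" PostTypeId="2" ParentId="1" Score="40" />
  <row Id="4" PostTypeId="2" ParentId="1" Score="-3" />
</posts>"""


def answer_weights(xml_text, accepted_bonus=2.0):
    """Return {answer_id: weight} using a hypothetical weighting scheme."""
    rows = ET.fromstring(xml_text).findall("row")
    # Questions point at their accepted answer via AcceptedAnswerId.
    accepted = {r.get("AcceptedAnswerId") for r in rows if r.get("AcceptedAnswerId")}
    weights = {}
    for r in rows:
        if r.get("PostTypeId") != "2":  # "2" marks an answer row
            continue
        score = int(r.get("Score", "0"))
        # Dampen large scores; negative scores contribute nothing.
        weight = math.log1p(max(score, 0))
        if r.get("Id") in accepted:
            weight *= accepted_bonus
        weights[r.get("Id")] = weight
    return weights


WEIGHTS = answer_weights(SAMPLE)
```

Anyone with the dump can compute something like this, so the edge would have to come from signals that are not in the dump (or from tuning the weights better than everyone else).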

I want the input from subject matter experts to count the most and SO has the best data set to provide that. I don't see the point of locking down the content when the real value is in the ranking. It's odd that SO doesn't see that considering the entire network is modelled on that idea. Maybe they do and there are bigger changes coming down the pipe.

I think the real debates are going to come in the future if SO releases a paid LLM product that's trained on community contributed content and rankings.

1. https://arstechnica.com/civis/threads/ars-walkthrough-using-...

nunobrito2y ago

Time to adopt Nostr as future-proof path.

zhte4152y ago

Without movement on this [1] I can't see adoption.

[1] https://github.com/nostr-protocol/nostr/issues/97

klabb32y ago

So, so much decentralized tech never gets adoption due to a lack of an identity management layer that nobody wants to build because it can’t be perfectly decentralized and have the account recovery features that 99% of regular folks need. This is an example where perfect is the enemy, nemesis even, of good.

Someone should build an identity system that is optionally centralized or federated (if you like your key custody, you can keep it), migratable and that ONLY handles identity. That will still be orders of magnitude better than relying on Google, Twitter and friends, simply because there won’t be a glaring conflict of interest of platform rent-seeking.

Moreover, anyone who wants to build decentralized/federated apps doesn’t have to reinvent the wheel poorly. It’s so sad to see project after project fading into the ether because people can’t fucking sign in in a reasonable way.

At least with crypto currency, there’s a somewhat strong argument for individual key custody, but I’m not talking about protecting $20M while on the run from the feds, I’m talking about afternoon shitposting with friends and strangers.

1 more reply

nunobrito2y ago

That's your own personal opinion, which is contrary to the growth metrics.

Nostr community is about true freedom to write just about anything.

nerdo2y ago

How does nostr handle determined activists with establishment backing? What if a nostr user posts information an activist wants censored and the activist goes around threatening relays, their hosting providers, anyone connected to people operating the relay?

nunobrito2y ago

There is a quintessential feature that differentiates nostr from any other social network in the past decades: Your private key proves that you wrote a specific text.

Assuming the scenario where a user is basically chased away from major western relays, he can still continue writing new texts with his private key. As long as some relay located somewhere (e.g. China, Panama, Moon, ..) accepts his texts, then others will still be able to read and know it was from that specific person.

There are other ways to censor a person/nostr: 1) Block the whole traffic related to nostr relays at provider level inside a country. 2) Make it illegal to use nostr since it is "unregulated communication media". 3) Spam the network with hideous/horrible content, and then market the protocol as "darknet" only used by criminals or mentally ill people.

Any of these tactics are used often. The thing about nostr is that texts don't live just on relays and anyone can easily archive them. This means that history by specific users can be kept and safeguarded for the future. That is mostly the reason why I like it so much. We only know detailed history thanks to the records that survived until our days. Any closed platform eventually closes down their data (e.g. reddit, stackoverflow, twitter, etc) but in practice this is the same as denying access to our collective online history. Nostr will survive woketivism or any other *isms.
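To make the "anyone can archive and verify" part concrete, here is a minimal sketch of how an event id is derived, based on my reading of NIP-01: the id is the SHA-256 of a compact JSON array `[0, pubkey, created_at, kind, tags, content]`. The public key below is a placeholder, not a real key, and the actual signature step (Schnorr over secp256k1) is left out because it needs a third-party library.

```python
import hashlib
import json


def nostr_event_id(pubkey_hex, created_at, kind, tags, content):
    """Hash the NIP-01 style serialization of an event (signature not included)."""
    payload = [0, pubkey_hex, created_at, kind, tags, content]
    # Compact JSON, no extra whitespace, non-ASCII characters kept verbatim.
    serialized = json.dumps(payload, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


PUBKEY = "ab" * 32  # placeholder 32-byte hex string, NOT a real public key

EVENT_ID = nostr_event_id(PUBKEY, 1686700000, 1, [], "hello from any relay")
TAMPERED_ID = nostr_event_id(PUBKEY, 1686700000, 1, [], "hello from any relay!")
```

Because the id commits to the author's public key and the exact content, and the signature is made over that id, an archived copy can be checked by anyone without trusting the relay it came from. Changing a single character gives a different id, so the original signature no longer matches.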

jrnichols2y ago

Something else will come up, until the endless quest for advertising revenue catches up and ruins that as well.

8organicbits2y ago

Really strange comment.

> I was recently impacted by the Company's layoff.

> I'm offering what I can to uphold the Company's values of Transparency & being Community-centric.

I wouldn't offer transparency about a former employer's internal operations. Let them respond or at least ping a current employee to respond.

arp2422y ago

There may be an NDA involved. And staying on good terms with previous employers (or at least not burning any bridges) is generally a good idea regardless.

dzaima2y ago

I'm reading it as the ex-employee thinking that the company might not want to respond, and choosing to do so despite that, on the grounds that it should be acceptable to do so (i.e. the ex-employee couldn't be publicly "scolded" for it without the company publicly displaying not following their values)

shagie2y ago

For any curious, the original announcement of the data dump - https://stackoverflow.blog/2009/06/04/stack-overflow-creativ...

dylan6042y ago

"Just sorta stating the obvious here, but the timing of this is unbelievably terrible; I actually can't fathom a worse time for this call to be made than in light of this week. –zcoop98"

Or, it's exactly the best time to do it. Doing it now allows your news to get blended in with the Reddit news. Doing it later after Reddit chatter settles down means all of the chatter is directed squarely at you.

irowe2y ago

I don't think they are referring to the Reddit news, I think they are referring to the ongoing disagreement between the SO Inc staff and volunteer moderators and resulting strike over the new policy on generated content: https://meta.stackexchange.com/questions/389811/moderation-s...

marcosdumay2y ago

It also means people are more motivated to build a replacement than just by the timewasting reddit being unavailable.

sitkack2y ago

Replace them both with a model somewhat like Wikipedia, open the content for the world, and get a cut of profits from the corporations that want to use the data to train on it.

marcosdumay2y ago

After I made that comment, I stopped for a short time to think what I would do. My conclusion was that it is much better done with something like Mastodon (or even Mastodon itself) than with a web site.

1 more reply

senko2y ago

I only hope this and the Reddit slowmo-trainwreck-in-progress sensitise more people to the value of the data they contribute and how it is appropriated by the platforms.

nightfly2y ago

Each contribution, and most individual contributors, are worthless though. They only have value in aggregate.

dpedu2y ago

> I mention the timing, as this change long pre-dated the current moderator strike and related policy changes.

A mod strike? I hadn't heard about this.

https://meta.stackexchange.com/questions/389811/moderation-s...

mdaniel2y ago

the thread: https://news.ycombinator.com/item?id=36192497

yeldarb2y ago

Sad, I had a lot of fun with it making StackRoboflow[1] (This Question Does Not Exist) a few years ago.

The models (AWD-LSTM and GPT-2) weren't good enough back then to usefully answer programming questions -- but it's super cool to see that vision realized with GPT-4 and other modern LLMs.

[1] https://stackroboflow.com

drubio2y ago

Yesterday's data dumps/APIs fostered community, new market/channel discoveries & low risk acquisitions.

Today's data dumps/APIs foster easier access to train ML/AI models to put them on the path to irrelevance. They're pulling out all stops like there's no tmw, and there might not be, if they're willing to shake things up like this.

animatethrow2y ago

Stack Overflow and Reddit want money for AIs to train on their data which is why they made these changes, so which companies are next? Could HN get crappier in order to milk AI money for its valuable comments? I guess Wikipedia at least can't do jack to get AI cash for its valuable data.

jmyeet2y ago

This data dump was part of the compact between users (who created the content) and the platform (which hosts it). The data dump was insurance against the company going the CDDB/Gracenote, Experts Exchange or Quora route and either paywalling or even just gating that content. We don't need a repeat of that.

If the data dump is gone, that compact is broken and honestly it's time to stop contributing to SO.

jupp0r2y ago

Wild guess: somebody came up with a business plan to monetize all that data for future LLM usage.

nologic012y ago

twitter, reddit, stack overflow... the digital version of burning the library of alexandria

it was always a broken system built on dodgy contracts, but it is still sad to see how unceremoniously everything implodes

will any lessons be learned? unlikely.

jstarfish2y ago

All of our institutions are headed by the likes of Caligula, Nero and Elagabalus so it's only ever a matter of time before the charlatans in charge set it on fire themselves. Never count on anything lasting longer than a year. Motivations can change overnight.

With one exception, there are no instances of anything crowdsourced/community-supported that aren't later paywalled, gatekept or destroyed to prevent exfiltration. It's always an advance-fee scheme. The longer the duration of time, the more the terms are corrupted until the people expecting delivery on the original promise end up being told "what promise?" (The exception is piracy sites. Ironically the illegal nature of the activity seems to keep the owners honest.)

Never work for free, for any promise of long-term future payout, "exposure," or any other bullshit. When they fuck you over--and they will, because you made it so easy--you'll be too broke (and broken) to sue. Every inch, every day you give them is just more time for them to find ways to cheat you.

(You'll learn this lesson the hardest way in making concessions to a high-conflict ex-spouse armed with a 50/50 child custody agreement...they get you to agree to let the kid stay with them during your scheduled time, more and more, until they can prove the kid is basically with them 100% of the time-- then you get slapped with a vastly-increased child support order. You can't claw anything back because they have commitments now. Thus, you get cheated out of both your relationship and your money.)

matsemann2y ago

The answer mentions a layoff. I haven't caught wind of that. What happened?

JasonPunyonOP2y ago

https://stackoverflow.blog/2023/05/10/a-message-from-prashan...

nolok2y ago

Stackoverflow has over 500 employees ?!

floydian102y ago

I don't know why people are so often surprised about the number of employees in a company. My company has half the number of employees, we're not remotely as relevant as SO.

Kiro2y ago

Why is that surprising?

1 more reply

albertzeyer2y ago

Is this such a big problem? You could still scrape all the data, or not?

wolfgang422y ago

Yeah, this is a baffling part of this—you can still scrape (for now, I guess), if you have the time and effort to do so. Disabling the dump makes it harder only if you have e.g. a shoestring budget.

For example, my hobby search engine got started because I found out about these dumps and decided it would be an interesting challenge to try to work with them[1]. If I’d needed to build a scraper first the project would never have gotten off the ground.

[1]: https://search.feep.dev/blog/post/2021-09-04-stackexchange

Etherlord872y ago

You can scrape the data today. If they lock the access (a little bit of a false dichotomy, they could limit the access as well, but to simplify the argument), there will be nothing to scrape - you will still be able to access the old dumps, however.

albertzeyer2y ago

For those who downvote me, can you explain? I'm really curious about the answer to the question. I don't really understand what the problem is. The data stays under the same licence anyway, so scraping it shouldn't really be any issue.

dahwolf2y ago

This is an internet ecosystem issue that is simplified to thoughtless bashing of supposedly evil companies. Yes, these actions are clumsy and user-hostile but consider the big picture.

We have companies like Reddit and Stackoverflow not being profitable, despite being wildly successful in usage and internet mind-share. Neither of these companies are particularly over-staffed.

We post our "valuable" contributions there. So valuable that nobody wants to pay for it (structurally). We block ads. AI does the daylight robbery. We expect free APIs and data dumps.

Perhaps this is our wake-up call. The limitations of the "free" model and companies running at a loss for 15 years straight. It was always an anomaly.

tmvnty2y ago

Not sure if this is relevant, but the Hacker News BigQuery dataset also stopped updating since Nov 2022: https://issuetracker.google.com/issues/261579123

6346363462y ago

As a silver lining, perhaps the cash-grab, zero value-added clones will no longer clutter our google results?

cratermoon2y ago

I wonder if the execs at SO figure that OpenAI fed the CC data dump directly to ChatGPT and decided maybe they didn't want to make it quite so easy for them to do it again? Maybe they want to make OpenAI pay for it, or at least attach the license-required attribution.

bagasme2y ago

I guess this is a defensive move against being inadvertently used for ChatGPT model.

pixl972y ago

I think you mean more like a "thieves have stolen the horse! quick, close the barn door"

marginalia_nu2y ago

Cat's out of the bag already with that one.

LastTrain2y ago

There is a time when the bill comes due for any "free" service.

KingOfCoders2y ago

The friends you thought you had weren't

endisneigh2y ago

Not surprising - why would any content driven business want all of their stuff to be vacuumed up for free?

mindcrime2y ago

It's not really their stuff though. The content comes from the users. And part of the reason users are willing to contribute to SE is because of the licensing model and the fact that the data is available outside of SE. Obviously this is more important to some users than others, and probably some percentage don't care about it at all. It's hard to say what those percentages actually look like though.

IF (and it's definitely an "IF") this is an intentional and permanent change by SE management, they are fundamentally changing the basic understanding between users and SE, and they have to understand that some subset of users are likely to quit using SE in response. Again, it's hard to say how many. Maybe enough to have a material impact, or maybe not. That would be the gamble they'd be taking though.

arp2422y ago

> they are fundamentally changing the basic understanding between users and SE

Given the way they've communicated in the last 2 weeks or so, this seems pretty clear. Before we had employees engaging as real human beings all over the place, and you were talking to Jon, Tim, Robert, Shog, etc. and not "Mr. Ericson, title such-and-such, representing Stack Exchange Inc."

Now all we have is a bunch of announcements, with no discussion, engagement, or even a recognition that anything is even being read. It feels like pissing in the wind – disagreement is one thing, reasonable people can disagree, but ignoring is so much worse; it's like you're not even taken seriously.

Stack Exchange has gone through various phases (e.g. the "Jeff era" was different from the "stagnation era" that followed after he left), but the implied social contract was always that the community would offer their spare time and in return they would get a platform and some voice in how that platform is run. There have certainly been moments of friction in this relationship, but the basics of it never changed until now (not even with the whole debacle surrounding the firing of a moderator a few years back).

dylan6042y ago

Before the release of the LLMs where everyone could run it, the amount of slurping was probably manageable. Now that anyone can train an LLM and SO/SE/Reddit/etc are obvious places to go for training data, I can see where the systems would easily be overwhelmed. People contribute to SO/SE because it's a common place to go for community help. Training a for profit chatbot from the community data that wasn't provided to the chatbot by the community seems to break the spirit in which the contributions were made. I'm on the fence of the argument, but most definitely in the direction of not liking all of the model training for free.

jefftk2y ago

A lot of people were only willing to contribute to StackOverflow because of the CC licensing, trusting the knowledge wouldn't be locked up. As a business that depends on vast amounts of volunteer effort they need to balance providing a site where people are willing to contribute against making as much money as they can.

urbandw311er2y ago

I wonder how many of those contributors, if re-consulted, would sign up to having their contributions used to train a for-profit LLM though?

I certainly didn’t sweat it out helping people on SO to pay for Sam Altman’s fucking swimming pool.

whatyesaid2y ago

I mean they would just scrape it if there's no data dump. It just makes it harder for the small guys. They probably scraped and are scraping HackerNews.

Generative AI doesn't follow copyright or even explicit software licenses as we have seen in AI art with human signatures and Microsoft Copilot.

2 more replies

blihp2y ago

There was always the possibility of some sort of aggregator/other front end sitting on top of the SO data. We just didn't know exactly what a successful one would look like until relatively recently. I always limited how much I contributed based on that as likely outcome. Discontinuing the data dump is a much bigger deal to me and completely changes the value proposition of their various sites.

jefftk2y ago

For what it's worth, as someone who has put a lot of writing online, I'm not bothered by having my writing included in the training sets of these LLMs. I write because I want to share knowledge, and it isn't important whether people get the knowledge directly from me versus mediated by friends, LLMs, etc.

isoprophlex2y ago

Maybe just a gentlemen's agreement, but a nice canary too. Once the dumps stop, it's time to start waving middle fingers and GTFO.

sshine2y ago

I stopped answering questions when Monica got sacked as a moderator:

https://meta.stackoverflow.com/questions/393046/who-or-what-...

To me, this was the canary. Just another psychopath megacorp.

1 more reply

theragra2y ago

jzb2y ago

Two, they started with good and noble intentions. Money got involved. A lot of money got involved. The noble intentions were replaced with reality.

2 more replies

towawy2y ago

…or they might have determined that they'd rather spend their time on something else.

Keeping control is a (mostly time) commitment and liability. You have to stay on top of things and actively decide on issues that inadvertently come up.

Diesel5552y ago

> Either they can't because of shareholder/equity owners pressure, or they won't, because they really don't care and just said it for PR

That is assuming the worst in people. Have you ever wanted to move onto something new? If you make something cool, it is not your lifelong obligation to oversee it.

strictnein2y ago

The original founders sold the site for $1.8 Billion.

1 more reply

blihp2y ago

2 more replies

sgjohnson2y ago

> I always wonder why original founders just sell the company and do something else.

They typically have millions of reasons. Sometimes billions.

juujian2y ago

resolutebat2y ago

Yes. There's moderately successful precedent: Wikivoyage is a fork of Wikitravel, which went evil after it was sold to a content farm.

wahnfrieden2y ago

So it's time to community fork?

b3morales2y ago

2 more replies

the_pwner2242y ago

A decentralized system will never work because 99% of users do not care at all; the centralized systems are easier to sign up for and use. It's been demonstrated over and over and over again.

benrutter2y ago

I don't know that I agree. I think most people don't care about decentralization but they do care about the effects it brings.

I don't think it's a question of preference, or people being uninterested. It's just a boring and repeated story of corporate monopolies intentionally reducing consumer choice.

slg2y ago

1 more reply

grumpymouse2y ago

mananaysiempre2y ago

> [M]ost people don't care about decentralization but they do care about the effects it brings.

I don’t know how one would use this to organize an IM revolt (Riot? sorry), but there does seem to be at least some fuel for it even among people who are not outright IT professionals.

> RCS messaging is a great example which I think most people would use over alternatives like WhatsApp and iMessage [...].

RadiozRadioz2y ago

endisneigh2y ago

> I don't think it's a question of preference, or people being uninterested. It's just a boring and repeated story of corporate monopolies intentionally reducing consumer choice.

not really. nothing at all is stopping one from starting a new social network that's federated. the issue is users have no reason to move.

it's more a question of incentives, and there's basically none to use something that you're not already using unless it's better, heavily advertised or you're simply paid to.

tedivm2y ago

>If you run your own email server and you get put onto Google's spam list - you're fucked.

jethro_tell2y ago

You'd either need to allowlist the big providers' sending blocks or just drop greylisting altogether.

2 more replies

OfSanguineFire2y ago

dataflow2y ago

I assume you didn't have anything in between Gmail and your setup? Like a forwarding layer or something?

sschueller2y ago

I disagree, it worked before and it is the reason the internet even exists.

The core issue is that user generated data is owned by one individual company. There are existing systems that don't have this issue, e.g. Usenet or bittorrent.

Zetice2y ago

The world will continue to spin without Reddit, or if Reddit isn't popular anymore, or if Reddit kicks all of its current users out, and so on.

1 more reply

utbabya2y ago

elondaits2y ago

endisneigh2y ago

> A non-profit site would be much better than a publicly owned one. Further, it could be operated by a co-op and democratically run, with its own “laws”.

If it's better why isn't that how these sites are run? Wikipedia for example is an anomaly.

1 more reply

jzb2y ago

"It's been demonstrated over and over and over again."

Or not. Convenience tends to trump lots of other considerations most of the time.

EVa5I7bHFq9mnYK2y ago

ilyt2y ago

It does because e-mail is not a social network.

You don't "join" a gmail to see social streams of posts. There is no penalty or friction to start with some small e-mail provider.

JohnFen2y ago

> A decentralized system will never work because 99% of users do not care at all

Maybe, maybe not. It's worth a shot, though, isn't it?

erlend_sh2y ago

Only one percent of users need to care about decentralization for it to work sufficiently well as a safety net and equalizer.

dahwolf2y ago

It's worse. Even in case users did care about decentralization, such solutions are less reliable, not more. Subjective moderation, rule changes, rug pulls.

BeFlatXIII2y ago

…but how necessary are those 99% for the viability of the new system? Perhaps we're better off without them.

blantonl2y ago

>when all they did is provide the platform, and store the data.

Is there not significant innovation and benefit that was designed and implemented in the first place that caused users to contribute their time, thought and energy?

saurik2y ago

https://youtu.be/vsazo-Gs7ms

akkartik2y ago

+100

jacquesm2y ago

Ever since Gracenote/CDDB it was pretty clear that this is the model. Still pissed off about that.

arp2422y ago

> all they did is provide the platform, and store the data.

I could bang out a "BufferUnderrun.com" in a few months; many people could. But that's not the hard part.

Dalewyn2y ago

>We need to rethink this model.

The internet was designed as a distributed network and the denizens then were distributed. You only got as centralized as a given ISP or datacenter provider.

And so "free" services like Reddit, Facebook, et al. came about to serve that demand. Information became centralized, because who the fuck has time to be responsible? Offload that crap!

The cost of that offloading of responsibility has now come knocking with debt collectors in tow, with interest.

jodrellblank2y ago

Enough chat to make your life better isn't "all of it kept forever".

ethanbond2y ago

Really? The takeaway isn’t rather that rent-seeking AI models need to figure out a way to reimburse companies and communities who’ve stored up all this capital?

Seems to me SO built and delivered huge, huge amounts of value and it’s now all at risk because multibillion dollar companies are free riding.

danShumway2y ago

> rent-seeking AI models

ethanbond2y ago

Thank you for the really thoughtful response!

Would be interested to see a poll of those users on this question!

ImPostingOnHN2y ago

SO delivered some value, the users are the ones who delivered huge, huge amounts of value

ethanbond2y ago

Right, which SO (and not some other site) managed to entice.

Somehow users didn’t flock to my ethansusefulprogrammerquestionandanswers.com ¯\_(ツ)_/¯

croes2y ago

And companies like OpenAI will take the profit and even kill some of the users' jobs.

briffle2y ago

I can't count the number of 'crypto-bros' who told me web 3.0 was coming, to solve these problems (and apparently, any problem you could think of)

mcdonje2y ago

I see where you're coming from calling out AI data miners for rent-seeking, but most social media platforms are also engaging in rent-seeking behavior.

einpoklum2y ago

> The thought and knowledge of communities and users need to belong to those communities and users.

> when all they did is provide the platform, and store the data.

---

Also note the ongoing Moderator Strike (!): https://meta.stackexchange.com/q/389811/196834

pc_edwin2y ago

danShumway2y ago

In general, if a technological "paradigm" is primarily associated with and primarily popular with VC firms, it's probably not being developed with the user in mind.

pc_edwin2y ago

Again I'm in agreement with everything you are saying here including the part with "other efforts".

Having a ledger that is secured by decentralised consensus is not only useful but will be a necessity for the digital first future we are heading towards.

acdha2y ago

web3 failed because it was a rebranding exercise by cryptocurrency holders trying to create new demand for their random numbers.

In general, these are social problems and there is only so much technology can do to improve them.

pc_edwin2y ago

I tend to agree but where we diverge is in the thinking that these problems cannot be solved with technology, I believe the opposite.

You could use multiple different types of social media and still maintain a single identity (auth). This means you could find friends and friendlies everywhere you go.

The key is the consensus layer and the ability to store and read critical metadata. I'll give a personal example:

jjav2y ago

> I believe we'll have more of these oh s* moments soon when people will finally realise why we need web3.

Well no, what we need is web0, the original premise of the Internet.

pc_edwin2y ago

The world we live in today is very different from the world web0 was created for; it's older than me.

Don't get me wrong, standardised protocols play a very important role in the current world and they will play a bigger role in the future of the internet (Scuttlebutt, IPFS, Matrix..).

The key here is, the moment someone else supports interoperability, if you don't, you are put at a large disadvantage. The same kind of dynamics will happen to decentralised social media platforms.

sumtechguy2y ago

I got burned on this sort of thing with CDDB. Hundreds of discs entered in. Suddenly that data was someone else's and they charged for it.

pc_edwin2y ago

Yeah, most decentralised cloud companies are very silly.

ineptech2y ago

> We need to rethink this model.

This problem is inherent to client/server software, and there are really only three ways to do it:

1. The server side of client/server is centralized and run by corporations

2. The server side is decentralized, meaning everyone has their own server

3. Abandon the server, clients connect directly to each other without a server intermediating

capableweb2y ago

21432y ago

> We need to revert to small, focused forums, with less anonymous, more persistent communication, run by people we trust.

You're onto something. Team-BHP [1] is run exactly like this, and it seems to be working.

For those wondering, it's a car-enthusiasts website based in India. They've been around for around 18 odd years I think.

The moderators all have actual dayjobs.

As a matter of principle they refuse to do car advertisements.

[1] https://www.team-bhp.com/forum/

wefarrell2y ago

I'm more concerned for authors of published works.

It's a legal gray area and they've got teams of lawyers whereas you do not.

jjav2y ago

> realize a major risk with the current structure of online communication

I wish this lesson could be learned once for all.

> We need to decentralize our communications

briffle2y ago

I'd argue it's not just forums, but other key parts of the internet. Like Microsoft training Copilot on GitHub code, but not following the licensing of some code they straight up copy and suggest.

I'm kind of curious what is next.

xen2xen12y ago

The risk of centralized systems was discussed long ago. The Cathedral and the Bazaar was published in 1999. None of these ideas are new. Everyone who paid any attention knew it was coming.

jupp0r2y ago

Incentives in this structure go both ways though, ideally keeping everything in symbiotic balance. Companies that alienate their users tend to not do well shortly after.

nightski2y ago

dylan6042y ago

shagie2y ago

https://en.wikipedia.org/wiki/Network_News_Transfer_Protocol

RFC 977 ( https://www.w3.org/Protocols/rfc977/rfc977 ) has an example of how files are copied between the two systems (section 4.6) including fetching and receiving mail.

Note that not all outbound posts are necessarily of interest to the other server. An IHAVE message could come back with either an "I want it" type response or a "not interested" one.

That's how some of the moderation worked - your server would say "I don't want anything that came by way of X host" or "not interested in that newsgroup."

One of the amusing things to me (looking back at this), if you're familiar with HTTP response codes, you'll likely get most of the way through the NNTP ones.

   200 server ready - posting allowed
   400 service discontinued
   411 no such news group
   500 command not recognized

I'd also suggest a read of RFC 850 ( https://www.w3.org/Protocols/rfc850/rfc850.html ) for some other background and section 5: The News Propagation Algorithm
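
The NNTP status codes listed above group by their first digit, much as HTTP codes do. Here is a minimal Python sketch of that grouping, and of how a peer might react to the reply to an IHAVE offer. The category wording is paraphrased from RFC 977, the function names are made up for illustration, and nothing here talks to a real news server:

```python
# Illustrative only: classify NNTP status codes (RFC 977) by first digit
# and decide what to do after an IHAVE offer. Function names are hypothetical.

FIRST_DIGIT = {
    "1": "informative message",
    "2": "command ok",
    "3": "command ok so far, send the rest of it",
    "4": "command was correct, but could not be performed",
    "5": "command unimplemented, incorrect, or a serious error occurred",
}


def classify(code: int) -> str:
    """Return the RFC 977 category for a three-digit status code."""
    text = str(code)
    if len(text) != 3 or text[0] not in FIRST_DIGIT:
        raise ValueError(f"not an NNTP status code: {code}")
    return FIRST_DIGIT[text[0]]


def ihave_action(code: int) -> str:
    """Map the reply to an IHAVE <message-id> offer onto a next step."""
    if code == 335:   # send article to be transferred
        return "send"
    if code == 435:   # article not wanted - do not send it
        return "skip"
    if code == 436:   # transfer failed - try again later
        return "retry"
    if code == 437:   # article rejected - do not try again
        return "drop"
    raise ValueError(f"unexpected reply to IHAVE: {code}")


for code in (200, 400, 411, 500):
    print(code, "->", classify(code))
print("335 ->", ihave_action(335))
print("435 ->", ihave_action(435))
```

The "not interested" case is the 435 branch: the offering server simply moves on to the next article.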

dylan6042y ago

ilyt2y ago

That doesn't stop a grumpy admin of a federated instance from doing exactly the same thing.

> run by people we trust

People change, or retire, just like corporation goals change.

Focusing on more independence is not enough. If you want truly unbreakable stuff, the first part of the puzzle is saving the user's handle and identity in a way that can't be removed.

Then find a way to link that to their content, so that when the place hosting it goes away, people can follow it to the new place.

Then have all of that content signed by that identity, so users can verify that it really is that person.

And I can't believe I'm saying that unironically but blockchain might just be the solution for that.

Something like immutable log of:

etc.

Then when server admin inevitably goes rogue you can take your posts and subscribers and go somewhere else.

And when the @example.com owner decides "well, I'm just gonna redirect stuff to ads", you can just change your handle and direct people to the right place, and the other handle is forever taken.
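
A rough Python sketch of the "immutable log" idea above: each entry commits to the hash of the previous one, so rewriting history changes every later hash. The entry fields (`action`, `value`) and the example values are invented for illustration. This shows only the hash chain; proving who wrote an entry would additionally need a public-key signature on each one, which Python's standard library does not provide.

```python
import hashlib
import json


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with this entry's payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(log: list, payload: dict) -> None:
    """Append a payload, chaining it to the last entry (or to a zero hash)."""
    prev = log[-1]["hash"] if log else "0" * 64
    log.append({"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)})


def verify(log: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = "0" * 64
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


# Hypothetical entries: a handle claim followed by a move to a new host.
log = []
append(log, {"action": "claim-handle", "value": "alice@example.com"})
append(log, {"action": "move-content", "value": "https://new-host.example/alice"})
print(verify(log))

# Tampering with an old entry is detectable by anyone holding the log.
log[0]["payload"]["value"] = "mallory@example.com"
print(verify(log))
```

A blockchain is one way to get many parties to agree on such a log; the chaining itself needs nothing more than a hash function.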

scarface_742y ago

> when all they did is provide the platform, and store the data

And all Google did was build a search engine.

erlend_sh2y ago

Agreed, and working on it: https://blog.erlend.sh/assembling-community-os

kiba2y ago

Perhaps these sorts of things shouldn't be for-profit enterprises, given the inability of companies to not slaughter the goose that lays the golden eggs.

phailhaus2y ago

The problem is defining "these sorts of things". StackOverflow didn't do anything evil, they created a useful website and people flocked to it voluntarily.

rektide2y ago

The world keeps going dark. What a terrible era.

karim792y ago

I never wanted any money from SO, but the revelation of this attitude has left me feeling, well, a little sad to say the least.

[0] https://stackoverflow.com/users/70393/karim79

abetusk2y ago

I think I see.

The economies of the internet are changing. Now with LLMs being accessible at an exponentially cheaper rate, we're seeing old models crumble and new models rising.

The era of moderated user content is changing drastically and the stalwarts of social networking, or adjacent, services are closing ranks to try and anticipate the change.

karim792y ago

History repeats itself albeit in a fascinating way which I'm still trying to grasp.

I don't blame SO. I think they are acting rationally and as anyone would facing such a threat.

Side note, I'm running llama on a really crappy old server and that was enough to convince me that I'll be able to run an LLM on my watch in the near future.

hyperbovine2y ago

karim792y ago

> Fair enough, but you were fully aware of this arrangement going in, and chose to participate.

Yep, but that's not the point.

> SO didn't opt into being training data for ChatGPT, and I doubt they would have given the chance.

It would have been nice if they had at least allowed their contributors to vote on this, or have some sort of a say.

baliex2y ago

> "it's free and open information built by the community, but when something disrupts our ads-driven business model and we make it unfree"

ryan292y ago

karim792y ago

Honestly I think that's an excellent idea - a rep "passport" of sorts which gains you a certain level of trust within certain communities.

That would be great, but I'm pretty sure that no LLM corporation would care about those flags, even with strict regulations in place from governments.

ryan292y ago

That's the scariest thing I've heard today. Lol.

bioemerl2y ago

I hope the data that has been found so far is going to be big enough going forward, but it's incredibly unfortunate that this is happening.

I hope all the people making these decisions wake up with a bad headache and severe heartburn tomorrow.

CoastalCoder2y ago

IANAL, but I'm curious:

Would that violate antitrust laws in various major markets?

semi-extrinsic2y ago

Nitpick, also because the contrast is kind of funny:

mote: a small particle, speck, atom, "mote of dust"

moat: a deep ditch, often filled with water, as a first line of defence around a castle.

btown2y ago

Hopefully this comment won't be demoated by the algorithm - it truly holds water on its own!

bioemerl2y ago

Given that this seems to happen all the time without antitrust issues it probably wouldn't, even though I feel like it should.

What we need is a legal way for companies to keep the data open, but also require OpenAI and friends to pay them for it.

josephcsible2y ago

> What we need is a legal way for companies to keep the data open, but also require OpenAI and friends to pay them for it.

endisneigh2y ago

> What we need is a legal way for companies to keep the data open, but also require OpenAI and friends to pay them for it.

inherently not possible as then it would not be "open" to begin with.

grumbel2y ago

> It's unfortunate we are seeing all of these data platforms get locked off

Are there any AGPL-like licenses that address this?

klooney2y ago

Oof. This was one of the big central tenets of SO, the reason it wasn't Experts Exchange 2.0: the escrow of the community's contributions.

RandallBrown2y ago

If you knew one simple trick all the answers on Experts Exchange were at least freely available.

That trick was to simply scroll past the paywall. They had all the answers exposed so that google would index them. It was hilarious and silly.

marcosdumay2y ago

Back in the time when Google didn't play favorites on companies not following their terms of service.

abetusk2y ago

As a reminder, all the SE sites have content under a Creative Commons, By Attribution, Share Alike license, allowing for, among other things, commercial re-use [0] [1].

To me, this is yet another reminder of why we need to future proof with libre/free/open licensing terms. Or reform copyright, but I don't see that happening within my lifetime.

[0] https://stackoverflow.com/legal/terms-of-service/public#lice...

[1] https://creativecommons.org/licenses/by-sa/4.0/

[2] https://www.redditinc.com/policies/developer-terms#text-cont...

[3] https://www.ycombinator.com/legal/#tou

belter2y ago

The change to 4.0 was done without permission, according to many in the community.

"Stack Exchange doesn't have the right to unilaterally change the license of previously submitted content." - https://meta.stackexchange.com/questions/333089/stack-exchan...

arp2422y ago

Older posts are under the older CC 3.0 license, newer posts under the CC 4.0.

https://meta.stackexchange.com/questions/344491/an-update-on...

ignoramous2y ago

Should petition Daniel Gackle et al to CC-BY user-generated content on HN.

usr11062y ago

I haven't studied the legalese, but I assume if they put all answers behind a paywall tomorrow nothing can be done. I don't think the license says they must share.

abetusk2y ago

The intent is to prevent exactly this type of possibility. They may be able to put it behind a paywall and copyright future work but not work that's already been published.

Obviously not legal precedent, but there is some discussion on the matter by the Creative Commons organization [0].

[0] https://creativecommons.org/faq/#what-happens-if-the-author-...

expertentipp2y ago

TX81Z2y ago

We just needed enough data to awaken the mega mind, now we may rest and the mega mind shall bring an era of peace, prosperity, and scientific achievement.

Praise the mega mind.

KirillPanov2y ago

All hail the mega mind.

keyle2y ago

Yikes. Reddit. Stack overflow. It's all going south.

Maybe we won't even have to wait for LLMs to destroy the web we used to know.

ori_b2y ago

This is LLMs destroying the web we used to know.

llm_nerd2y ago

ryan292y ago

> There are exceptions, obviously. There are true experts who wander the virtual halls of StackOverflow and dole out wisdom.

For example, via ChatGPT...

Q: How do I protect against the scrub of death with ZFS?

Except it's not a real thing [1]. For bonus fun, read the comment right after Ahrens'.

What's more frustrating than getting an answer on a programming question and taking hours to figure out that it was complete BS and doesn't work as described?

I think the real debates are going to come in the future if SO releases a paid LLM product that's trained on community contributed content and rankings.

1. https://arstechnica.com/civis/threads/ars-walkthrough-using-...

nunobrito2y ago

Time to adopt Nostr as a future-proof path.

zhte4152y ago

Without movement on this [1] I can't see adoption.

[1] https://github.com/nostr-protocol/nostr/issues/97

klabb32y ago

nunobrito2y ago

That's your own personal opinion, which is contrary to the growth metrics.

Nostr community is about true freedom to write just about anything.

nerdo2y ago

nunobrito2y ago

There is a quintessential feature that distinguishes Nostr from any other social network of the past decades: your private key proves that you wrote a specific text.
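
To make that a bit more concrete, here is a small Python sketch of the thing a Nostr key actually signs, as I understand NIP-01: the event id, a SHA-256 hash over a compact JSON array of the event's fields. The key, timestamp and text below are placeholders, not a real event. The signature itself (Schnorr over secp256k1) is left out because it needs a third-party library.

```python
import hashlib
import json


def nostr_event_id(pubkey_hex: str, created_at: int, kind: int, tags: list, content: str) -> str:
    """SHA-256 over the compact JSON array [0, pubkey, created_at, kind, tags, content]."""
    serialized = json.dumps(
        [0, pubkey_hex, created_at, kind, tags, content],
        separators=(",", ":"),   # no extra whitespace
        ensure_ascii=False,      # keep non-ASCII text as UTF-8
    )
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


# Placeholder values, not a real key or event. Kind 1 is a plain text note.
event_id = nostr_event_id("ab" * 32, 1686300000, 1, [], "hello")
print(event_id)
```

Because the author's public key and the text both feed into the hash, a valid signature over the id ties that exact text to that key; changing a single character gives a different id.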

jrnichols2y ago

Something else will come up, until the endless quest for advertising revenue catches up and ruins that as well.

8organicbits2y ago

Really strange comment.

> I was recently impacted by the Company's layoff.

> I'm offering what I can to uphold the Company's values of Transparency & being Community-centric.

I wouldn't offer transparency about a former employer's internal operations. Let them respond or at least ping a current employee to respond.

arp2422y ago

There may be an NDA involved. And staying on good terms with previous employers (or at least not burning any bridges) is generally a good idea regardless.

dzaima2y ago

shagie2y ago

For any curious, the original announcement of the data dump - https://stackoverflow.blog/2009/06/04/stack-overflow-creativ...

dylan6042y ago

"Just sorta stating the obvious here, but the timing of this is unbelievably terrible; I actually can't fathom a worse time for this call to be made than in light of this week. –zcoop98"

irowe2y ago

marcosdumay2y ago

It also means people are more motivated to build a replacement than just by the timewasting reddit being unavailable.

sitkack2y ago

Replace them both with a model somewhat like Wikipedia, open the content for the world, and get a cut of profits from the corporations that want to use the data to train on it.

marcosdumay2y ago

senko2y ago

I only hope this and the Reddit slowmo-trainwreck-in-progress sensitise more people to the value of the data they contribute and how it is appropriated by the platforms.

nightfly2y ago

Each contribution, and most individual contributors, are worthless though. They only have value in aggregate.

dpedu2y ago

> I mention the timing, as this change long pre-dated the current moderator strike and related policy changes.

A mod strike? I hadn't heard about this.

https://meta.stackexchange.com/questions/389811/moderation-s...

mdaniel2y ago

the thread: https://news.ycombinator.com/item?id=36192497

yeldarb2y ago

Sad, I had a lot of fun with it making StackRoboflow[1] (This Question Does Not Exist) a few years ago.

The models (AWD-LSTM and GPT-2) weren't good enough back then to usefully answer programming questions -- but it's super cool to see that vision realized with GPT-4 and other modern LLMs.

[1] https://stackroboflow.com

drubio2y ago

Yesterday's data dumps/APIs fostered community, new market/channel discoveries & low risk acquisitions.

animatethrow2y ago

jmyeet2y ago

If the data dump is gone, that compact is broken and honestly it's time to stop contributing to SO.

jupp0r2y ago

Wild guess: somebody came up with a business plan to monetize all that data for future LLM usage.

nologic012y ago

twitter, reddit, stack overflow... the digital version of burning the library of alexandria

it was always a broken system built on dodgy contracts, but it is still sad to see how unceremoniously everything implodes

will any lessons be learned? unlikely.

jstarfish2y ago

matsemann2y ago

The answer mentions a layoff. I haven't caught wind of that. What happened?

JasonPunyonOP2y ago

https://stackoverflow.blog/2023/05/10/a-message-from-prashan...

nolok2y ago

Stack Overflow has over 500 employees?!

floydian102y ago

I don't know why people are so often surprised about the number of employees in a company. My company has half the number of employees, we're not remotely as relevant as SO.

Kiro2y ago

Why is that surprising?

albertzeyer2y ago

Is this such a big problem? You could still scrape all the data, or not?

wolfgang422y ago

[1]: https://search.feep.dev/blog/post/2021-09-04-stackexchange

Etherlord872y ago

albertzeyer2y ago

dahwolf2y ago

This is an internet ecosystem issue that is simplified to thoughtless bashing of supposedly evil companies. Yes, these actions are clumsy and user-hostile but consider the big picture.

We have companies like Reddit and Stack Overflow not being profitable, despite being wildly successful in usage and internet mind-share. Neither of these companies is particularly over-staffed.

We post our "valuable" contributions there. So valuable that nobody wants to pay for it (structurally). We block ads. AI does the daylight robbery. We expect free APIs and data dumps.

Perhaps this is our wake-up call. The limitations of the "free" model and companies running at a loss for 15 years straight. It was always an anomaly.

tmvnty2y ago

Not sure if this is relevant, but the Hacker News BigQuery dataset also stopped updating since Nov 2022: https://issuetracker.google.com/issues/261579123

6346363462y ago

As a silver lining, perhaps the cash-grab, zero value-added clones will no longer clutter our google results?

cratermoon2y ago

bagasme2y ago

I guess this is a defensive move against being inadvertently used to train ChatGPT models.

pixl972y ago

I think you mean more like "thieves have stolen the horse! Quick, close the barn door."

marginalia_nu2y ago

Cat's out of the bag already with that one.

LastTrain2y ago

There is a time when the bill comes due for any "free" service.

KingOfCoders2y ago

The friends you thought you had weren't

endisneigh2y ago

Not surprising - why would any content driven business want all of their stuff to be vacuumed up for free?

mindcrime2y ago

arp2422y ago

> they are fundamentally changing the basic understanding between users and SE

dylan6042y ago

jefftk2y ago

urbandw311er2y ago

I wonder how many of those contributors, if re-consulted, would sign up to having their contributions used to train a for-profit LLM though?

I certainly didn’t sweat it out helping people on SO to pay for Sam Altman’s fucking swimming pool.

whatyesaid2y ago

I mean they would just scrape it if there's no data dump. It just makes it harder for the small guys. They probably scraped and are scraping HackerNews.

Generative AI doesn't follow copyright or even explicit software licenses as we have seen in AI art with human signatures and Microsoft Copilot.

blihp2y ago

jefftk2y ago
