
0 points · peterwwillis · 7y ago · 0 comments

We can already do this - but it's a bad idea.

Plenty of services are API compatible with Amazon S3 (e.g. anyone can run their own S3 clone) so people can modify existing sites to use S3 with OAuth. Use OAuth to allow the user to delegate access to their S3 service link. No new protocols needed, no big innovations required.

But for this to work on anything other than the most rudimentary data (media files, blog posts, and serialized data) would require completely changing the way all modern applications are written. Databases would all have to change, APIs would all need to follow specific standards, and networks would need to become a hell of a lot more stable, higher bandwidth, and lower latency.

Assume you're Twitter, and you want to map-reduce all of the data of all your users to find out how many people retweeted a user, and then notify those users. Now you need to connect to every user's service provider, get their data, store it temporarily on your own servers, duplicate everything, do your processing, and then write changes back to all storage services for all users. Now do this every second. If you don't, you have to store this map-reduced data on your own service's storage, which violates the principle of only using the user's storage pod.
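A minimal sketch of what that aggregation pass looks like when every user's data lives in a separate pod. Everything here is hypothetical: `Pod`, `fetch_retweets`, and `push_notification` are in-memory stand-ins for remote storage calls, not any real service's API. The point is only that a single pass reads every pod once and writes back to every affected author.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Hypothetical stand-in for one user's remote storage pod."""
    owner: str
    retweets: list = field(default_factory=list)       # authors this owner retweeted
    notifications: list = field(default_factory=list)  # messages written back to the owner
    reads: int = 0
    writes: int = 0

    def fetch_retweets(self):
        self.reads += 1  # assumption: in practice this is one network round trip
        return list(self.retweets)

    def push_notification(self, message):
        self.writes += 1  # assumption: another round trip per write
        self.notifications.append(message)


def count_retweets(pods):
    """'Map' over every pod, 'reduce' into a per-author retweet count."""
    counts = Counter()
    for pod in pods.values():
        for author in pod.fetch_retweets():
            counts[author] += 1
    return counts


def notify_authors(pods, counts):
    """Write the aggregated result back to each retweeted author's pod."""
    for author, n in counts.items():
        if author in pods:
            pods[author].push_notification(f"{n} retweet(s)")


pods = {
    "alice": Pod("alice", retweets=["carol"]),
    "bob": Pod("bob", retweets=["carol", "alice"]),
    "carol": Pod("carol"),
}
counts = count_retweets(pods)
notify_authors(pods, counts)
total_reads = sum(p.reads for p in pods.values())
```

With three users this is three reads and two writes per pass. The read count grows linearly with the number of pods, and the whole pass has to be repeated every time the result needs refreshing.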

In fact, data would have to become more centralized to work in this model. Currently, application data exists across a range of services in a variety of networks, all of it being dynamically accessed in different ways before it is accessed by a user. There are dozens of different databases used just to open up the TV Guide on your cable company's set-top box. All of that would have to be centralized in one or two databases in order for the storage and processing to be disconnected.

Not only that, but a lot of data is useless to anyone but the original service provider or original application. Only a Facebook clone would be able to use Facebook's data, and only data relevant to Facebook's ad sales should stay on Facebook's servers, even if it contains "Peter clicked on ad X at Y time". Should there be a separation of what kind of data gets decentralized? Do we really want to go down the rabbit hole of what is my data, and what is data about me that a company has originated and created value from? (Is a picture mine because it's a picture of something I own, or is it mine if I took the picture?)

The idea that every component of every application could be completely decentralized from each other is unlikely. Now, what is more in the realm of possibility is doing a Google or Facebook, and creating features that allow exporting or importing all data. But that process is not perfect, and the procedure can take from minutes to days. And to use this data it would still all have to follow standards specific to a particular application.

And again, we already have a lot of these data standards. We have standards for most of the kinds of data that exist today, such as calendar, contacts, e-mail, instant message, voip, office documents, images, and so on. We have standards to synchronize and syndicate data feeds. We have standards to federate accounts and manage permissions. But commercial sites don't natively build these features as interoperable with each other - because, why would they?

Storage and processing of data are intimately connected with the specific applications that use them, and trying to decouple them will result in inefficiency and complication, with no clear advantages.


kjetilk · 7y ago

OK, so nobody said decentralization is easier. There have been plenty of academic papers saying pretty much the same thing you do. But we have to do it, not for technical reasons, but for ethical and social ones. So, we're starting to tackle it head on.

Your TV Guide is a good example of things that aren't hard. They don't change very quickly, so you can just use a cache. That's easy.

Finding the number of RTs, that's also easy, apart from it being an open world of course. When they RT, they notify you. And you want to display those RTs with your tweet? Just cache those who notified you.
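One way to read that, as a rough sketch: the author keeps a local cache of whoever notified them, and the displayed count is just the size of that cache. The names below (`TweetInbox`, `receive`, `retweet_count`) are made up for illustration and are not part of Solid or any other specification.

```python
class TweetInbox:
    """Hypothetical cache of retweet notifications, held by the tweet's author."""

    def __init__(self):
        # tweet id -> set of users who notified the author of a retweet
        self._retweeters = {}

    def receive(self, tweet_id, retweeter):
        # A set makes repeated notifications from the same user idempotent.
        self._retweeters.setdefault(tweet_id, set()).add(retweeter)

    def retweet_count(self, tweet_id):
        return len(self._retweeters.get(tweet_id, set()))


inbox = TweetInbox()
inbox.receive("t1", "bob")
inbox.receive("t1", "dave")
inbox.receive("t1", "bob")  # duplicate notification, counted once
count = inbox.retweet_count("t1")
```

Because of the open world, the count only covers retweeters who actually sent a notification; anyone who retweeted without notifying is invisible to the author.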

Stable data access standard? That is Solid itself. And the data model, that's RDF.

There are ways that you can go about doing this stuff.

Finally, we're also getting some traction around this in academia, which has been hung up on stuff that isn't helpful for too long.

peterwwillis (OP) · 7y ago

Actually, the TV Guide example uses data that updates constantly. Every single interaction a user has is recorded and is used by other systems. The guide also changes based on user-specific views or preferences. Another example is Netflix's famous user-specific recommendations: they change constantly, the algorithm behind them is regularly fine-tuned, and they are a strategic feature. Even just playing a single show requires a dozen different calls to authorize its playing, based on a number of considerations.

Finding the number of retweets is also more difficult, because there is other data that gets recorded too. Not only do you have your own data, you now have the data of everyone else who retweeted you. Is it your data, or theirs? Who is caching it, and for how long? How does refreshing the cache affect the consistency of each user's views? With decentralized applications you have to choose what kind of functionality you will support.

But, yes, in theory, if you allowed only one service provider to use some given data, you could rely on caching (read: holding a copy of data indefinitely) to a good extent. But as soon as you have multiple providers using it, you enter the extremely hairy world of multi-master, high-availability, strong-consistency replication. AKA absolute hell. But this isn't even the most difficult problem to me.
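A toy illustration of the problem, assuming two providers that each hold a cached copy of the same pod value. `Pod` and `CachingProvider` are hypothetical in-memory classes; real replication involves far more than this, but even this much shows a stale read.

```python
class Pod:
    """Hypothetical user-owned store holding a single value."""

    def __init__(self, value):
        self.value = value


class CachingProvider:
    """Hypothetical service that reads from its own cached copy of the pod."""

    def __init__(self, pod):
        self.pod = pod
        self.cached = pod.value  # copy taken once, at startup

    def read(self):
        return self.cached

    def write(self, value):
        # Updates the pod and this provider's cache, but no other provider's.
        self.cached = value
        self.pod.value = value

    def refresh(self):
        self.cached = self.pod.value


pod = Pod("v1")
a = CachingProvider(pod)
b = CachingProvider(pod)

a.write("v2")
stale = b.read()   # b still serves its old copy
b.refresh()
fresh = b.read()   # b only catches up after an explicit refresh
```

With a single provider the cache is always right, because that provider is the only writer. With two, something has to decide when `refresh` runs and what happens if both write between refreshes, which is where the replication machinery comes in.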

We already had some good data access standards. The question is, why weren't sites using them to allow data interoperability/mobility? Answer: they didn't want to. So even if you create a technical solution for all of this, the best you will get is the Facebooks of the world publishing a read-only calendar feed, clunky, slow export tools, and single-feature one-way application integrations. Like we have now.

I don't see an ethical or social reason to decouple the data from the services I use, and I don't think the majority of the world population does, either. The only ethical/social concern I have is with the very existence of the service, which is a different concern.
