I've done distributed systems, and in fact I usually don't implement systems that are 100% correct under maximally adverse conditions, for performance and convenience reasons. But it's important to know and acknowledge where and how the system is less than 100% bullet-proof, so that you can understand and recover from failures. When you have 1000 servers, some virtual in the cloud, some in a colocation facility, and maybe 10 or 20 subtle variants in hardware, you will get failures that should not happen, but they do. For example, I've had an AWS host where the system time jumped backwards and forwards by 5 minutes every 30 seconds. And I've seen a TCP load-balancer appliance managed by a colo facility do amazingly wrong things.
The point is, you need to understand where the consistency scheme in a distributed system is "fudged", "mostly reliable", or "fine as long as conditions aren't insane", and where it is really guaranteed (which is almost nowhere). Then you can appropriately arrange for "virtually impossible" errors/failures to be detected and contained. The ad-hoc logic you bring to the problem of distributed systems doesn't really help.
Not to "dogpile", but to elucidate for others who may not know some of the history here: https://aphyr.com/posts/287-asynchronous-replication-with-fa... and the also interesting reply: http://antirez.com/news/56
About the ad-hoc logic: in this case I think the problem is actually that you want to accept only what already exists. For example, there are important distributed systems papers that assume an absolute bound on the time error among processes, using GPS units. This is a much stronger (much more synchronous) system model than the one I assume, yet it is regarded as acceptable. So you have to ask yourself, regardless of the fact that this idea is from me, whether you think that a computer, using the monotonic clock API, can count relative time with a maximum percentage of error. Yes? Then it's a viable system model.
About you linking to Aphyr: Sentinel was never designed to make Redis strongly consistent, nor to make it able to retain writes. I always analyzed the system for what it is: a failover solution that, combined with the asynchronous replication semantics of Redis, has certain failure modes, which are considered fine for the intended use case of Redis. That is, it has just a few best-effort checks to try to minimize data loss, and it has ways to ensure that only one configuration eventually wins (so that there are no permanent split-brain conditions after partitions heal), and nothing more.
I'm not sure what sense it makes to link to this. If you believe Redlock is broken, just state it in a rigorous form, and I'll try to reply.
If you ask me to be rigorous while I'm trying to make clear with facts what is wrong about Martin's analysis, you should do the same and just use arguments, not hand-waving about me being bad at distributed systems. Btw, in my blog post I'll detail everything and show why I think Redlock is not affected by unbounded network delays.
But that's optimistic concurrency control! In that case the locking service is a pure optimization and doesn't affect correctness. So why use this complex multi-node locking scheme, then, instead of a single Redis node?
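To illustrate the point that the lock is only an optimization: with optimistic concurrency control, correctness comes from an atomic compare-and-set at write time, not from holding the lock. A minimal sketch, where an in-memory versioned dict stands in for a single Redis node (all names here are illustrative, not a real Redis API):

```python
# Sketch of optimistic concurrency control with a versioned store.
# The in-memory dict stands in for a single storage node.

class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))

    def write_if_version(self, key, expected_version, value):
        """Atomic compare-and-set: the write succeeds only if nobody
        else has written since we read. This check alone guarantees
        correctness; any lock merely reduces wasted work."""
        version, _ = self._data.get(key, (0, None))
        if version != expected_version:
            return False  # someone else committed first; caller retries
        self._data[key] = (version + 1, value)
        return True

store = VersionedStore()
v, _ = store.read("counter")
assert store.write_if_version("counter", v, 1)      # first writer wins
assert not store.write_if_version("counter", v, 2)  # stale writer rejected
```

With Redis itself the same shape is achievable with WATCH/MULTI/EXEC or a small Lua script, since a single node executes commands serially.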
Along with the author of this blog post, I would recommend the usage of ZooKeeper if you have a need for obtaining a lock in a distributed environment. You can read an analysis of how ZooKeeper performs under an (intentionally) partitioned network here:
Interestingly, even though HBase uses Zookeeper for coordination, it does not use it for fencing. Instead, fencing is handled by atomic "log-rolling" operations on the HDFS namenode, which can be configured to use its own quorum-based journal. (The equivalent of a "token" is the monotonically-increasing namenode operation sequence number.) So the principle is the same: the system responsible for doing mutual exclusion, and the system that actually stores the data, must be coupled.
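The principle of fencing with a monotonically increasing token can be sketched in a few lines. This is illustrative only (not HBase's or HDFS's actual API): the storage system itself remembers the highest token it has seen and rejects any writer presenting an older one.

```python
# Sketch of fencing tokens: the storage system, not the lock service,
# rejects stale writers by tracking the highest token seen so far.

class FencedStorage:
    def __init__(self):
        self.highest_token = 0
        self.log = []

    def write(self, token, record):
        # A writer that paused (GC, partition) and resumed with an old
        # token is fenced off here, at the data's doorstep.
        if token < self.highest_token:
            raise PermissionError(f"stale token {token}")
        self.highest_token = token
        self.log.append(record)

storage = FencedStorage()
storage.write(33, "old leader's write")
storage.write(34, "new leader's write")
try:
    storage.write(33, "old leader, resumed after a long pause")
except PermissionError:
    pass  # stale writer fenced off
```

The key design point is exactly the coupling described above: the mutual-exclusion decision and the data store share the token, so a lock holder that merely *believes* it still holds the lock cannot corrupt anything.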
To achieve unlimited scalability, you need to make sure that no single data store is shared between all processes and that the amount of interprocess communication (on a per-process basis) doesn't increase as you add more processes.
The more sharing you have, the less you can scale your system.
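One common way to get that property is to hash-partition keys across independent stores, so no single store is shared by all processes. A toy sketch (the list of dicts stands in for stores on separate nodes; names are illustrative):

```python
# Sketch: hash-partitioning keys across independent stores so that no
# single data store is shared between all processes.
from hashlib import sha256

N_SHARDS = 4
shards = [dict() for _ in range(N_SHARDS)]  # each would live on its own node

def shard_for(key: str) -> dict:
    # Each key maps deterministically to exactly one shard, so processes
    # touching different keys never contend on shared state.
    digest = int(sha256(key.encode()).hexdigest(), 16)
    return shards[digest % N_SHARDS]

shard_for("user:42")["user:42"] = {"name": "Ada"}
assert shard_for("user:42")["user:42"]["name"] == "Ada"
```

Adding processes then adds capacity roughly linearly, because per-process communication stays confined to the shards that process actually touches.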
Personally, I find it easier to write highly parallel code than parallel-up-to-a-certain-point code that involves locks like mutexes and/or semaphores. Highly parallel code just takes a little bit of extra planning and thinking up-front, but it saves lots of time and stress down the line.
This statement confused me. It seems to say that the packets were delayed in the network for 90 seconds before being delivered. From reading the original sources it actually sounds like packets were discarded by the switches, so the original requests were discarded and the nodes were partitioned for 90 seconds. When the partition was removed, both nodes thought they were the leader and simultaneously requested the other to shut down. Can anyone confirm? Keeping packets delayed in a network for 90 seconds would seem quite difficult (though not impossible, assuming certain bugs).
Edit: On re-reading I think this is just talking about the network stack in general, not the network. A temporary partition may delay delivery of your request until max TCP retries is exceeded on your host; if it's recovered before then, your request may arrive later than you intended.
However, yes, I was thinking about the network stack as a whole. Any kind of retries, e.g. TCP retransmission on timeout, effectively turns packet loss into packet delay (within certain bounds). Thus, even if you have a network interruption ("partition") and all packets are dropped, it could happen that after the interruption is fixed, a node receives packets that were sent before or during the interruption. For this reason, I find it helpful to think of it as delay, not just packet loss.
Edit: Why is this downvoted? It is a serious question.
In the scenario you describe, the database commit works, you send the email but... you get an error back from the email send function. Now what?
This is a basic distributed computing problem. Choosing how to solve it can have a drastic effect on your code and your infrastructure. If the secondary function is "charge a credit card", obviously you can't do that twice. But if it's merely "send an email", then maybe it's okay if people get a duplicate.
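For the can't-do-it-twice case, the standard trick is to make the operation idempotent with a deduplication key, so that retrying after an ambiguous failure is safe. A minimal sketch, where a set stands in for a durable store (all names here are illustrative):

```python
# Sketch of idempotency via a deduplication key: a retry after an
# ambiguous failure ("did the charge go through?") becomes a no-op.

processed = set()  # would be a durable table in a real system

def charge_card(idempotency_key, amount, do_charge):
    if idempotency_key in processed:
        return "already-done"  # duplicate request: safe no-op
    do_charge(amount)
    processed.add(idempotency_key)
    return "charged"

charges = []
assert charge_card("order-123", 999, charges.append) == "charged"
assert charge_card("order-123", 999, charges.append) == "already-done"
assert charges == [999]  # charged exactly once despite the retry
```

In a real system the key would be recorded in the same transaction as the charge itself, otherwise a crash between the two steps reopens the duplicate window.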
Google lease/lock/deadlock/race condition and read up on why databases tend to make for bad implementations.
Having just re-read Lynch's paper, can you explain what you mean here? I didn't see anything explicitly relying on time. It could be there is some implicit usage I didn't see. Additionally, the paper's impossibility result is about "perfectly correct consensus", which applies with and without clocks, and then it has a positive result for "partially correct consensus" (i.e. not deciding a value is a correct result). I'm not sure which you mean when you say "consensus becomes impossible", as it is either already impossible (the perfectly correct protocol) with one faulty process or (to my understanding) not dependent on time (the partially correct protocol).
p.s. great article!
However, if you do allow some timing assumptions — just enough to measure a timeout, nothing more — then consensus becomes possible. In this case, the timeout is known as an "unreliable failure detector". See Chandra and Toueg's paper for details.
In a good algorithm, the safety properties do not depend on timing, only the liveness properties do. Rather than "consensus being impossible" perhaps it would be clearer to say "consensus may never terminate" in an asynchronous system. But "consensus impossible" is the standard way of phrasing this issue in the distributed systems literature.
For example, on page 375: “Crucial to our proof is that processing is completely asynchronous; that is, we make no assumptions about the relative speeds of processes or about the delay time in delivering a message. We also assume that processes do not have access to synchronized clocks, so algorithms based on time-outs, for example, cannot be used.”
1. http://www.cs.princeton.edu/courses/archive/fall07/cos518/pa...
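The timeout-based "unreliable failure detector" mentioned above can be sketched very simply: it may wrongly suspect a slow-but-alive process (hence "unreliable"), and its suspicion is revocable, but it never blocks forever, which is what restores liveness. A toy version (names are illustrative, not from any real library):

```python
# Sketch of an unreliable failure detector in the Chandra-Toueg sense:
# suspect any process that has been silent longer than a timeout.

class FailureDetector:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heartbeat = {}  # process -> time of last heartbeat

    def heartbeat(self, process, now):
        self.last_heartbeat[process] = now

    def suspects(self, process, now):
        # May be wrong (the process might just be slow), but always answers.
        last = self.last_heartbeat.get(process, float("-inf"))
        return now - last > self.timeout

fd = FailureDetector(timeout=5)
fd.heartbeat("node-a", now=0)
assert not fd.suspects("node-a", now=3)   # recent heartbeat: trusted
assert fd.suspects("node-a", now=10)      # silence > timeout: suspected
fd.heartbeat("node-a", now=11)
assert not fd.suspects("node-a", now=12)  # suspicion is revocable
```

This matches the safety/liveness split above: an algorithm's safety shouldn't depend on the detector guessing right, only its progress does.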
I send a packet over the network. I configured the timeout to be 10 seconds. 10 seconds pass. Is the packet truly lost? Or maybe the packet was received, processed, and a response is on the way back, but it takes 20 seconds instead of 10! Or maybe it takes an hour instead of 20 seconds. Or a decade.
We don't know if the packet is truly lost or is just taking longer than $timeout_timer, and making assumptions is dangerous.
Please consider this answer [0] (which I personally understood as - YES, real systems are always partial sync): "Asynchronous systems have no fixed upper bounds. In practice, systems tend to exhibit partial synchrony, which is described as one of two models by Dwork and Lynch in Consensus in the Presence of Partial Synchrony. [1]"
The specs are submitted for review at https://github.com/openstack/openstack-specs
https://github.com/openstack/openstack-specs/tree/master/spe...
As Martin points out, though, many (most?) distributed lock implementations are or can be broken in various ways. He hints at one of the fundamental problems: consensus. Many distributed lock implementations fail simply because they cannot achieve reliable consensus, or don't even try to.
That said, distributed locks can be safe and handle reasonably high throughput. The Atomix distributed lock is one example:
http://atomix.io/atomix/docs/coordination/#distributedlock
Since consensus requires quorum and quorum requires availability, there is a risk that your ability to obtain a lock or learn about lock-related events could be affected by availability, but at least in the case of Atomix, the system is fairly resilient, with auto-failovers to passive or inactive nodes as needed (as compared to, say, a ZooKeeper-based lock).
[1] http://docs.oracle.com/cd/B28359_01/server.111/b28318/consis...
If you want to acquire some shared resource, then elect an owner for that resource.
When you perform a write, instead of the token, send a hash of the object previously read. The storage can then compare this against a hash of the resource's current state. If it doesn't match, the lock expired and the write is not accepted.
This would reduce the state to keep track of to the resources themselves.
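A quick sketch of this hash-compare idea (illustrative names, an in-memory store standing in for real storage): the writer sends the hash of the version it read, and the store accepts the write only if that hash still matches the current state.

```python
# Sketch of hash-guarded writes: compare-and-swap keyed on a hash of the
# state the writer last read, instead of a separately-tracked token.
from hashlib import sha256

def h(value: str) -> str:
    return sha256(value.encode()).hexdigest()

class HashGuardedStore:
    def __init__(self, value=""):
        self.value = value

    def write(self, read_hash, new_value):
        if read_hash != h(self.value):
            return False  # state changed since the read: lock had expired
        self.value = new_value
        return True

store = HashGuardedStore("v1")
stale = h("v1")
assert store.write(stale, "v2")      # hash matches current state: accepted
assert not store.write(stale, "v3")  # hash is now stale: rejected
```

One caveat versus monotonic tokens: if the state ever returns to a byte-identical earlier value, an old writer's hash would match again (an ABA problem), which the strictly increasing token avoids.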
Generally when I think of this problem I reach for etcd, not Zookeeper first, in the hopes of it being lighter (with a relatively vague definition of "light"), and easier to use.
Consensus can be achieved without clocks using blockchains: instead of timelocking the resource, a client can broadcast its altered version to the network after making changes. The next client can then start working on top of this changed resource, but since multiple versions might be floating around (due to timing issues in the first place), the longest chain wins. So if a client submits work done on a shorter chain, it is rejected by the network.
This has other issues, such as higher latency and reorganization risk, but it does away with clocks altogether by providing a different method of timestamping.
This is similar to the proof-of-work timestamp server that Bitcoin uses, but we can do away with proof-of-work because the resource and membership in the network are centralised.
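The longest-chain rule itself is simple to sketch. In this toy version (illustrative, and deliberately ignoring ties and reorganizations), each submitted version names its parent, and the network accepts only work that extends the longest known chain:

```python
# Sketch of "longest chain wins" without proof-of-work: work built on a
# branch shorter than the best-known chain is rejected.

chains = {"genesis": 0}  # version id -> length of the chain ending there

def submit(parent, version_id):
    """Accept new work only if it extends the longest known chain."""
    if parent not in chains:
        return False
    if chains[parent] < max(chains.values()):
        return False  # built on a shorter branch: rejected
    chains[version_id] = chains[parent] + 1
    return True

assert submit("genesis", "a1")      # extends the tip
assert submit("a1", "a2")           # extends the new tip
assert not submit("genesis", "b1")  # fork off an old, shorter point loses
```

A real longest-chain system would allow competing tips of equal length and resolve them later, which is exactly the reorganization risk mentioned above; with centralised membership, the acceptance check can simply be enforced at submission time instead.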