There were many interesting design decisions, such as:
- whether to use multiple snapshots or a single snapshot per statement
- how to handle read uncertainty intervals
- how to incorporate SELECT FOR UPDATE locking into Raft
- how to handle SELECT FOR UPDATE subqueries
- how to prevent lost update anomalies between two UPDATEs
Some of the gory details are in the public RFC: https://github.com/cockroachdb/cockroach/blob/master/docs/RF...
This blog post just discusses the last point, but please AMA.
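For anyone unfamiliar with the last point: a lost update happens when two transactions read the same row and each writes back a value derived from that read, so the second write silently clobbers the first. A minimal Python model of the interleaving (no database involved; `balance` is just an illustrative shared value):

```python
# Both transactions read the same starting value under weak isolation.
balance = 100
t1_read = balance  # T1 reads 100
t2_read = balance  # T2 reads 100

balance = t1_read + 10  # T1 writes 110 and commits
balance = t2_read + 10  # T2 writes 110, clobbering T1's update

print(balance)  # 110, not the 120 you'd get from serial execution
```

Preventing exactly this, without forcing clients into full SERIALIZABLE retries, is what the UPDATE/UPDATE machinery in the post is about.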
It allows readers to see valid data (relationships are correct), while not blocking writers. It can be the difference between constant deadlocks and super-high throughput without lock contention.
Maybe as an option with a new syntax (UPDATE .. WHERE .. RETRYING) or something like this.
I'd speculate this is because Postgres and friends try to eke out every bit of single-node performance (which helps with single-row throughput and overall throughput, and obviously matters much more for them than for NewSQL), while the scalability of newer SQL databases might let them prefer easy consistency over single-node performance.
Possibly this is also just the passage of time benefiting newer systems.
There's also the fact that decades of DB research have brought techniques and approaches that beat old ones, and retrofitting existing systems with them can be hard (e.g. see the efforts to remove some of the sharp edges of PG's MVCC behavior and how hard they've turned out to be).
Transactions at all isolation levels are subject to lock contention, where a transaction attempts to lock a row that is already locked by a write or locking read. In such cases, the later transaction is blocked until the earlier transaction commits or rolls back, thus releasing its lock on the row.
Lock contention that produces a deadlock between two transactions will result in a transaction abort and a 40001 error
(ABORT_REASON_ABORTED_RECORD_FOUND or ABORT_REASON_PUSHER_ABORTED) returned to the client.
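Client-side, the standard response to a 40001 is to re-run the whole transaction from the top. A hedged sketch of that retry loop in Python; `SerializationFailure` here stands in for whatever exception your driver raises with SQLSTATE 40001, and the names are illustrative, not any particular driver's API:

```python
class SerializationFailure(Exception):
    """Stand-in for a driver error carrying SQLSTATE 40001."""
    pgcode = "40001"

def run_transaction(txn_body, max_retries=5):
    """Re-run the whole transaction body on a 40001 abort.

    Real code would open a fresh transaction each attempt
    (BEGIN ... COMMIT) rather than call a plain function.
    """
    for _ in range(max_retries):
        try:
            return txn_body()
        except SerializationFailure:
            continue  # aborted by a conflict or deadlock; retry from scratch
    raise RuntimeError("transaction failed after retries")

# Toy body that fails twice, then succeeds.
attempts = {"n": 0}
def body():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationFailure()
    return "committed"

print(run_transaction(body))  # committed, on the third attempt
```

The point being: even under READ COMMITTED, applications talking to a distributed database still want this loop somewhere, just hopefully hitting it far less often.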
So it looks like you still get a good old 40001 error, just like with SERIALIZABLE isolation.

[0] - https://www.cockroachlabs.com/docs/stable/read-committed#rea...
If an UPDATE has to retry halfway through, locks are held across the retry to help the system make progress. But as you point out, this could cause lock acquisition to happen in an unexpected order if new rows qualify during a retry. So far we haven't run into this, but we might need to provide an option for an UPDATE to drop locks on retry if deadlock turns into a bigger problem than livelock. It depends on the workload.
I'm not sure what you mean by that: the design can deadlock, but you just haven't seen it happen yet?
Edit: oh, I see in a comment below that deadlocks are detected and abort the transaction.
If `UPDATE player SET level = 'AA' WHERE team = 'Gophers';` is executed before the player swap, then why should "Stonebreaker" be upgraded to "AA"? I'd be pretty mad at my database if I sent those 2 queries in sequence and my DB decided to re-order them.
The sleep actually really complicates things here. I understand some queries run slower than others and the sleep is a useful tool to artificially slow things down, but now I don't know if I should interpret that as one command or two. If `WITH sleep AS (SELECT pg_sleep(5)) UPDATE player SET level = 'AA' FROM sleep WHERE team = 'Gophers';` is atomic then I'd expect it to put a lock on the 3 Gophers (which doesn't include Stonebreaker), wait the 5 seconds and then complete the update. The player swap would be blocked for those 5 seconds because it touches a row that's being updated.
Here's a full timeline in PG (U1 for first update, U2 for second update):
0. U1 begins executing, establishes read snapshot, starts pg_sleep(5).
1. U2 runs to completion.
2. U1 wakes up after 5 sec, scans `player` using snapshot from step 0.
3. U1 filters `team = Gophers`, gets 4, 5, 6.
4. U1 locks 4, 5, 6.
5. U1 performs EvalPlanQual: re-scans the latest version of those locked rows, which sees U2's write to 4 but not to 3.
6. U1 performs EvalPlanQual: re-filters those locked rows using the latest versions, gets 5, 6.
7. U1 writes new versions.
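The timeline above can be modeled in a few lines of Python. This is a toy model of the snapshot-then-recheck behavior, not real executor code; the row ids and teams follow the example discussed in this thread:

```python
# State U1 read at step 0, before U2's swap.
snapshot = {
    3: {"team": "Dolphins", "level": "A"},
    4: {"team": "Gophers",  "level": "A"},
    5: {"team": "Gophers",  "level": "A"},
    6: {"team": "Gophers",  "level": "A"},
}
# Latest state after U2 commits: players 3 and 4 swapped teams.
latest = {
    3: {"team": "Gophers",  "level": "A"},
    4: {"team": "Dolphins", "level": "A"},
    5: {"team": "Gophers",  "level": "A"},
    6: {"team": "Gophers",  "level": "A"},
}

# Steps 2-3: U1 scans and filters against its old snapshot.
candidates = [pid for pid, row in snapshot.items() if row["team"] == "Gophers"]
# Steps 5-6: EvalPlanQual re-filters only the *locked* rows against the
# latest versions. Row 3 was never a candidate, so it is never rechecked.
updated = [pid for pid in candidates if latest[pid]["team"] == "Gophers"]

print(sorted(candidates))  # [4, 5, 6]
print(sorted(updated))     # [5, 6]: row 4 dropped, row 3 missed entirely
```

This makes the asymmetry concrete: the recheck can evict rows that no longer qualify, but it can never pick up rows that newly qualify.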
CRDB is easier to reason about: after U1 wakes up from the sleep, it sees that it conflicts with U2 and simply retries the entire statement.
It's easy to miss, but in the swap query the levels also get swapped. Because (and this is harder to miss but easy to skip over) given what constraint 2 says, the level is actually a team level, not a player level.
So in a seqcst view, either the team’s players get upgraded to AA then the players’ levels get swapped during the exchange, or the players get exchanged then the team’s players get upgraded.
In both sequences you end up with Stonebreaker as an AA Gopher and Lamport as an A Dolphin.
Then it's bad unnormalized data design that is the problem here. If that is a team level, it should be in the team table, not the player table.
ACIDRain: Concurrency-Related Attacks on Database-Backed Web Applications: http://www.bailis.org/papers/acidrain-sigmod2017.pdf
So probably if you would be turned off enough by the name not to use the software, you don't actually need a distributed SQL database and are not the target customer.
Gyms know that people that sign up on Jan 2 don’t need annual (or any) membership. They take their money anyway. The more honorable ones allow cancellation or something but I don’t think anyone expects them to actively turn these folks away.
It's meant to imply indestructibility, not creepy yucky beastliness.
Obviously that goes over the heads of many people initially but at least it can then be explained!
They could have named this waterbear or adamantium or something.
As part of building read committed, we introduced other forms of replicated locks. In particular, replicated shared and exclusive locks (SELECT FOR UPDATE statements acquire exclusive locks). Locks acquired by read committed transactions are always replicated.
The unit of replication in CockroachDB is called a range, and every range is its own Raft group. Every range has a reserved portion of its keyspace where locks can be stored. This is known as the replicated lock table. Incorporating exclusive/shared locks into the replicated lock table involved adding a notion of lock strength to all stored locks (previously they were all assumed to be intents).
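A rough sketch of what a lock table with strengths might look like. To be clear, these names and the structure are illustrative only, not CockroachDB's actual identifiers; the real replicated lock table lives in a reserved keyspace of each range and is itself written through Raft:

```python
from enum import Enum

class LockStrength(Enum):
    # Previously every stored lock was effectively an intent (a
    # provisional write); read committed adds replicated Shared and
    # Exclusive locks alongside them.
    SHARED = 1
    EXCLUSIVE = 2
    INTENT = 3

# Toy replicated lock table: (key, holding txn) -> strength.
lock_table: dict[tuple[str, str], LockStrength] = {}

def conflicts(key: str, txn: str, want: LockStrength) -> bool:
    """Shared/shared is compatible; any other pairing conflicts."""
    for (k, holder), held in lock_table.items():
        if k != key or holder == txn:
            continue  # different key, or our own lock: no conflict
        if want is LockStrength.SHARED and held is LockStrength.SHARED:
            continue
        return True
    return False

lock_table[("player/3", "txn1")] = LockStrength.SHARED
print(conflicts("player/3", "txn2", LockStrength.SHARED))     # False
print(conflicts("player/3", "txn2", LockStrength.EXCLUSIVE))  # True
```

The design win of replicating these locks (rather than keeping them only in the leaseholder's memory) is that they survive lease transfers and node restarts, which is what lets read committed transactions rely on them across statement retries.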