There were many interesting design decisions, such as:
- whether to use multiple snapshots or a single snapshot per statement
- how to handle read uncertainty intervals
- how to incorporate SELECT FOR UPDATE locking into Raft
- how to handle SELECT FOR UPDATE subqueries
- how to prevent lost update anomalies between two UPDATEs
Some of the gory details are in the public RFC: https://github.com/cockroachdb/cockroach/blob/master/docs/RF...
This blog post just discusses the last point, but please AMA.
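For anyone unfamiliar with the last point: a lost update happens when two transactions read the same row and each writes back a value derived from that read, so the second write silently clobbers the first. A minimal Python model of the interleaving (no database involved; `balance` is just an illustrative shared value):

```python
# Both transactions read the same starting value under weak isolation.
balance = 100
t1_read = balance  # T1 reads 100
t2_read = balance  # T2 reads 100

balance = t1_read + 10  # T1 writes 110 and commits
balance = t2_read + 10  # T2 writes 110, clobbering T1's update

print(balance)  # 110, not the 120 you'd get from serial execution
```

Preventing exactly this, without forcing clients into full SERIALIZABLE retries, is what the UPDATE/UPDATE machinery in the post is about.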
It allows readers to see valid data (relationships are correct), while not blocking writers. It can be the difference between constant deadlocks and super-high throughput without lock contention.
Maybe as an option with a new syntax (UPDATE .. WHERE .. RETRYING) or something like this.
I'd speculate this is because Postgres and friends try to eke out every bit of single-node performance (which helps with single-row throughput and overall throughput, and obviously matters much more for them than for NewSQL), while the scalability of newer SQL databases might let them prefer easy consistency over single-node performance.
Possibly this is also just the passage of time benefiting newer systems.
There's also the fact that decades of DB research have brought techniques and approaches that beat old ones, and retrofitting existing systems with them can be hard (e.g. see the efforts to remove some of the sharp edges of PG's MVCC behavior and how hard they've turned out to be).
Transactions at all isolation levels are subject to lock contention, where a transaction attempts to lock a row that is already locked by a write or locking read. In such cases, the later transaction is blocked until the earlier transaction commits or rolls back, thus releasing its lock on the row.
Lock contention that produces a deadlock between two transactions will result in a transaction abort and a 40001 error
(ABORT_REASON_ABORTED_RECORD_FOUND or ABORT_REASON_PUSHER_ABORTED) returned to the client.
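Client-side, the standard response to a 40001 is to re-run the whole transaction from the top. A hedged sketch of that retry loop in Python; `SerializationFailure` here stands in for whatever exception your driver raises with SQLSTATE 40001, and the names are illustrative, not any particular driver's API:

```python
class SerializationFailure(Exception):
    """Stand-in for a driver error carrying SQLSTATE 40001."""
    pgcode = "40001"

def run_transaction(txn_body, max_retries=5):
    """Re-run the whole transaction body on a 40001 abort.

    Real code would open a fresh transaction each attempt
    (BEGIN ... COMMIT) rather than call a plain function.
    """
    for _ in range(max_retries):
        try:
            return txn_body()
        except SerializationFailure:
            continue  # aborted by a conflict or deadlock; retry from scratch
    raise RuntimeError("transaction failed after retries")

# Toy body that fails twice, then succeeds.
attempts = {"n": 0}
def body():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationFailure()
    return "committed"

print(run_transaction(body))  # committed, on the third attempt
```

The point being: even under READ COMMITTED, applications talking to a distributed database still want this loop somewhere, just hopefully hitting it far less often.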
So it looks like you still get a good old 40001 error, just like with SERIALIZABLE isolation.

[0] - https://www.cockroachlabs.com/docs/stable/read-committed#rea...
If an UPDATE has to retry halfway through, locks are held across the retry to help the system make progress. But as you point out, this could cause lock acquisition to happen in an unexpected order if new rows qualify during a retry. So far we haven't run into this, but we might need to provide an option for an UPDATE to drop locks on retry if deadlock turns into a bigger problem than livelock. It depends on the workload.
I'm not sure what you mean by that: the design can deadlock, but you just haven't seen it happen yet?
Edit: oh, I see in a comment below that deadlocks are detected and abort the transaction.
If `UPDATE player SET level = 'AA' WHERE team = 'Gophers';` is executed before the player swap, then why should "Stonebreaker" be upgraded to "AA"? I'd be pretty mad at my database if I sent those 2 queries in sequence and my DB decided to re-order them.
The sleep actually really complicates things here. I understand some queries run slower than others and the sleep is a useful tool to artificially slow things down, but now I don't know if I should interpret that as one command or two. If `WITH sleep AS (SELECT pg_sleep(5)) UPDATE player SET level = 'AA' FROM sleep WHERE team = 'Gophers';` is atomic then I'd expect it to put a lock on the 3 Gophers (which doesn't include Stonebreaker), wait the 5 seconds and then complete the update. The player swap would be blocked for those 5 seconds because it touches a row that's being updated.
Here's a full timeline in PG (U1 for first update, U2 for second update):
0. U1 begins executing, establishes read snapshot, starts pg_sleep(5).
1. U2 runs to completion.
2. U1 wakes up after 5 sec, scans `player` using snapshot from step 0.
3. U1 filters `team = Gophers`, gets 4, 5, 6.
4. U1 locks 4, 5, 6.
5. U1 performs EvalPlanQual: re-scans the latest version of those locked rows, which sees U2's write to 4 but not to 3.
6. U1 performs EvalPlanQual: re-filters those locked rows using the latest versions, gets 5, 6.
7. U1 writes new versions.
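The timeline above can be modeled in a few lines of Python. This is a toy model of the snapshot-then-recheck behavior, not real executor code; the row ids and teams follow the example discussed in this thread:

```python
# State U1 read at step 0, before U2's swap.
snapshot = {
    3: {"team": "Dolphins", "level": "A"},
    4: {"team": "Gophers",  "level": "A"},
    5: {"team": "Gophers",  "level": "A"},
    6: {"team": "Gophers",  "level": "A"},
}
# Latest state after U2 commits: players 3 and 4 swapped teams.
latest = {
    3: {"team": "Gophers",  "level": "A"},
    4: {"team": "Dolphins", "level": "A"},
    5: {"team": "Gophers",  "level": "A"},
    6: {"team": "Gophers",  "level": "A"},
}

# Steps 2-3: U1 scans and filters against its old snapshot.
candidates = [pid for pid, row in snapshot.items() if row["team"] == "Gophers"]
# Steps 5-6: EvalPlanQual re-filters only the *locked* rows against the
# latest versions. Row 3 was never a candidate, so it is never rechecked.
updated = [pid for pid in candidates if latest[pid]["team"] == "Gophers"]

print(sorted(candidates))  # [4, 5, 6]
print(sorted(updated))     # [5, 6]: row 4 dropped, row 3 missed entirely
```

This makes the asymmetry concrete: the recheck can evict rows that no longer qualify, but it can never pick up rows that newly qualify.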
CRDB is easier to reason about: after U1 wakes up from the sleep, it sees that it conflicts with U2 and simply retries the entire statement.
It's easy to miss, but in the swap query the levels also get swapped. Because (and this is harder to miss but easy to skip over) given what constraint 2 says, the level is actually a team level, not a player level.
So in a seqcst view, either the team’s players get upgraded to AA then the players’ levels get swapped during the exchange, or the players get exchanged then the team’s players get upgraded.
In both sequences you end up with Stonebreaker as an AA Gopher and Lamport as an A Dolphin.
Then it's bad unnormalized data design that is the problem here. If that is a team level, it should be in the team table, not the player table.
ACIDRain: Concurrency-Related Attacks on Database-Backed Web Applications: http://www.bailis.org/papers/acidrain-sigmod2017.pdf
So probably if you would be turned off enough by the name not to use the software, you don't actually need a distributed SQL database and are not the target customer.
Gyms know that people that sign up on Jan 2 don’t need annual (or any) membership. They take their money anyway. The more honorable ones allow cancellation or something but I don’t think anyone expects them to actively turn these folks away.
It's meant to imply indestructibility, not creepy yucky beastliness.
Obviously that goes over the heads of many people initially but at least it can then be explained!
They could have named this waterbear or adamantium or something.
As part of building read committed, we introduced other forms of replicated locks. In particular, replicated shared and exclusive locks (SELECT FOR UPDATE statements acquire exclusive locks). Locks acquired by read committed transactions are always replicated.
The unit of replication in CockroachDB is called a range, and every range is its own Raft group. Every range has a reserved portion of its keyspace where locks can be stored. This is known as the replicated lock table. Incorporating exclusive/shared locks into the replicated lock table involved adding a notion of lock strength to all stored locks (previously they were all assumed to be intents).
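A rough sketch of what a lock table with strengths might look like. To be clear, these names and the structure are illustrative only, not CockroachDB's actual identifiers; the real replicated lock table lives in a reserved keyspace of each range and is itself written through Raft:

```python
from enum import Enum

class LockStrength(Enum):
    # Previously every stored lock was effectively an intent (a
    # provisional write); read committed adds replicated Shared and
    # Exclusive locks alongside them.
    SHARED = 1
    EXCLUSIVE = 2
    INTENT = 3

# Toy replicated lock table: (key, holding txn) -> strength.
lock_table: dict[tuple[str, str], LockStrength] = {}

def conflicts(key: str, txn: str, want: LockStrength) -> bool:
    """Shared/shared is compatible; any other pairing conflicts."""
    for (k, holder), held in lock_table.items():
        if k != key or holder == txn:
            continue  # different key, or our own lock: no conflict
        if want is LockStrength.SHARED and held is LockStrength.SHARED:
            continue
        return True
    return False

lock_table[("player/3", "txn1")] = LockStrength.SHARED
print(conflicts("player/3", "txn2", LockStrength.SHARED))     # False
print(conflicts("player/3", "txn2", LockStrength.EXCLUSIVE))  # True
```

The design win of replicating these locks (rather than keeping them only in the leaseholder's memory) is that they survive lease transfers and node restarts, which is what lets read committed transactions rely on them across statement retries.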