There's `perf stat`, which uses CPU performance counters to give you a high-level view of whether your workload is stalling while waiting on memory: https://stackoverflow.com/questions/22165299/what-are-stalle.... However, it won't tell you exactly where the problem is, just that there is one. You can do `perf record` on your process and then run `perf report` on the generated data. You'll see which functions and which lines/instructions are taking the most time. Most of the time it will be pretty obvious that it's a memory bottleneck, because it will be some kind of assignment or lookup.
If you're using an Intel processor, VTune is extremely detailed. Here's a nice article from Intel on using it: https://www.intel.com/content/www/us/en/docs/vtune-profiler/... . You'll see that one of the tables in the article lists functions as "memory bound" - most of their time is spent waiting on memory, as opposed to executing computations.
We experimented with KeyDB too, but I'm not sure what state that project is in.
A single CPU core can actually execute more than one instruction at a time by leveraging out-of-order execution. The trick to exploiting it is avoiding local data dependencies. By swapping the iteration order, they allow the CPU core to start the next iteration before the previous one has finished. Why? Because there is no data dependency anymore!
I haven't profiled that code, but I'd guess the bottleneck would now be the sum. It doesn't matter much, though, since accessing a register is the fastest operation; accessing the memory cache is slower, and accessing RAM is slower still.
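To make the dependency point concrete, here's a minimal sketch (not the code from the article; function and variable names are made up). The single-accumulator loop forms one serial chain of adds, while splitting into two independent accumulators gives the out-of-order core two chains it can run in flight simultaneously:

```c
#include <stddef.h>

/* One accumulator: each add must wait for the result of the previous
 * add, so the loop is limited by the latency of that dependency chain. */
long sum_single(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];              /* depends on the previous iteration's s */
    return s;
}

/* Two accumulators: the adds into s0 and s1 have no dependency on each
 * other, so the core can execute them out of order / in parallel. */
long sum_split(const long *a, size_t n) {
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];             /* independent of the s1 chain */
        s1 += a[i + 1];
    }
    if (i < n)
        s0 += a[i];             /* leftover element for odd n */
    return s0 + s1;
}
```

Both functions return the same total; the second just gives the scheduler more independent work per cycle.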
Algorithms and methodologies don't require being well-known or being well-maintained to be valid and useful comparisons.
The problem with linked lists is that node addresses aren't necessarily contiguous, because of malloc, and the next pointer could be NULL? Why does the interleaved loop make it faster for the CPU? It's still a linked list - arbitrary memory, could be NULL. Not sure what I'm missing here?
If there are two lists, in the first example, you're doing:
a1 -> b1 -> c1 -> d1
a2 -> b2 -> c2
In the 2nd example, you're doing:
[a1, a2]
[b1, b2]
[c1, c2]
[d1]
Visiting the same number of nodes, but because the nodes are referenced from an array, when you load `a1` you're [probably] also going to load `a2` into the cache.

Yep, the problem with a linked list is that you don't know the address of the next node; it could be allocated anywhere. However, if you have a contiguous array, when you read the first item you're also fetching the next items in the cache line.
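A hedged sketch of the two traversal orders described above (the Node type, list shapes, and function names are illustrative, not from the original code). In the sequential version every load depends on the previous one - the CPU can't fetch `n->next` until `n` has arrived from memory. In the interleaved version the loads for different lists are independent, so the core can have several in flight at once:

```c
#include <stddef.h>

typedef struct Node {
    int value;
    struct Node *next;
} Node;

/* First example: chase each list to the end before starting the next.
 * Every step is a dependent pointer load. */
long sum_sequential(Node *const lists[], int nlists) {
    long total = 0;
    for (int i = 0; i < nlists; i++)
        for (const Node *n = lists[i]; n != NULL; n = n->next)
            total += n->value;
    return total;
}

/* Second example: advance all lists one step per round, like the
 * [a1, a2], [b1, b2], ... order above. The per-list loads in each
 * round have no dependency on each other. Note: this sketch mutates
 * the caller's array of head pointers. */
long sum_interleaved(Node *lists[], int nlists) {
    long total = 0;
    int live = nlists;
    while (live > 0) {
        live = 0;
        for (int i = 0; i < nlists; i++) {
            if (lists[i] != NULL) {
                total += lists[i]->value;
                lists[i] = lists[i]->next;  /* independent per list */
                live++;
            }
        }
    }
    return total;
}
```

Both visit the same nodes and return the same total; the difference is purely in how much independent memory work the CPU sees at any moment.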