Thanks for the details on how to use your benchmark script and for taking the time to investigate this. I hadn’t heard of mc-crusher before, and it seems to work somewhat differently from memtier_benchmark.
First, a few significant differences:
1) Your value size is 10B, which completely changes the results. Let’s keep the value size at 100B, which is more realistic.
2) The ratio of gets to sets significantly affects the requests per second. We assumed a 1:1 ratio when we did our measurements. Increasing the percentage of gets really speeds up req/sec. We didn’t observe this effect on ElastiCache. Is this a recent improvement in the GitHub version of memcached?
3) Your benchmark puts multiple keys in the same get command, whereas memtier pipelines multiple get commands, each with one key. The latter seems more realistic to us.
4) We pipelined 16 get commands per packet, while your configuration used 50 keys per get command.
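To make point 3 concrete, here is a small Python sketch (key names are illustrative) of what the two request styles look like on the wire in the memcached ASCII protocol: one multi-key get command versus several pipelined single-key get commands.

```python
def multikey_get(keys):
    # One ASCII get command naming many keys: "get k1 k2 ... kN\r\n".
    # The server answers with one response containing all the VALUE blocks.
    return ("get " + " ".join(keys) + "\r\n").encode()

def pipelined_gets(keys):
    # N back-to-back single-key get commands sent without waiting for
    # replies; the server answers each command separately.
    return b"".join(("get %s\r\n" % k).encode() for k in keys)

keys = ["foobar1", "foobar2", "foobar3"]
print(multikey_get(keys))    # b'get foobar1 foobar2 foobar3\r\n'
print(pipelined_gets(keys))  # b'get foobar1\r\nget foobar2\r\nget foobar3\r\n'
```

Both styles can batch many keys into one packet, but the pipelined form is closer to how real clients issue independent lookups, which is why memtier’s behavior seems more representative.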
I was able to reproduce the same setup as ours (~1.2M req/sec) with your mc-crusher benchmark using the following config. It has a 1:1 get-to-set ratio, pipeline depth 16, and a 100B value size.
send=ascii_set,recv=blind_read,conns=50,key_prefix=foobar,key_prealloc=0,pipelines=16,value_size=100
send=ascii_set,recv=blind_read,conns=50,key_prefix=foobar,key_prealloc=0,pipelines=16,value_size=100,thread=1
send=ascii_set,recv=blind_read,conns=50,key_prefix=foobar,key_prealloc=0,pipelines=16,value_size=100,thread=1
send=ascii_set,recv=blind_read,conns=50,key_prefix=foobar,key_prealloc=0,pipelines=16,value_size=100,thread=1
send=ascii_get,recv=blind_read,conns=50,pipelines=16,key_prefix=foobar,key_prealloc=1
send=ascii_get,recv=blind_read,conns=50,pipelines=16,key_prefix=foobar,key_prealloc=1,thread=1
send=ascii_get,recv=blind_read,conns=50,pipelines=16,key_prefix=foobar,key_prealloc=1,thread=1
send=ascii_get,recv=blind_read,conns=50,pipelines=16,key_prefix=foobar,key_prealloc=1,thread=1
send=ascii_get,recv=blind_read,conns=50,pipelines=16,key_prefix=foobar,key_prealloc=1,thread=1
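For anyone following along, each mc-crusher config line above is just a comma-separated list of key=value options describing one group of connections (so the block above defines four set-side groups and five get-side groups, 50 connections each). A minimal Python sketch of that structure, purely for illustration (this is not mc-crusher’s actual parser):

```python
def parse_crusher_line(line):
    # Split one mc-crusher config line into its key=value options.
    # Each line defines one independent group of client connections.
    return dict(pair.split("=", 1) for pair in line.strip().split(","))

line = ("send=ascii_set,recv=blind_read,conns=50,key_prefix=foobar,"
        "key_prealloc=0,pipelines=16,value_size=100")
opts = parse_crusher_line(line)
print(opts["send"], opts["conns"], opts["pipelines"], opts["value_size"])
```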
I used the GitHub memcached on an r4.4xlarge. I ran memcache-top on the server instance to measure requests per second; it showed about 750k gets/sec and 600k sets/sec.
With a 10:1 ratio of gets to sets I’m seeing about 3.5M req/sec, which looks better than ElastiCache.