1) Sure, 100b, but that will just make it easier for the CPU version to hit the packet rate limit. I dialed it down to show just how fast the key rate is. Your entire proposal was that the CPU bottlenecked the NIC, and it does not. Also, most people have 100b keys, never mind the values.
2) 1:1 was never realistic. It's not even remotely realistic; as I said earlier, 5:1 would be pessimistic. In reality, the instances with get rates in the millions tend to have 100:1 or better ratios due to the nature of the data they're caching.
Yes, the newer LRU algorithm doesn't grab LRU locks on the read path, so it'll scale with the number of CPU cores. As I said in earlier comments, the sets don't currently scale, especially if you're hammering the same LRU (which is, again, unrealistic). If you just do a pure set load, you'll land somewhere between 900k and 1.5m ops/sec.
3) I did both single-get-pipelined and packet-pipelined benchmarks; and no, absolutely not. Clients are designed to use multiget mode when multiple keys are being fetched from the same server. This benefit is lost with the binary protocol (which will be fixed at some point).
4) Try an mget with 16 keys; it won't be too far off, though you might have to add one more mc-crusher thread.
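To make the multiget point concrete, here's a rough sketch (Python, with made-up key names) comparing the bytes on the wire for one ASCII-protocol multiget against 16 pipelined single-key gets. The key names and count are invented for illustration:

```python
# Sketch with invented keys: one ASCII multiget request line vs.
# 16 pipelined single-key "get" request lines.
keys = [f"key{i:02d}".encode() for i in range(16)]

# One multiget: a single "get k1 k2 ... k16\r\n" request line.
multiget = b"get " + b" ".join(keys) + b"\r\n"

# Pipelined singles: 16 separate "get kN\r\n" request lines.
singles = b"".join(b"get " + k + b"\r\n" for k in keys)

print(len(multiget), len(singles))  # 101 176
```

Beyond the smaller request, the multiget typically means fewer packets and syscalls on both sides, and the server can batch the responses into one write.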
In your last test, you're simply overloading it with sets. If you want to mislead people with a test like this, go ahead; but I'll point it out.
3.5M/s isn't too bad.
Memcached really isn't a great target for your sort of work. I love the idea of FPGA offload, but trying to advertise your thing as superior by making up your own rules is going to get called out.
1) The popularity of redis is absolutely damning in general. If people are okay with the performance of a single-CPU database with all-over-the-map latency profiles, the odds of you finding enough customers with extremely high-rate memcached pools to sustain a business are essentially zero. You'd solely be tricking people who think they need it.
2) You are not facebook. Nobody is facebook but facebook. 100b is not representative. It's not even representative of facebook's load.
What's worse, even for a more common case: if 99% of requests are 100b, the average size of an item might be 8k. That doesn't mean there are a bunch of items around 8k; there could instead be a few thousand items that are 50k-500k+ in size, getting hit 1% of the time, or even 0.1% of the time.
A 50k item is 500x the bandwidth of a 100b request for the same processing overhead. It's almost always something they need: a request might fetch a couple hundred items from memcached, with just a couple of them being large.
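As a rough illustration of how a thin tail of large items dominates bandwidth (the percentages and sizes here are invented for the sketch, not measured from any real pool):

```python
# Sketch with invented numbers: 99% of requests fetch a ~100-byte
# item, 1% fetch a 50k item from the tail of the size distribution.
small_size, large_size = 100, 50_000
small_frac, large_frac = 0.99, 0.01

avg_request = small_frac * small_size + large_frac * large_size
large_share = (large_frac * large_size) / avg_request

print(round(avg_request))      # 599 -- average bytes per request
print(round(large_share, 2))   # 0.83 -- the 1% of large hits carry ~83% of the bytes
```

So even though almost every request is tiny, the occasional large item dominates what the NIC actually moves, while per-request processing cost stays roughly flat.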
This ends up making RAM the greatest expense in the system. If so few users really need this performance, and the newer versions of memcached have a much higher perf ceiling, the extra features it has to drive down RAM usage are more valuable.
The best cost/power savings most users can get is finding a way to attach more RAM to fewer CPU cores: to be frank, an r4.4xlarge with 8 cores would suit better. Or find ways to push larger cold values into flash, freeing up RAM so those 100b values can be served quickly.