We just noticed on our local dev environment that our M1 actually performs better (don’t ask me how).
We are now considering switching our production servers to M1 minis, which are also offered by our cloud provider. Do you have any experience running M1s / Macs in a production environment regarding stability / uptime etc.?
Edit: it’s a Rust application which uses the Rayon crate. The application gets on average one request a minute which crunches some numbers for an average of 2 seconds - so it’s mostly idle. No disk IO.
don’t ask me how
_You_ should be asking yourself how - there are lots of reasons why this could be happening, and knowing which one is important if you're changing stuff. Based on a "highly parallelizable" application performing better on 8 cores than on 32, I'd guess you're running out of something else: memory or disk bandwidth.
I recently inherited an app that makes heavy use of Redis caching because someone didn't first try to optimize the SQL. The complexity that Redis caching adds is insane to maintain compared to spending a few minutes optimizing the SQL.
The original poster really needs to hook up a profiler.
Also, having written lots of parallel code: parallelization isn't a magic way to make things faster. If the codebase breaks work up into lots of tiny tasks that run in parallel, the parallelization overhead can outweigh the work itself. Sometimes the fastest way to parallelize (both to run and to implement) is to keep most of the codebase serial, parallelize only at the highest level, and never share data between operations.
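A minimal sketch of that top-level-only approach, using plain std threads (the function name and workload are made up for illustration; with Rayon the same shape is just `data.par_iter().map(|&x| x * x).sum()`):

```rust
use std::thread;

// Hypothetical workload: sum of squares over a buffer.
// Parallelize once, at the top: one contiguous chunk per thread,
// no shared mutable state, results combined at the end.
fn sum_squares_parallel(data: &[u64], n_threads: usize) -> u64 {
    // Ceiling division so every element lands in some chunk.
    let chunk = ((data.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(|&x| x * x).sum::<u64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum::<u64>()
    })
}

fn main() {
    let data: Vec<u64> = (0..1_000).collect();
    let serial: u64 = data.iter().map(|&x| x * x).sum();
    assert_eq!(sum_squares_parallel(&data, 8), serial);
}
```

The point is the shape, not the library: each thread touches a disjoint slice, so there's no synchronization inside the hot loop and the only coordination cost is one spawn/join per core.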
It could very well be that your application is hitting a memory pattern that favors larger L1 cache, while the huge L3 cache of EPYC is not useful.
------
If you really wanted to know, you should learn how to use hardware performance counters and check out the instructions-per-clock. If you're around 1 or 2 instructions per clock tick, then you're CPU-bound.
If you're less than that, like 0.1 instructions per clock (ie: 10 clocks per instruction), then you're Cache and/or RAM-bound.
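On Linux the raw counts come from `perf stat`; a hedged sketch (here `sleep 0.2` stands in for your real binary, and counter access may require root or lowering `kernel.perf_event_paranoid`):

```shell
# Count retired instructions and CPU cycles for one run of the program.
# Recent perf versions print "insn per cycle" directly in the summary.
# Skip quietly if perf is unavailable or unpermitted on this machine.
command -v perf >/dev/null && perf stat -e instructions,cycles -- sleep 0.2 || true

# IPC is just instructions / cycles; with two made-up counts:
awk 'BEGIN { printf "IPC = %.2f\n", 8.0e9 / 4.0e9 }'
```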
-----
From there, you continue your exploration. You count up L1 cache hits, L2 cache hits, L3 cache hits and cache-misses. IIRC, there are some performance counters that even get into the inter-thread communications (but I forget which ones off the top of my head). Assuming you were cache/ram bound of course (if you were CPU-bound, then check your execution unit utilization instead).
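The cache-hierarchy events are exposed the same way; exact event names vary per CPU (check `perf list cache` for what your machine supports), so this is a sketch:

```shell
# Loads and misses at L1 and last-level cache for one run
# ("sleep 0.2" is a placeholder program; skip quietly without perf).
command -v perf >/dev/null && \
  perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses \
    -- sleep 0.2 || true

# Miss ratio = misses / loads; with made-up counts:
awk 'BEGIN { printf "L1 miss ratio = %.1f%%\n", 100 * 5.0e7 / 1.0e9 }'
```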
EPYC unfortunately doesn't have very accurate default performance counters, and I'd bet that no one really knows how to use M1 performance counters yet either.
While the default PMC counters of AMD/EPYC are inaccurate (but easy to understand), AMD has a second set of hard-to-understand, but very accurate profiling counters called IBS Profiling: https://www.codeproject.com/Articles/1264851/IBS-Profiling-w...
Still, having that information ought to give you a better idea of "why" your code performs the way it does. You may have to activate IBS-profiling inside of your BIOS before these IBS-profiling tools work.
By default, AMD only has the default performance counters available. So you may have a bit of a struggle juggling the BIOS + profiler to get things working just right, and then you'll absolutely struggle at understanding what the hell you're even looking at once all the data is in.
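Assuming a reasonably recent Linux kernel, IBS shows up as its own perf PMU (`ibs_op` / `ibs_fetch`) once the hardware and BIOS allow it, so you can check for it before fighting the BIOS:

```shell
# List PMUs the kernel exposes; AMD IBS appears as ibs_op / ibs_fetch
# (only on AMD CPUs with IBS enabled).
ls /sys/bus/event_source/devices/ 2>/dev/null | grep -i ibs || echo "no IBS PMU"

# If present, precise sampling looks roughly like:
#   perf record -e ibs_op// -- ./your_app
#   perf report
```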
If it’s doing it offline it’s probably cheaper to buy one and chuck it in your office than borrow one from a cloud provider. The ass end ones are really really really cheap. Much cheaper than just the CPU in an equivalent server machine. If they blow up, just mill down to the apple store and buy another one.
Disclaimers of course: (1) it doesn’t have ECC RAM (2) it doesn’t have redundant power. We ignore (1) and solve (2) by running a prometheus node exporter on it and seeing if it disappears.
My point is: If your workload is time critical, and you cannot afford downtime/outages then it may not be for you. If your workload can afford the time it would take to adopt a new M1 Mini when the old one dies, then maybe?
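The "see if it disappears" check above can be a plain Prometheus alerting rule; this is a hypothetical fragment (job name and thresholds are made up):

```yaml
# Fire when the mini's node_exporter target has been unreachable for 5 minutes.
groups:
  - name: m1-mini
    rules:
      - alert: M1MiniDown
        expr: up{job="m1-mini"} == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "M1 mini stopped responding to scrapes"
```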
It's kind of funny, but an M1 MacBook does. In fact it comes with a solid >12 hour UPS built-in.
The M1 is very fast at doing certain things and your application may just be making good use of the M1 instruction set... but without knowing a bit more it's difficult to tell.
I did performance analysis work for a long span of my career. While I'm reading between the lines of what you wrote, my first question is - what do you mean by performing better? As in, is it somehow able to process more of these tasks over a given timeframe? If so, I'd want to understand more about the workloads you are running to make sure it's a proper comparison.
There's a whole lot more questions we need to answer here to understand the results you are seeing before we can have any kind of discussion of whether M1s would be "better."
You've got to put some sort of firewall or something in front, don't let it accept tcp connections directly. You might be OK, but not great if you just set the listen queue really short; at least that should prevent the machine from falling over when it's flooded, but without syncookies, chances are you won't be able to make new connections either.
One option would be to look, if you haven’t, at MacStadium and what they’ve got there. You can get an M1 Mini there and it will be run by experts who know all about using M1 minis as servers. Considering your application is highly parallelizable, this would also make it easy to upgrade to the M1 Pro with double the performance cores down the line.
Secondly, if your application is running better on M1, that reeks of an application which is somehow greatly benefiting from single-threaded performance somewhere, at which the M1 excels and the Epyc is poor. That probably warrants some investigation.
Mac minis don't have redundant power or ECC. You might as well run a bunch of RPis or PICs. Get yourself some real enterprise servers or rent some via a VPS.
Disclaimer: I use a Mac mini as my living room HTPC. I wouldn't run anything real on it. That's what I have a 96 thread EPYC virtualized box for.
My advice is to always use ECC DRAM in production unless you're serving cat photos, porn, social media posts, or other societally useless applications. For anything that actually matters, please use ECC.
The fact that the errors were single bit errors also strongly pointed in that direction.
As for stability, my Scaleway M1 has never had any issues. Works just fine for some CI.
The latest is noticeably quicker than the older ones, and competitive with the M1.
And also, seconding the suggestion from joshdev to try AWS Graviton, which is also ARM-based but potentially better suited for cloud hosting than an M1.
if you figure this out, definitely write it up -- very cool tech blog topic, most people never get to debug cpu architecture firsthand