We just noticed on our local dev environment that our M1 actually performs better (don’t ask me how).
We are now considering switching our production servers to M1 minis, which are also offered by our cloud provider. Do you have any experience running M1s / Macs in a production environment regarding stability / uptime etc.?
Edit: it’s a Rust application which uses the Rayon crate. The application gets on average one request a minute which crunches some numbers for an average of 2 seconds - so it’s mostly idle. No disk IO.
don’t ask me how
_You_ should be asking yourself how - there are lots of reasons why this could be happening, and knowing which one is important if you're changing stuff. Based on a "highly parallelizable" application performing better on 8 cores than on 32, I'd guess you're running out of something else: memory or disk bandwidth.
I recently inherited an app that makes heavy use of Redis caching because someone didn't first try to optimize the SQL. The complexity that Redis caching adds is insane to maintain compared to spending a few minutes optimizing the SQL.
The original poster really needs to hook up a profiler.
Also, having written lots of parallel code: parallelization isn't a magic way to make things faster. If the codebase breaks work up into lots of tiny tasks that run in parallel, the parallelization overhead can outweigh the work itself. Sometimes the fastest way to parallelize (both to run and to implement) is to keep most of the codebase serial, parallelize only at the highest level, and never share data between operations.
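A minimal sketch of that top-level-only approach, using plain std threads (the function name and workload are made up for illustration; with Rayon the same shape is just `data.par_iter().map(|&x| x * x).sum()`):

```rust
use std::thread;

// Hypothetical workload: sum of squares over a buffer.
// Parallelize once, at the top: one contiguous chunk per thread,
// no shared mutable state, results combined at the end.
fn sum_squares_parallel(data: &[u64], n_threads: usize) -> u64 {
    // Ceiling division so every element lands in some chunk.
    let chunk = ((data.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(|&x| x * x).sum::<u64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum::<u64>()
    })
}

fn main() {
    let data: Vec<u64> = (0..1_000).collect();
    let serial: u64 = data.iter().map(|&x| x * x).sum();
    assert_eq!(sum_squares_parallel(&data, 8), serial);
}
```

The point is the shape, not the library: each thread touches a disjoint slice, so there's no synchronization inside the hot loop and the only coordination cost is one spawn/join per core.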
It could very well be that your application is hitting a memory pattern that favors larger L1 cache, while the huge L3 cache of EPYC is not useful.
------
If you really wanted to know, you should learn how to use hardware performance counters and check out the instructions-per-clock. If you're around 1 or 2 instructions per clock tick, then you're CPU-bound.
If you're less than that, like 0.1 instructions per clock (ie: 10 clocks per instruction), then you're Cache and/or RAM-bound.
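On Linux the raw counts come from `perf stat`; a hedged sketch (here `sleep 0.2` stands in for your real binary, and counter access may require root or lowering `kernel.perf_event_paranoid`):

```shell
# Count retired instructions and CPU cycles for one run of the program.
# Recent perf versions print "insn per cycle" directly in the summary.
# Skip quietly if perf is unavailable or unpermitted on this machine.
command -v perf >/dev/null && perf stat -e instructions,cycles -- sleep 0.2 || true

# IPC is just instructions / cycles; with two made-up counts:
awk 'BEGIN { printf "IPC = %.2f\n", 8.0e9 / 4.0e9 }'
```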
-----
From there, you continue your exploration. You count up L1 cache hits, L2 cache hits, L3 cache hits and cache-misses. IIRC, there are some performance counters that even get into the inter-thread communications (but I forget which ones off the top of my head). Assuming you were cache/ram bound of course (if you were CPU-bound, then check your execution unit utilization instead).
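The cache-hierarchy events are exposed the same way; exact event names vary per CPU (check `perf list cache` for what your machine supports), so this is a sketch:

```shell
# Loads and misses at L1 and last-level cache for one run
# ("sleep 0.2" is a placeholder program; skip quietly without perf).
command -v perf >/dev/null && \
  perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses \
    -- sleep 0.2 || true

# Miss ratio = misses / loads; with made-up counts:
awk 'BEGIN { printf "L1 miss ratio = %.1f%%\n", 100 * 5.0e7 / 1.0e9 }'
```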
EPYC unfortunately doesn't have very accurate default performance counters, and I'd bet that no one really knows how to use M1 performance counters yet either.
While the default PMC counters of AMD/EPYC are inaccurate (but easy to understand), AMD has a second set of hard-to-understand, but very accurate profiling counters called IBS Profiling: https://www.codeproject.com/Articles/1264851/IBS-Profiling-w...
Still, having that information ought to give you a better idea of "why" your code performs the way it does. You may have to activate IBS-profiling inside of your BIOS before these IBS-profiling tools work.
By default, AMD only has the default performance counters available. So you may have a bit of a struggle juggling the BIOS + profiler to get things working just right, and then you'll absolutely struggle at understanding what the hell you're even looking at once all the data is in.
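Assuming a reasonably recent Linux kernel, IBS shows up as its own perf PMU (`ibs_op` / `ibs_fetch`) once the hardware and BIOS allow it, so you can check for it before fighting the BIOS:

```shell
# List PMUs the kernel exposes; AMD IBS appears as ibs_op / ibs_fetch
# (only on AMD CPUs with IBS enabled).
ls /sys/bus/event_source/devices/ 2>/dev/null | grep -i ibs || echo "no IBS PMU"

# If present, precise sampling looks roughly like:
#   perf record -e ibs_op// -- ./your_app
#   perf report
```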
If it’s doing it offline it’s probably cheaper to buy one and chuck it in your office than borrow one from a cloud provider. The ass end ones are really really really cheap. Much cheaper than just the CPU in an equivalent server machine. If they blow up, just mill down to the apple store and buy another one.
Disclaimers of course: (1) it doesn’t have ECC RAM (2) it doesn’t have redundant power. We ignore (1) and solve (2) by running a prometheus node exporter on it and seeing if it disappears.
My point is: If your workload is time critical, and you cannot afford downtime/outages then it may not be for you. If your workload can afford the time it would take to adopt a new M1 Mini when the old one dies, then maybe?
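The "see if it disappears" check above can be a plain Prometheus alerting rule; this is a hypothetical fragment (job name and thresholds are made up):

```yaml
# Fire when the mini's node_exporter target has been unreachable for 5 minutes.
groups:
  - name: m1-mini
    rules:
      - alert: M1MiniDown
        expr: up{job="m1-mini"} == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "M1 mini stopped responding to scrapes"
```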
It's kind of funny, but an M1 MacBook does. In fact it comes with a solid >12 hour UPS built-in.
The M1 is very fast at doing certain things and your application may just be making good use of the M1 instruction set... but without knowing a bit more it's difficult to tell.
I did performance analysis work for a long span of my career. While I'm reading between the lines of what you wrote, my first question is - what do you mean by performing better? As in, is it somehow able to process more of these tasks over a given timeframe? If so, I'd want to understand more about the workloads you are running to make sure it's a proper comparison.
There's a whole lot more questions we need to answer here to understand the results you are seeing before we can have any kind of discussion of whether M1s would be "better."
You've got to put some sort of firewall or something in front, don't let it accept tcp connections directly. You might be OK, but not great if you just set the listen queue really short; at least that should prevent the machine from falling over when it's flooded, but without syncookies, chances are you won't be able to make new connections either.
One option would be to look, if you haven’t, at MacStadium and what they’ve got there. You can get an M1 Mini there and it will be run by experts who know all about using M1 minis as servers. Considering your application is highly parallelizable, this would also make it easy to upgrade to the M1 Pro with double the performance cores down the line.
Secondly, if your application is running better on M1, that reeks of an application which is somehow greatly benefiting from single-threaded performance somewhere, at which the M1 excels and the Epyc is poor. That probably warrants some investigation.
Mac minis don't have redundant power or ECC. You might as well run a bunch of RPis or PICs. Get yourself some real enterprise servers or rent some via a VPS.
Disclaimer: I use a Mac mini as my living room HTPC. I wouldn't run anything real on it. That's what I have a 96 thread EPYC virtualized box for.
My advice is to always use ECC DRAM in production unless you're serving cat photos, porn, social media posts, or other societally useless applications. For anything that actually matters, please use ECC.
The fact that the errors were single bit errors also strongly pointed in that direction.
As for stability, my Scaleway M1 has never had any issues. Works just fine for some CI.
The latest is noticeably quicker than the older ones, and competitive with the M1.
And also, seconding the suggestion from joshdev to try AWS Graviton, which is also ARM-based but potentially better suited for cloud hosting than an M1.
if you figure this out, definitely write it up -- very cool tech blog topic, most people never get to debug cpu architecture firsthand