This is the part that CUDA alternatives always miss when their programming models only support C and some C++ subset.
My feeling is that unified memory and on-demand paging (introduced with Pascal, I believe) were mainly about making it easier to onboard existing applications (e.g., HPC codes) to the GPU a bit at a time, with less friction. For writing a GPU application from scratch, I don't think it makes much sense (unless the granularity of the data you are moving around is really tiny, and/or you can't predict in advance what you will need on the CPU or GPU).
One way to think about this is to observe that HMM is effectively a software-based way of providing the same programming model as an NVIDIA Grace Hopper Superchip.
1) I am curious what the AMD equivalent of nVidia's HMM is, or will be...
2) I am curious if software will be able to be written with HMM (or some higher level abstraction API) such that HMM enabled software will also function on an AMD or other 3rd party GPU...
AMD has much the same variations as Nvidia here; some details at https://github.com/amd/amd-lab-notes/blob/release/mi200-memo.... The single-memory systems are called APUs. The internet thinks the MI300 (in El Capitan) is one of those. The game consoles and mobile chips are too.
I'm not sure what the limits are in terms of arbitrary heterogeneous execution if you want to push the boundaries, e.g. can you JIT amdgpu code into memory you got from mmap and have one of the GPU execution units branch to it? I don't see why not, but I haven't tried it.
In principle, I suppose a page should be able to migrate between Nvidia and amdgpu hardware on a machine containing GPUs from both vendors, though that isn't likely to be a well-tested path.
AMD added HMM support in ROCm 5.0 according to this: https://github.com/RadeonOpenCompute/ROCm/blob/develop/CHANG...
It's confusing, because there are basically three levels of "Heterogeneous Memory Management" in this regard, in order of increasing features and improved programming model:
1. Nothing. You have to both allocate memory with the right allocator (no malloc, no mmap) and also explicitly memcpy between host and device memory when you want to use it. You still need to "synchronize" with the compute kernel to ensure it completes before you can see its results.
2. Unified virtual memory. You have to allocate memory with the right allocator (no malloc, no mmap), but after that, you don't need to copy to/from device memory via special memcpy routines. Memory pages are migrated to/from the device on demand; you can address more memory than your GPU actually has, hence "virtual". You still need to synchronize with the compute kernel to ensure it completes. You can (in theory) LD_PRELOAD a different malloc(3) that calls cudaMalloc or whatever under the hood, making all malloc-based memory usable from the accelerator, but that doesn't fix systems/libraries/programs that use custom non-malloc allocators or, e.g., mmap.
3. True heterogeneous memory management. You can take ANY piece of allocated memory, from any memory allocator, share it with the accelerator, and never copy to/from device memory. You can use mmap'd pages, custom memory allocators, arbitrary 3rd-party libraries; it doesn't really matter. Hell, you can probably set the PROT_WRITE bit on your own executable .text sections and have the GPU modify your .text from the accelerator. The GPU and CPU have a unified view without any handholding from userspace. You still need to synchronize with the compute kernel to ensure it completes.
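The three levels can be sketched side by side in one CUDA program. This is a minimal sketch assuming a recent CUDA toolkit; the level-3 launch on a plain malloc pointer only works on an HMM-capable system (or hardware like Grace Hopper), and error checking is omitted for brevity.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);

    // Level 1: right allocator, explicit copies, explicit sync.
    float *host = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) host[i] = 1.0f;
    float *dev;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost); // blocks until done

    // Level 2: right allocator, but no explicit copies; pages
    // migrate on demand. Still must sync before the CPU reads.
    float *managed;
    cudaMallocManaged(&managed, bytes);
    scale<<<(n + 255) / 256, 256>>>(managed, n);
    cudaDeviceSynchronize();

    // Level 3 (HMM): any pointer works, even plain malloc.
    // Still must sync before the CPU reads the results.
    scale<<<(n + 255) / 256, 256>>>(host, n);
    cudaDeviceSynchronize();

    cudaFree(dev);
    cudaFree(managed);
    free(host);
    return 0;
}
```

Note that the one thing none of the levels removes is the synchronization step: the kernel launch is asynchronous at every level.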
Nvidia implements all the features above, while HIP/AMD only implements the first two. Note that AMD has been involved in HMM-adjacent work for many years (HSAIL, various GCC HSA efforts), so it's not like they're coming out of nowhere here. But as far as actual features and "it works today" go, they're behind if you're comparing HIP with CUDA.
They’ve really left this area wide open for over a decade now when it’s been extremely clear this is where the market was going.
Their GPU and GPU-compute story is a mess, because ROCm has the most confusing compatibility matrix possible. They've been late to compute accelerators as well.
I don’t think there’ll be any abstraction layers either. The community as a whole is more than happy to be single-vendor. AMD has shown they can’t build compute stacks, not for technology reasons but purely because of long-term decisions. The community therefore won’t do it for them.
You're not helping anything by going off on a rant based on assumptions and falsehoods - this sort of comment is exactly what the phrase "FUD" was coined to describe.
What Nvidia is doing here sounds similar. Does Linux provide such primitives now?
(Not exactly "now", but when the software is recompiled / ported to this)
This is not a change to "features" but a change to the programming model. You never need to write cudaMalloc or cudaFree again; you can just use any allocator or tool. This means more off-the-shelf code will just work with CUDA. So now your io_uring buffers can be shared with the GPU trivially, for example, or mmap'd pages that a library gave you, or whatever.
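The mmap case makes the point concretely: no CUDA allocator is involved anywhere, yet the kernel can touch the pages. A hedged sketch, assuming a driver/kernel combination with HMM support (it will fail at launch time on systems without it):

```cuda
#include <sys/mman.h>
#include <cuda_runtime.h>

__global__ void bump(int *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1;
}

int main() {
    // Plain anonymous mmap; these pages could equally have come
    // from a file mapping, io_uring, or a third-party library.
    const int n = 1 << 20;
    int *p = (int *)mmap(nullptr, n * sizeof(int),
                         PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    // Under HMM the GPU faults the pages in on first touch;
    // no cudaMalloc, no cudaMemcpy, no registration call.
    bump<<<(n + 255) / 256, 256>>>(p, n);
    cudaDeviceSynchronize();

    munmap(p, n * sizeof(int));
    return 0;
}
```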
The programming model is one of the things Nvidia does significantly better than any competitor. Single source model + HMM is a big step up from something like OpenCL in productivity and correctness.
On Grace Hopper chips, HMM is granular down to the cache line (64 bytes); on x86 systems I believe they said it's (of course) 4 KiB page granularity (a 4 KiB page is 64 cache lines, so touching a single value can migrate far more data than it did on Grace Hopper).
The only thing that's been added is bank groups in DDR4, IMO. But all you need to know is that modern RAM is maybe 16- to 32-way parallel per stick. The interface operates faster than the RAM can respond, so an "optimal" CPU will queue up 32 to 64 read/write commands (32 for the first stick, 32 for the second) before the first command ever gets a response.
Understanding that mechanism is what that document is about: how CPUs coalesce memory accesses and parallelize requests.
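A toy sketch of where that parallelism comes from: the memory controller decodes different bits of the physical address into channel / bank group / bank, so consecutive cache lines land in different banks and their accesses can overlap in flight. The bit positions below are entirely hypothetical; real mappings are controller-specific and usually undocumented.

```cuda
#include <cstdio>
#include <cstdint>

// Hypothetical DDR4-style address decode (real controllers differ):
//   bit  [6]    -> channel
//   bits [7:8]  -> bank group
//   bits [9:10] -> bank within the group
struct BankAddr { unsigned channel, bank_group, bank; };

static BankAddr decode(uint64_t addr) {
    return BankAddr{
        (unsigned)((addr >> 6) & 0x1),
        (unsigned)((addr >> 7) & 0x3),
        (unsigned)((addr >> 9) & 0x3),
    };
}

int main() {
    // Walk eight consecutive 64-byte cache lines: under this mapping
    // they fan out across channels and bank groups, so the controller
    // can issue all of them before the first row access completes.
    for (uint64_t addr = 0; addr < 8 * 64; addr += 64) {
        BankAddr b = decode(addr);
        printf("addr %#6llx -> ch %u, bg %u, bank %u\n",
               (unsigned long long)addr, b.channel, b.bank_group, b.bank);
    }
    return 0;
}
```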
----------------
GPUs have one additional coalescing layer, given channel vs. bank conflicts and all that noise. But most GPU manuals (be they Nvidia or AMD) cover those details.