undefined | Better HN

0 pointsaseipp2y ago0 comments

Note: that isn't the same thing as what the OP describes, at least according to those release notes, but it does fall under the "HMM" umbrella. You still need to specifically allocate your memory with hipMallocManaged before it can be transparently used between the CPU and GPU. Nvidia calls this "unified memory" (and has had it for 10 years now.)

It's confusing, because there are basically three levels of "Heterogeneous Memory Management" in this regard, in order of increasing features and improved programming model:

1. Nothing. You have to both allocate memory with the right allocator (no malloc, no mmap), and also memcpy to/from the host memory to the device, when you want to use it. You still need to "synchronize" with the compute kernel to ensure it completes, before you can see results from a compute kernel.

2. Unified virtual memory. You have to allocate memory with the right allocator (no malloc, no mmap), but after that, you don't need to copy to/from the device memory via special memcpy routines. Memory pages are migrated to/from as you demand them; you can address more memory than your actual GPU has, hence "virtual". You still need to synchronize with the compute kernel to ensure it completes. You can (in theory) LD_PRELOAD a different malloc(2) routine that uses the proper cudaMalloc call or whatever, making all malloc(2) based memory usable for the accelerator, but it doesn't fix systems/libraries/programs that use custom non-malloc(2) allocators or e.g. mmap

3. True heterogeneous memory management. You can use ANY piece of allocated memory, from any memory allocator, and share it with the accelerator, and do not need to copy to/from the device memory. You can use mmap'd pages, custom memory allocators, arbitrary 3rd party libraries, it doesn't really matter. Hell, you can probably set the PROT_WRITE bit on your own executable .text sections and then have the GPU modify your .text from the accelerator. The GPU and CPU have a unified view without any handholding from userspace. You still need to synchronize with the compute kernel to ensure it completes.

Nvidia implements all the features above, while HIP/AMD only implements the first two. Note that AMD has long been involved in various HMM-adjacent work for many years now (HSAIL, various GCC HSA stuff), so it's not like they're coming out of nowhere here. But as far as actual features and "It works today" goes, they're now behind if you're looking at HIP vs CUDA.

0 comments

trws2y ago

I can see how you got here from the release notes, but the conclusions are a bit off. For hardware and kernels that support the full HMM setup with AMD, you get 3 today as long as XNACK is turned on. Systems like Frontier have been using it for some time now.

Also, 2 can be subdivided into systems that implement it by having two allocations, one host one device, and triggering transfers when the GPU might access memory (2.1) and those that implement demand paging (2.2). The HMM support adds demand paging for type 2.2 as well as type 3 on supported hardware, where without it hip had to use either 2.1 or remote PCIE access to provide “unified memory”. Those were dark days, but for current hardware on appropriate kernels appropriately configured, AMD implements memory just as unified as either NVIDIA’s HMM or ATS implementations.

jdoerfert2y ago

This is not true.

3. is supported by AMD on new hardware, e.g., Frontier. See https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#...

aseippOP2y ago

Amazing, thanks for the correction(s)!

j / k navigate · click thread line to collapse