> You are also explicitly saying that you want device memory by specifying DEVICE_LOCAL_BIT. There's no difference.
There is. One is a simple malloc call; the other takes arguments with numerous combinations of usage flags that all end up doing exactly the same thing, so why do they even exist?
> You _have_ to be able to allocate both on host and device.
cuMemAlloc and cuMemAllocHost, as mentioned before.
> Because there's such a thing as accessing GPU memory from the host
Never had the need for that, I just cuMemcpyHtoD and DtoH the data. Of course host-mapped device memory can continue to exist as a separate, more cumbersome API. The 256MB limit is cute but apparently not relevant in Cuda, where I've been memcpying buffers that are GBs in size between host and device for years.
> No, because if that's the only way to allocate memory, how are you going to allocate staging buffers for the CPU to write to?
With the mallocHost counterpart.
cuMemAllocHost, and so a hypothetical vkMallocHost, gives you pinned host memory where you can prep data before sending it to the device with cuMemcpyHtoD.
> This is how you end up with a zillion flags.
Apparently only if you insist on mapped/host-visible memory. This and usage flags never ever come up in Cuda where you just write to the host buffer and memcpy when done.
> This is the reason I keep bringing UMA up but you keep brushing it off.
Yes, I think I now get why you keep bringing up UMA: because you want to directly access buffers from both host and device via pointers. That's great, but I don't have the need for that and I wouldn't trust the performance behaviour of that approach. I'll stick with memcpy, which is fast, simple, has fairly clear performance behaviour and requires none of the nonsense you insist on being necessary. But what I want isn't either this or that approach; I want the simple approach in addition to what exists now, so we can both have our cakes.