It's possible to further improve the fragmentation characteristics if the maximum size of the memory to be allocated is known ahead of time. The original used a 5.3 "floating point" scheme for the free bins, but given the maximum size you can opt for 4.4 or 6.2 or whatever can cover the whole region, giving more dense (or sparse) allocation granularity as needed.
This same idea can also be extended to allocating texture atlas regions when the regions have power of two size. 256 bins can cover all possible texture sizes that GPUs can handle, so you can get hard O(1) texture packing.
I'm also curious about making something like this work as a general purpose allocator (as alluded to by the README). For that it would be necessary to make a reverse lookup from pointer address to allocation handle, which would require something like a red-black tree covering the address space, which would no longer be O(1). If anyone has ideas on this front, I would be happy to hear them.
This algorithm is so clever and kind of a game changer. Allocating and freeing are fast real time operations (only dozens of CPU cycles and a few memory writes). It kind of makes "bump allocators" obsolete because this allocator is almost as fast but supports trimming and freeing allocations as well. This algorithm requires more memory but the amount is quite low.
A radix tree can solve this: https://en.wikipedia.org/wiki/Radix_tree
I used one way, way back to do exactly the same thing: upon a free, I needed to look up all of the metadata for an address. For a 32-bit address space, you can just allocate a giant array up front, and use the address as an index. For a 64-bit address space, it's obviously way too big to statically allocate it up front. A radix tree neatly solves the problem. An arbitrarily sized radix tree is not constant lookup, but for reasons I honestly cannot remember, for a 64-bit address space, it's guaranteed to only be 3 levels deep.
See my implementation, which (I believe) I borrowed from tcmalloc: https://github.com/scotts/streamflow/blob/master/streamflow....
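The fixed depth falls out of the address layout: with 48-bit virtual addresses and 4 KiB pages, only 36 bits remain significant, and splitting them 12/12/12 gives exactly three levels. A minimal sketch of that shape (illustrative, not tcmalloc's or streamflow's actual code):

```cpp
#include <cstdint>
#include <cstdlib>

// Fixed-depth radix tree: page address -> metadata pointer.
// key = addr >> 12 (drop the page offset), split into three 12-bit indices,
// so every lookup is a constant number of dereferences.
struct RadixTree {
    static constexpr int BITS = 12;
    static constexpr size_t FANOUT = size_t(1) << BITS;

    void*** root[FANOUT] = {};   // level 1 -> level 2 -> level 3 -> value

    void insert(uintptr_t addr, void* meta) {
        uintptr_t key = addr >> 12;
        size_t i1 = (key >> 24) & (FANOUT - 1);
        size_t i2 = (key >> 12) & (FANOUT - 1);
        size_t i3 = key & (FANOUT - 1);
        if (!root[i1]) root[i1] = (void***)calloc(FANOUT, sizeof(void**));
        if (!root[i1][i2]) root[i1][i2] = (void**)calloc(FANOUT, sizeof(void*));
        root[i1][i2][i3] = meta;
    }

    void* lookup(uintptr_t addr) const {
        uintptr_t key = addr >> 12;
        size_t i1 = (key >> 24) & (FANOUT - 1);
        size_t i2 = (key >> 12) & (FANOUT - 1);
        size_t i3 = key & (FANOUT - 1);
        if (!root[i1] || !root[i1][i2]) return nullptr;
        return root[i1][i2][i3];
    }
};
```

Interior nodes are allocated lazily, so sparse address spaces stay cheap; a real implementation would also free empty nodes.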
Often I hear people say "well pointer size is constant, so log(64) = O(1)", but that is misleading: if you assume constant pointer size, any function of that is O(1) as well, so any bounded algorithm is O(1). The notation just becomes meaningless.
Asymptotic bounds must be presented and understood in context.
[1] https://en.wikipedia.org/wiki/Predecessor_problem#Mathematic...
> I'm also curious about making something like this work as a general purpose allocator (as alluded to by the README).
In a general purpose allocator, you can just store allocation metadata next to a payload so the lookup from a pointer address becomes O(1). The original TLSF algorithm (as proposed by the paper) was designed this way.
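A minimal sketch of that header-next-to-payload idea (the real TLSF header also links physical neighbors for coalescing; this just shows the O(1) lookup):

```cpp
#include <cstdint>
#include <cstdlib>

// Stash the block's metadata immediately before the pointer handed to the
// caller, so free() can recover it with one subtraction instead of a search.
struct BlockHeader {
    uint32_t size;   // usable payload size
    uint32_t bin;    // which free bin the block came from
};

void* alloc_with_header(uint32_t size, uint32_t bin) {
    auto* h = (BlockHeader*)malloc(sizeof(BlockHeader) + size);
    if (!h) return nullptr;
    h->size = size;
    h->bin = bin;
    return h + 1;                       // payload starts right after the header
}

BlockHeader* header_of(void* payload) {
    return (BlockHeader*)payload - 1;   // O(1): step back one header
}
```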
Months later, I also wrote a general purpose implementation of TLSF for Rust `GlobalAlloc`, which, with some rigorous optimizations, turned out to be faster and smaller than every allocator for WebAssembly and embedded environments I tested, and it's still my go-to choice. It didn't fare too badly even when compared against modern thread-caching allocators in normal use cases. Internal fragmentation is probably worse, though.
Yes, but the downside is that now allocator metadata pollutes the cache. It's super efficient for the allocator, but it may harm efficiency of the actual use of that memory. I believe most production allocators don't use object headers for this reason.
Besides the reverse mapping, you'd also have to get rid of the Vec use in the allocator, but that's fairly straightforward. Additionally, you'd need some sort of logic about requesting slabs of memory from the OS (and returning them when free), which is some extra work.
I don't deserve any credit for this; @SebAaltonen did all the work to write the C++ version. I just did a line-by-line translation.
In tiny systems all allocations of one type may come from a single operation, but in larger systems what fits in a slab will come from distinct concerns with different priorities. You want a different arena for those allocations.
- Is it constant time in algorithmic complexity?
- Is it single threaded (i.e. free from locks)?
- Does it need to fetch memory from the system (not realtime safe)?
There are slab allocators that can satisfy all of these, and are therefore realtime safe.
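For reference, a minimal single-threaded fixed-size slab meeting all three criteria (illustrative, not from any particular implementation): O(1) alloc/free via an intrusive free list, no locks, and no syscalls after setup because the memory is provided up front.

```cpp
#include <cstddef>

template <size_t BlockSize, size_t Count>
struct Slab {
    static_assert(BlockSize >= sizeof(void*), "block must hold a link");
    alignas(std::max_align_t) unsigned char storage[BlockSize * Count];
    void* free_head = nullptr;

    Slab() {
        // thread every block onto the intrusive free list
        for (size_t i = 0; i < Count; ++i) {
            void* b = storage + i * BlockSize;
            *(void**)b = free_head;
            free_head = b;
        }
    }
    void* alloc() {               // O(1): pop the free list
        void* b = free_head;
        if (b) free_head = *(void**)b;
        return b;
    }
    void release(void* b) {       // O(1): push back onto the free list
        *(void**)b = free_head;
        free_head = b;
    }
};
```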
So I'm still wondering what the difference is.
https://github.com/glaretechnologies/glare-core/blob/master/... https://github.com/glaretechnologies/glare-core/blob/master/...
Aw. Why not?
> find the next available bin using 2x LZCNT instructions to make all operations O(1)
Isn't that sort of implementation-defined? The LZCNT operation itself is a loop over all bits -- the fact that the number of bits is constant doesn't make this O(1), it just makes it O(n) where n := 64. But if it were 16 or 32 bits, it might be faster, which means it's not O(1) -- or rather, big-O notation really breaks down here.
That's O(1).
> The LZCNT operation itself is a loop over all bits
That's not how circuits work. You can easily make a circuit of O(log n) depth that returns the base-2 encoded index of the first 1 bit in an n-bit register.
But since n is just 64 here, you're looking at a circuit of AND/OR depth ~8, so it's no surprise that modern CPUs can compute the LZCNT / TZCNT of a 64-bit integer in a single CPU cycle.
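The same log-depth idea can be written in software: each step halves the window, so a 64-bit count-leading-zeros takes six fixed steps, mirroring what a log-depth priority encoder does in hardware. A sketch:

```cpp
#include <cstdint>

// Branchy binary-search CLZ: 6 fixed halving steps for 64 bits.
int clz64(uint64_t x) {
    if (x == 0) return 64;
    int n = 0;
    if ((x >> 32) == 0) { n += 32; x <<= 32; }
    if ((x >> 48) == 0) { n += 16; x <<= 16; }
    if ((x >> 56) == 0) { n += 8;  x <<= 8;  }
    if ((x >> 60) == 0) { n += 4;  x <<= 4;  }
    if ((x >> 62) == 0) { n += 2;  x <<= 2;  }
    if ((x >> 63) == 0) { n += 1; }
    return n;
}
```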
A loop over 2^64 elements is also O(1), but I don't think people would be OK with labeling every program that can run on a 64-bit PC as O(1).
For example, in some sense, anything you can do on a finite computer is O(1) (or alternatively, it's an infinite loop). But that is a very boring sense, and seldom useful for analysis.
There's a formal definition for big-O notation, but most of the parameters to that formal definition are only implied in casual discussions.
Because the global allocator API is defined as free(pointer address) and this uses free(buffer handle).
It would require a reverse lookup structure from address to buffer handle, e.g. red-black tree. Maintaining it would no longer be O(1).
> the fact that the number of bits is constant doesn't make this O(1), it just makes it O(n) where n := 64
O(n) when n is constant is equal to O(1).
But lzcnt uses a fixed number of clock cycles so it's not really O(n) either.
Well, another solution is enlarging each allocation by 8 bytes and storing the buffer handle in front of the returned allocation. So `free(ptr)` would get the handle as `*((ptr as *const u64) - 1)`.
[1] https://github.com/matthieu-m/storage/blob/main/etc/rfc.md
Not necessarily. If you are able to map a block of normal memory in a known location relative to the GPU memory that you are managing with an "offset allocator", it should be possible to directly calculate the metadata location for each "offset". This is how most allocators find arenas/buckets (whatever you want to call them) for an allocation by a pointer.
Something like this:
+-------------------------+ 0x000000 (start of managed memory)
| Metadata |
| |
| |
| ... |
+-------------------------+ 0x001000
| Padding ... |
+-------------------------+ 0x010000
| GPU Memory Block |
| |
| |
| | ~2MB block
| |
| |
| |
+-------------------------+ 0x210000
With this layout, to get to the metadata for an allocation, all you need to do is align down the allocation pointer and calculate the appropriate location in the metadata page. This obviously won't work in all scenarios, but it's a simple and practical way around a map lookup.
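A sketch of that align-down arithmetic; all constants are illustrative, matching the diagram above, and it additionally assumes each managed block is mapped at a 4 MiB-aligned address so the align-down works:

```cpp
#include <cstdint>

constexpr uintptr_t BLOCK_ALIGN = 0x400000;  // blocks mapped at 4 MiB alignment
constexpr uintptr_t GPU_OFFSET  = 0x010000;  // where the GPU region starts
constexpr uintptr_t SLOT_SIZE   = 256;       // allocation granularity
constexpr uintptr_t META_ENTRY  = 16;        // bytes of metadata per slot

// Metadata location for an allocation: pure arithmetic, no map lookup.
uintptr_t metadata_addr(uintptr_t alloc) {
    uintptr_t block = alloc & ~(BLOCK_ALIGN - 1);          // align down
    uintptr_t slot  = (alloc - block - GPU_OFFSET) / SLOT_SIZE;
    return block + slot * META_ENTRY;                      // into metadata page
}
```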
You generally assume for big O purposes, when analyzing sorts for example, that comparisons of elements are constant time, and that swapping elements is constant time.
On an n-bit computer, when dealing with m-bit elements, those assumptions are broadly sound. Comparing two ints doesn’t depend on the ints. Reading or writing the int from memory doesn’t take more or less time either.
But the same algorithm, while it has the same big O relationship to the size of a collection, might have a different constant factor on different CPUs and for different values of m. And you might find that some algorithms that constant factor’s relationship to m has its own big O.
One common place this can bite you is when you try to apply big O analysis that has ‘constant time comparison’ assumption built in to, say, sorting arbitrarily long strings, or inserting into a hash table using arbitrary string keys. Comparing strings or calculating hashes of them is not constant time.
Like, you might have a function with two inputs, n and m, and the run time is like O(m·2^n), but n is fixed at a low number every time you'll run it in practice, so it's really just O(m) for your purposes.
Even if that were false, it’s O(k) where k is the number of bits, which is constant. However, in that case it might be slower than the alternatives.
> Use one set of large mesh buffers per vertex attribute layout
> Problem: Index/vertex buffers have to be re-bound when the mesh changes. This adds overhead when encoding draws and when drawing. It also prevents some optimisations like being able to draw all objects for shadow mapping for a light in one draw.
> Solution(s): [...] Use an appropriate allocator like a port of Sebastian Aaltonen's offset allocator [https://github.com/sebbbi/OffsetAllocator] to manage allocation
Where the "port of Sebastian Aaltonen's offset allocator" is what got linked in this HN submission.
I've said it before, but if drawcall count is a problem in your game, you should complain to the engine vendor. On modern GPUs, there's no reason people should be manually optimizing for drawcall count in 2024. Game engines have the tools to make it a non-issue; they just need to use them.
which has links to a pertinent paper at the end of that page.
And the linked paper doesn't even contain the word 'offset'.