That works but...
Allocators of this kind are usually used for suballocating GPU buffers, so hiding a few bytes of metadata "in-band" can break your alignment requirements, and not all kinds of GPU memory are even accessible from the CPU. To avoid false sharing you would probably also want to give the metadata a full cache line (typically 64 bytes).
For a CPU-only memory allocator your idea could work quite well. It can also be implemented on top of this code without any modifications.
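A rough sketch of how that layering could look, in case it helps. Everything here is made up for illustration: `BumpOffsetAllocator` is a trivial stand-in for the real offset allocator (it never reuses freed ranges), and `OffsetRange`, `InBandCpuAllocator` and `kCacheLine` are hypothetical names, not part of the actual code. The sketch assumes a 64-byte cache line and that the underlying allocator returns offsets that are multiples of 64 whenever every request is a multiple of 64.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an offset allocator: it only hands out offsets
// and never touches the memory they refer to.
struct OffsetRange {
    std::uint32_t offset;
    std::uint32_t handle;  // opaque metadata needed to free the range later
};

class BumpOffsetAllocator {
public:
    static constexpr std::uint32_t kNoSpace = 0xffffffffu;
    explicit BumpOffsetAllocator(std::uint32_t size) : size_(size) {}

    OffsetRange allocate(std::uint32_t bytes) {
        if (bytes > size_ - top_) return {kNoSpace, kNoSpace};
        OffsetRange r{top_, nextHandle_++};
        top_ += bytes;
        ++live_;
        return r;
    }
    void free(OffsetRange) { --live_; }  // a real allocator would recycle the range
    std::uint32_t liveCount() const { return live_; }

private:
    std::uint32_t size_, top_ = 0, nextHandle_ = 0, live_ = 0;
};

constexpr std::size_t kCacheLine = 64;  // assumed cache line size

// CPU-only wrapper: each allocation is preceded by one cache line that holds
// the OffsetRange, so release() needs nothing but the pointer.
class InBandCpuAllocator {
public:
    explicit InBandCpuAllocator(std::uint32_t size)
        : offsets_(size), storage_(size + kCacheLine) {
        // Align the base address so offset 0 starts on a cache line.
        auto addr = reinterpret_cast<std::uintptr_t>(storage_.data());
        base_ = storage_.data() + (kCacheLine - addr % kCacheLine) % kCacheLine;
    }
    InBandCpuAllocator(const InBandCpuAllocator&) = delete;
    InBandCpuAllocator& operator=(const InBandCpuAllocator&) = delete;

    void* allocate(std::uint32_t bytes) {
        // Round the payload up to whole cache lines, then add the header line.
        std::uint64_t payload =
            (std::uint64_t{bytes} + kCacheLine - 1) / kCacheLine * kCacheLine;
        std::uint64_t total = payload + kCacheLine;
        if (total > 0xffffffffu) return nullptr;
        OffsetRange r = offsets_.allocate(static_cast<std::uint32_t>(total));
        if (r.offset == BumpOffsetAllocator::kNoSpace) return nullptr;
        assert(r.offset % kCacheLine == 0);  // the alignment assumption above
        unsigned char* header = base_ + r.offset;
        std::memcpy(header, &r, sizeof r);   // in-band metadata
        return header + kCacheLine;          // payload starts on the next line
    }

    void release(void* p) {
        if (!p) return;
        OffsetRange r;
        std::memcpy(&r, static_cast<unsigned char*>(p) - kCacheLine, sizeof r);
        offsets_.free(r);
    }

    std::uint32_t liveCount() const { return offsets_.liveCount(); }

private:
    BumpOffsetAllocator offsets_;
    std::vector<unsigned char> storage_;
    unsigned char* base_ = nullptr;
};
```

The offset allocator itself is used unchanged; the wrapper just asks for one extra cache line per allocation. The cost is 64 bytes of overhead per allocation plus rounding, which is fine for CPU memory but is exactly the part that doesn't carry over to GPU buffers.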