The memcpy doesn't use VLMAX; it uses a hand-chosen vl. A load can't fault on elements it doesn't load (here, the ones past vl), so the memcpy is fine: it only loads elements it'll definitely need, whereas your original code can read elements past the null byte.
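To make the vl point concrete, here's a rough scalar model of a strip-mined copy (plain C, not real RVV intrinsics). `model_vsetvl` and `model_memcpy` are hypothetical names, and the `min(avl, vlmax)` rule is a simplification of what `vsetvl` is allowed to return. The thing to notice is that the loop never touches a source index at or beyond `n`.

```c
#include <stddef.h>

/* Hypothetical stand-in for vsetvl: ask for avl elements, get at most vlmax.
 * Simplified model; real hardware has some freedom when avl > vlmax. */
static size_t model_vsetvl(size_t avl, size_t vlmax) {
    return avl < vlmax ? avl : vlmax;
}

/* Scalar model of a strip-mined byte copy.
 * Returns one past the highest source index that was read. */
static size_t model_memcpy(unsigned char *dst, const unsigned char *src,
                           size_t n, size_t vlmax) {
    size_t touched = 0;
    for (size_t i = 0; i < n; ) {
        size_t vl = model_vsetvl(n - i, vlmax);   /* never more than what remains */
        for (size_t j = 0; j < vl; j++) {
            dst[i + j] = src[i + j];              /* only elements below vl are accessed */
        }
        i += vl;
        touched = i;
    }
    return touched;
}
```

With `n = 10` and `vlmax = 4` the chunks are 4, 4, 2, so the last pass reads exactly two bytes rather than a full register's worth.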
And yeah, aligning the pointer manually would work, though the code then wouldn't be portable: the spec allows RVV implementations with VLEN up to 65536 bits (8 KiB per register; 64 KiB with LMUL=8), which is larger than the usual 4 KiB pages.
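A quick sketch of that arithmetic in plain C, assuming both the register-group size and the page size are powers of two. The helper names are made up for illustration, and the 4 KiB page size is just the common case, not something the vector spec guarantees.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes covered by one register group: VLEN is in bits, LMUL is a whole multiplier here. */
static size_t vreg_group_bytes(size_t vlen_bits, size_t lmul) {
    return vlen_bits / 8 * lmul;
}

/* A load aligned to its own (power-of-two) size stays within one page
 * only if it is no larger than the page. */
static int aligned_load_fits_page(size_t group_bytes, size_t page_bytes) {
    return group_bytes <= page_bytes;
}

/* Bytes from addr up to the end of its page; a possible upper bound for vl
 * when the group may be larger than a page. */
static size_t bytes_to_page_end(uintptr_t addr, size_t page_bytes) {
    return page_bytes - (size_t)(addr % page_bytes);
}
```

For VLEN=128 and LMUL=1 the group is 16 bytes, so the alignment trick is safe with 4 KiB pages; for VLEN=65536 and LMUL=8 it is 65536 bytes, and an aligned full-width load would span 16 pages.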