I would have expected 2048 items to be the optimal cutover point, because with 16-bit entries that would result in 4KB arrays. Those fit exactly into a typical CPU memory page, whereas 8KB requires two. In the worst case that might double the page table overheads, e.g.: for random point reads.
I wonder if it would be worth the trouble to code-gen a bunch of variants with things like 8-bit entries, and benchmark to death to determine the optimal cutover points...