I had real performance problems a while back attempting to build a very large file. My code could only produce the write instructions out of order, and the file was too big to hold in memory, so what I ended up doing was writing the instructions in radix-grouped batches to a bunch of temporary files, and then reading and replaying them to build the large file.
This seems counter-intuitive, since it more than doubles the amount of data written to disk and adds a reading step, but doing it this way means the data is written in a way the hardware can handle a lot more efficiently: sequential access to and from the instruction files (off a mechanical drive), and densely clustered writes to the big output file. (On an SSD, strictly sequential writes matter less than staying within the same block.)
This reduced the runtime from several hours to about five minutes.
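In case it helps, here's roughly what that approach looks like as a sketch. The function name, the (offset, data) instruction format, and the bucket count are all made up for illustration, not my original code:

```python
import struct
import tempfile

def build_file_external(instructions, out_path, total_size, num_buckets=16):
    """Apply out-of-order (offset, data) write instructions to a large
    output file: first bucket them by offset range into temp files
    (sequential appends), then replay one bucket at a time so the
    writes to the output file stay clustered in one region."""
    bucket_span = (total_size + num_buckets - 1) // num_buckets

    # Phase 1: append each instruction to its bucket file.
    # Appends are strictly sequential per file.
    buckets = [tempfile.TemporaryFile() for _ in range(num_buckets)]
    for offset, data in instructions:
        b = min(offset // bucket_span, num_buckets - 1)
        buckets[b].write(struct.pack("<QI", offset, len(data)) + data)

    # Phase 2: replay each bucket in turn. A single bucket fits in
    # memory, so we can sort it and emit writes in ascending order.
    with open(out_path, "wb") as out:
        out.truncate(total_size)
        for f in buckets:
            f.seek(0)
            entries = []
            while True:
                header = f.read(12)
                if not header:
                    break
                offset, length = struct.unpack("<QI", header)
                entries.append((offset, f.read(length)))
            f.close()
            entries.sort()
            for offset, data in entries:
                out.seek(offset)
                out.write(data)
```

The number of buckets just has to be large enough that one bucket's worth of instructions fits in memory; nothing else about the split matters much.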