>"try random pairs until you find a collision" method works fine and is arbitrarily parallelizable with no storage.
That requires you to do 2^128 checks on average, not 2^64. Again, the problem is that the birthday problem only exists if you have all the hashes available to compare. If you're just doing two hashes at a time and comparing the two, the chances of you getting a match for each try is 1 in 2^128, and this is independent for each attempt. To get a 50% chance you need 2^128 attempts.
If you can't see why that's the case, you can empirically test this, by using a 16-bit "hash" rather than a 128-bit hash: https://pastebin.com/7xW04Jgg
The average I got during one invocation was 58161, which is 2^15.8, not 2^8 as you'd expect. However, if you modify the code to include a storage/retrieval system: https://pastebin.com/AxF9S0q5, the average drops to near 2^8. For my last invocation, I got 309.7 which is 2^8.2