You've pretty much nailed it, yes. That, and because the level of the child hashes isn't hashed in internally, you can construct a file which pretends to be upper hashes. That is potentially not just collidable but actually second-preimagable, given what we saw with the much older MD4-based ones - and this spec used SHA-1, which wasn't a great idea either! (Although, it should be noted, in
(2009) - could a mod mark the headline such?)
The file size being there does complicate an attack - but with the weaknesses in SHA-1, I certainly wouldn't feel comfortable with it.
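To make the "pretends to be upper hashes" point concrete, here's a rough sketch. It is a generic Merkle tree with no leaf/interior distinction, not the spec's exact construction: `naive_root`, the 1024-byte `BLOCK` and the sample data are all my own assumptions, and it deliberately leaves out the file size field.

```python
import hashlib

BLOCK = 1024  # hypothetical leaf size, not taken from the spec


def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def naive_root(data: bytes, block: int = BLOCK) -> bytes:
    """Merkle root where leaves and interior nodes are hashed identically."""
    level = [h(data[i:i + block]) for i in range(0, len(data), block)] or [h(b"")]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            # An unpaired node is promoted unchanged to the next level.
            nxt.append(h(pair[0] + pair[1]) if len(pair) == 2 else pair[0])
        level = nxt
    return level[0]


# A 4-block "real" file.
original = bytes(range(256)) * 16

# The forged file is just the two top-level child hashes glued together.
# It fits in a single leaf, so its root is h(left || right), the same value.
forged = naive_root(original[:2 * BLOCK]) + naive_root(original[2 * BLOCK:])

print(naive_root(forged) == naive_root(original))
print(len(original), len(forged))
```

Note the forged file is 40 bytes against 4096 for the original, which is exactly why a bound file size gets in the way of this particular trick.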
This is a disaster of a spec. We already had TTH at this point, and that at least did it better. This spec needed revising and should not be implemented by anyone.
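For comparison, this is roughly what TTH-style domain separation looks like: leaves are hashed with a 0x00 prefix and interior nodes with a 0x01 prefix. Treat it as a sketch only. Tiger isn't in Python's hashlib, so SHA-256 stands in for it here, and `prefixed_root` and the sample data are names I made up.

```python
import hashlib

BLOCK = 1024                      # leaf size in bytes
LEAF, NODE = b"\x00", b"\x01"     # leaf / interior prefixes, as in TTH


def H(data: bytes) -> bytes:
    # Stand-in hash (assumption): real TTH uses Tiger, not SHA-256.
    return hashlib.sha256(data).digest()


def prefixed_root(data: bytes, block: int = BLOCK) -> bytes:
    """Merkle root with distinct prefixes for leaves and interior nodes."""
    level = [H(LEAF + data[i:i + block]) for i in range(0, len(data), block)] or [H(LEAF)]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            # An unpaired node is promoted unchanged to the next level.
            nxt.append(H(NODE + pair[0] + pair[1]) if len(pair) == 2 else pair[0])
        level = nxt
    return level[0]


original = bytes(range(256)) * 16

# Same trick as against an unprefixed tree: pass off the child hashes as a file.
forged = prefixed_root(original[:2 * BLOCK]) + prefixed_root(original[2 * BLOCK:])

# The forged file is hashed as a leaf (0x00 prefix), the real root as an
# interior node (0x01 prefix), so the two inputs differ and the roots do too.
print(prefixed_root(forged) == prefixed_root(original))
```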
Today, you should consider using BLAKE2b's tree hash for this purpose. It walks all over this construct from every direction.
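If you want to try it, Python's hashlib exposes the BLAKE2b tree parameters directly. The sketch below follows the two-leaf example in the hashlib docs; `two_leaf_root` is a hypothetical helper of mine, the fanout, depth and sizes are arbitrary choices, and it only handles inputs of up to two leaves.

```python
from hashlib import blake2b

FANOUT, DEPTH = 2, 2   # binary tree, two levels (leaves plus root)
LEAF_SIZE = 4096       # bytes per leaf (arbitrary choice)
INNER_SIZE = 64        # digest size of non-root nodes


def two_leaf_root(data: bytes) -> bytes:
    """BLAKE2b tree-mode root for inputs of at most 2 * LEAF_SIZE bytes."""
    params = dict(fanout=FANOUT, depth=DEPTH,
                  leaf_size=LEAF_SIZE, inner_size=INNER_SIZE)
    # node_offset and node_depth go into the parameter block, so every
    # position in the tree is hashed under different parameters.
    left = blake2b(data[:LEAF_SIZE], digest_size=INNER_SIZE,
                   node_offset=0, node_depth=0, last_node=False, **params)
    right = blake2b(data[LEAF_SIZE:], digest_size=INNER_SIZE,
                    node_offset=1, node_depth=0, last_node=True, **params)
    root = blake2b(digest_size=32,
                   node_offset=0, node_depth=1, last_node=True, **params)
    root.update(left.digest())
    root.update(right.digest())
    return root.digest()


root = two_leaf_root(bytes(6000))
print(root.hex())
```

Because the node depth is part of the hash parameters, a leaf can't be passed off as an interior node the way it can in an unprefixed tree.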