By "non-trivial word sizes," I presume you mean larger than a single machine word on most computers? My understanding is that modern BDD/ZDD techniques have gone a long way toward showing how to do the entire operation in a single pass for numbers of most sizes, such that propagating the carry information isn't really how it is done anymore. (This is a genuine question, btw. I am a long way removed from my VLSI classes back in college.)