A simpler problem could be to identify someone based on voice. Is that problem already solved? And can we use this to solve the problem of generating someone's voice?
Identifying a speaker from their voice has been possible for years, and is even a typical student assignment in speech processing courses. A quick search turned up this example course at Cornell.
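To give a feel for why this fits in a student assignment, here is a minimal sketch of the identification step. It is not taken from the course mentioned above. It assumes each recording has already been turned into per-frame feature vectors (13-dimensional, as MFCCs often are), and it replaces real audio with synthetic frames from the hypothetical helper `fake_frames`. The speaker names, the helper, and the single diagonal Gaussian per speaker are all illustration-only assumptions; real systems typically use richer models.

```python
import numpy as np

def fit_speaker_model(features):
    """Fit a diagonal Gaussian to one speaker's frames, shape (n_frames, n_dims)."""
    mean = features.mean(axis=0)
    var = features.var(axis=0) + 1e-6  # small floor so we never divide by zero
    return mean, var

def log_likelihood(features, model):
    """Total log-likelihood of all frames under a diagonal Gaussian model."""
    mean, var = model
    per_dim = -0.5 * (np.log(2 * np.pi * var) + (features - mean) ** 2 / var)
    return per_dim.sum()

def identify(features, models):
    """Return the enrolled speaker whose model best explains the frames."""
    return max(models, key=lambda name: log_likelihood(features, models[name]))

# Synthetic stand-in for real features (assumption: no audio is loaded here).
rng = np.random.default_rng(0)
true_means = {
    "alice": np.linspace(-2.0, 2.0, 13),
    "bob": np.linspace(2.0, -2.0, 13),
}

def fake_frames(name, n_frames):
    """Hypothetical helper: noisy frames scattered around a speaker's mean."""
    return true_means[name] + rng.normal(0.0, 1.0, size=(n_frames, 13))

# "Enroll" each speaker from 200 frames, then identify an unseen utterance.
models = {name: fit_speaker_model(fake_frames(name, 200)) for name in true_means}
predicted = identify(fake_frames("bob", 50), models)
print(predicted)
```

Note that this only scores how well a voice matches an enrolled model. It does not by itself produce audio, so it speaks to the first question rather than the second.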