> One, this tech absolutely could be used to fool someone.
The problem I have here is that it's already not hard to fool people. I don't think it's feasible for us to put something that could be highly beneficial on hold just because we don't want to deal with social education efforts that we kind of need to tackle anyway. Per your example, if we get rid of deepfakes, it's not clear to me that YouTube is going to be any safer. I already wouldn't allow a child to browse YouTube unattended; people already generate the videos you're talking about.
And I know that people are putting this in a different category than general CGI, voice modulation, or consumer-grade apps like Photoshop. I'm not going to argue that it's necessarily wrong for people to be worried, but no matter how many times people tell me that this is fundamentally different, I still haven't seen any serious evidence that this technology is going to be more dangerous than Photoshop, and I think it's going to be much easier to detect than a decent Photoshop job is. Photoshop's content-aware fill tools produce better results than this example, and they arguably require less work to use.
And again... I'm sympathetic to concerns about moving too fast, but I just don't think there's any world, even if you could get rid of deepfakes entirely, where we don't need to be worried about media literacy and general skepticism. If people today don't realize that voices can already be convincingly faked, then that's a really serious problem, and if democratizing that ability causes society in general to become more aware of the potential of disinformation, then honestly that might even be a good thing that we should be encouraging.
So sure, there are legitimate concerns, but in my mind people are focusing on one particular implication that I don't think is especially likely, while ignoring that responding to that concern is probably going to look the same no matter what our position on deepfakes is.
> If you train a model to mimic a performance given by an actor, then use that model and fire the actor, isn't that potentially really problematic?
I think that's a very complicated question. I would not assume that the loss of work for voice actors, who can shift into voice generation roles, is going to be a big enough downside that it outweighs the upside of allowing ordinary people to start generating their own VTuber avatars, or commenting on and building on top of existing culture.