What if you want to start with a subset of the original data? Say you've trained a model and later decide, "You know, this new data we're adding is great, but maybe pulling all those comments from 4chan earlier was a mistake." Wouldn't that require starting fresh with access to the actual data?