Remote updates -- where you replace a full (sub)system -- are one thing, since you can always run the normal software validation procedure on the new version of the software. So an OTA update of a system (even in flight) does not sound like rocket science (yet)...
But: Once you include a REPL or another mechanism to push and execute arbitrary code "ad-hoc", I wonder how that could possibly be tested an validated? Surely as soon as you add the ability to run arbitrary code, there is no way of testing for all possible states of the system as part of the validation process?
In other words, how do you allow the user to push arbitrary code, but prevent them from putting the spacecraft into a condition from which it can not be recovered? The only way I could naively think of would be to only allow the user to push code to a completely isolated CPU that has a remote-reset functionality from the main/comms CPU.
Still, the popsci articles I read made it sound like there might be more to it. It would be excellent to find some first-hand accounts/sources on how this looks like in reality.