I think at that scale you would want a Ceph expert on staff as a full-time salaried position.
For an organization that has 10PB now and can project a growth path to 15, 20, or 25PB in the future, you should talk with management about opening a position for that role and filling it.
> EDIT: btw, we let the dead drives "rot": replacing them would cost more, and the failure rate is not that bad, so they stay in the machine, and we disable them in fstabs, configs, etc.
I am a huge advocate of hosting stuff yourself on bare metal you own, but this is a ridiculous statement. Any drive in that class should come with a 3- or 5-year warranty. And the manual labor and hassle of replacing one (you have hundreds of thousands of dollars of storage and no ready-to-go cold spares on a shelf?!?!) are infinitesimal.