Long-term maintenance is a confounding factor of course. People sometimes write software faster with the downside that it wil be harder to maintain.
So remove the confounding factor (at the same time removing most projects we would like to apply this to) and it becomes an easy problem?
But there are many confounding factors. Your data science project results will be different (different metric values outputed) and you don't know which one is correct, or closer to correct. Now how does the "projects/month" number look?
Or maybe one of the engineers uses way more compute-hours for getting the job done. Their projects/month number is better, but is that really a better outcome? Compute is not free.
By the time you remove most confounding factors, the productivity measure will only apply to an insignificant number of projects.