The method in this paper relies on precomputed camera poses as input, but plenty of papers have been published on eliminating this requirement. Here are a few:
https://dust3r.europe.naverlabs.com/
https://arxiv.org/abs/2102.07064
https://arxiv.org/abs/2312.08760v1
https://x.com/_akhaliq/status/1734803566802407901