That was an excellent write up. And great case study in how to iteratively approach a performance problem. First address the algorithmic issues. Then once the algorithm is no longer the bottleneck, look into dropping into a higher performance, lower overhead language like Rust. Jumping directly to Rust with the original algorithm, would not have helped much. And in the general case, getting the algorithm right might make dropping down into Rust unnecessary.
Very nice work.