I think we generally agree that there are some flawed implementations of the test, like the one you linked, where (according to their paper) the interrogator could answer "unsure" on a bot's response and still count as having been "fooled" by that bot, even if they then answered "human" for a human. That scoring does make giving nonsense answers a legitimate strategy, which I'd claim isn't the case under Turing's specification.
Ultimately I do think Turing's experiment measures something interesting. There's a nice "minimal maximality" to it: it's a simple game, yet it's set up so that solving it encompasses all the facets of intelligence that current humans have. Perhaps coincidentally, that's comparable to testing for Turing completeness, where a Turing machine is conceptually simple yet simulating one proves computational universality. I feel there's a risk of missing that nuance and treating the experiment as just another benchmark to be made "easier" or "harder", akin to saying "simulating a Turing machine is too easy, how about simulating the Numerical Wind Tunnel?"
> Rapid recursive self improvement
I'm a bit sceptical of a hard take-off scenario.
Say on the first pass it cleans up a lot of obvious inefficiencies and improves itself by 50%. On the next pass it has more capacity to work with, but the low-hanging fruit has already been picked, so it probably only squeezes out an extra 10%. To avoid diminishing returns, it would need to automatically build better chip fabrication plants, improve mining equipment, and so on, so that many steps in the pipeline improve at once. All of that will happen eventually, and will contribute to humanity's continuing exponential progress, but IMO it will be a relatively gradual changeover (as is happening now) rather than an overnight explosion triggered by some researcher making a bot that can rewrite itself as soon as it can "actually think", whatever that entails.
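To make the diminishing-returns point concrete, here's a toy model of it. The numbers are purely illustrative assumptions (a 50% gain on the first pass, shrinking geometrically thereafter, matching the 50%-then-10% figures above); the point is only that when each pass's relative gain shrinks fast enough, the compounded capability converges to a finite multiple rather than exploding:

```python
def capability_after(passes, first_gain=0.50, decay=0.2):
    """Compound successive self-improvement passes, where each pass's
    relative gain is a fixed fraction (decay) of the previous one's.
    These parameters are illustrative, not a claim about real systems."""
    capability = 1.0
    gain = first_gain
    for _ in range(passes):
        capability *= 1.0 + gain
        gain *= decay  # low-hanging fruit is gone; the next gain is smaller
    return capability

for n in (1, 2, 5, 50):
    print(n, round(capability_after(n), 4))
```

Under these assumptions the product plateaus below 2x no matter how many passes run; a hard take-off would instead require each pass to unlock gains that don't shrink, which is why the argument hinges on improving the surrounding pipeline (fabs, mining, etc.) and not just the software.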