> Your analogy inadvertently once again emphasizes the point. Now change it from being a rabbit and a turtle, to an unknown animal and some non-animal thing pretending to be another unknown animal. And you have to guess which is which. It would be effectively impossible to figure anything out, because you have absolutely no basis to work from.
It'd be possible to get an idea if there were some box movement unique to animals. That's not particularly interesting, because it's fairly uncontroversial that a box-sized robot could very accurately imitate an animal through the medium of box movement; but a bot imitating a human through the medium of text (seen as a sufficiently general interface to test "almost any one of the fields of human endeavour that we wish to include") is interesting to many.
But the concept the analogy was demonstrating was really just basic reasoning: if you're given X xor Y and have evidence for Y, you should lean towards Y even lacking direct evidence for or against X. Do you agree that, in my example, you would choose the box giving some evidence of being a rabbit over the one that gives none?
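For concreteness, here's a minimal Bayes sketch of that decision rule. All the likelihoods are invented purely for illustration; the only structural assumption is the xor itself (exactly one box holds the rabbit, so the prior odds are 1:1).

```python
# Toy Bayes calculation for the two-box scenario; the probabilities are made up.
p_evidence_if_rabbit = 0.6   # assumed chance a rabbit shows rabbit-like movement
p_evidence_if_other  = 0.1   # assumed chance the impostor shows it

# Observation: box B moved like a rabbit, box A gave no such evidence.
lik_B_is_rabbit = p_evidence_if_rabbit * (1 - p_evidence_if_other)   # B moved, A didn't
lik_A_is_rabbit = p_evidence_if_other * (1 - p_evidence_if_rabbit)   # same observation, roles swapped

# Prior odds are 1:1 because exactly one box contains the rabbit.
posterior_B = lik_B_is_rabbit / (lik_B_is_rabbit + lik_A_is_rabbit)
print(f"P(B is the rabbit | observation) = {posterior_B:.2f}")   # ~0.93
```

Even weak one-sided evidence moves you well away from a coin flip, which is all the analogy needed.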
> LLMs are trained on nothing except the corpus of human knowledge. It is literally impossible for them to e.g. accidentally say something that it's inconceivable for a human to say
Depends on what you mean by "inconceivable", but a bot can certainly say things a human is unlikely to say, purely because of the bot's limitations (at the extreme, consider a Markov chain). And even if it only says things a human could just as well say, if those things are also trivial for a bot to say, they are poor evidence of personhood.
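As a throwaway illustration of that extreme (a toy of my own, not anything from the thread): a first-order Markov chain can only emit word pairs it has already seen, so even when trained on human text it produces output that is locally plausible but drifts in a way no attentive human would.

```python
import random
from collections import defaultdict

# Toy first-order (word-level) Markov chain: it can only follow one word with a
# word that followed it somewhere in its corpus, so long-range coherence is
# impossible by construction.
corpus = ("the judge asks the player a question and the player answers "
          "the question then the judge asks another question").split()

transitions = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev].append(nxt)

word, output = "the", ["the"]
for _ in range(12):
    word = random.choice(transitions[word] or corpus)  # fall back if we hit a dead end
    output.append(word)
print(" ".join(output))
# prints something like: "the player answers the judge asks the question then the player a ..."
```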
> And no, always giving bad answers it not a failing strategy. As I mentioned, the scenario I'm describing is not a hypothetical. The Turing Test (or at least yet another abysmal bastardization of it) [...]
To restate my claims with the relevant emphasis:
> > Always giving bad answers just because humans can also give a bad answer is already a failing strategy with low success rate when the test is carried out as Turing specified
> > Then the real human B would, on average, offer far more compelling evidence of personhood and the bot would fail the majority of the time. I don't see how this issue affects Turing's proposed version of the experiment.
I agree that there are ways to bastardize the test. If, for instance, there is no second player to choose between (no forced call that exactly one of A and B is the bot), then simply remaining silent or incoherent, giving no information either way, can be a reasonable strategy. As with all benchmarks, you also need a sufficient number of repeats so that your margin of error is low enough; fooling a handful of judges does not give a good approximation of the bot's actual rate.
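To make the margin-of-error point concrete, here's a quick sketch using a standard Wilson score interval for the bot's estimated pass rate. The trial counts are illustrative only, not numbers from any actual experiment.

```python
import math

def pass_rate_interval(successes: int, trials: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for the bot's true pass rate."""
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half

# Fooling 3 of 5 judges tells you very little about the true rate...
print(pass_rate_interval(3, 5))      # roughly (0.23, 0.88)
# ...whereas 300 of 500 pins it down far more tightly.
print(pass_rate_interval(300, 500))  # roughly (0.56, 0.64)
```

With a handful of judges the interval spans most of the range, so "fooled n of m judges" headlines say little about the underlying rate.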
I'd even claim it's a bit of a bastardization to use Turing's 30% prediction (of where we'd be by 2000) to reduce the experiment to a simple pass or fail. Ultimately the test gives a metric for which the human benchmark is 50%.