I'm not hopeful that "seeing the map" (multimodal training) will make the difference everyone is hoping for. The transit map carries exactly the same information as the lists, just coded in the learned design language of colored lines and dots. The words-only version should work just as well or better, because it sidesteps the implicit OCR problem of making the model learn from the map image. Indeed, transit maps other than New York's are often heavily abstracted and bear little relation to the underlying geography, so abstract representations (such as lists of words) should be fit for purpose.
Here's another one that fails spectacularly: the digits 0-9 drawn as an ASCII 7-segment display. It gets it mostly correct, but it throws in a few erroneous non-numbers and repeated/disordered/forgotten digits. Asking it for ASCII drawings of simple objects can really go off the rails quickly.
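For reference, a correct rendering is trivially computable, which is what makes the failure so stark. Here's a minimal sketch (my own segment naming, not from the original test) that prints all ten digits as 3-row ASCII 7-segment art:

```python
# Segments: a=top, b=top-right, c=bottom-right, d=bottom,
#           e=bottom-left, f=top-left, g=middle.
SEGMENTS = {
    "0": "abcdef", "1": "bc",      "2": "abdeg",  "3": "abcdg",
    "4": "bcfg",   "5": "acdfg",   "6": "acdefg", "7": "abc",
    "8": "abcdefg", "9": "abcdfg",
}

def seven_segment(digit: str) -> list[str]:
    """Return the three text rows for one digit."""
    on = SEGMENTS[digit]
    row1 = " " + ("_" if "a" in on else " ") + " "
    row2 = ("|" if "f" in on else " ") + ("_" if "g" in on else " ") + ("|" if "b" in on else " ")
    row3 = ("|" if "e" in on else " ") + ("_" if "d" in on else " ") + ("|" if "c" in on else " ")
    return [row1, row2, row3]

# Print 0-9 side by side, one space between digits.
rows = [" ".join(seven_segment(d)[i] for d in "0123456789") for i in range(3)]
print("\n".join(rows))
```

Ten lookups and some string joins; there's no ambiguity for the model to hide behind.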
The failure mode is very consistent. When a prompt forces it to be specific and accurate on an unambiguous topic for 10 or more line items, it will virtually always hallucinate at least one or two, especially if the topic is too simple to hide behind a complex answer. Even if it's learned not to hallucinate 90% of the time, and even if that's good enough to pass at first glance, within a list of 10 things it only has about a 35% chance (0.9^10 ≈ 0.35) of not hallucinating any of them.
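The arithmetic behind that 35% figure is just independent per-item error compounding (the independence assumption is a simplification on my part):

```python
# If the model handles any single item correctly with probability p,
# the chance an n-item list is entirely clean is p**n
# (assuming errors are independent across items).
def p_all_clean(p: float, n: int) -> float:
    return p ** n

print(round(p_all_clean(0.90, 10), 3))  # ~0.349, about a 35% chance
```

The same math says a 99%-per-item model still botches roughly 1 in 10 ten-item lists, which is why long enumerations are such a reliable stress test.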
For what it's worth, it did very well on law questions. Try as I might, it refused to accept that there's a legal category known as "praiseworthy homicide". Though I suspect this has less to do with the underlying model and more to do with OpenAI paying special attention to profitable classes of queries.
I'm sorry to say, I think the problem may be more intrinsic to the current approach to AI. Frankly, LLMs work unreasonably well, far beyond my expectations, but they're making up for the lack of a real first-principles theory of cognition with absurd amounts of parameters and training.