Not sure why you're being downvoted that is exactly why I'm using that simple problem to benchmark LLMs. If an LLM can't figure out how to traverse a graph in its working memory then it has no hope of figuring out how to structure a proof.
Under natural deduction all proofs are sub trees of the graph which is induced by the inference rules from the premise. Right now LLMs can't even do a linear proof if it gets too long when given all the induced vertices.