On my LinkedIn post about this topic, someone replied with a better method of steering LLM output than anything else I've heard of, so until I find time to implement their method I'm not going to worry about this.
tl;dr: you put into the prompt all of the JSON up to the point where you want the LLM to speak, set the stop token to whatever ends the current JSON item (',' or '}' or ']', whatever), and then your code fills out the rest of the JSON syntax until another LLM-generated value is needed.
I hope that makes sense.
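Here's a rough sketch of what I mean, just to make it concrete. This is not a polished implementation; `complete()` is a placeholder for whatever raw-completion endpoint you have (llama.cpp, vLLM, etc.) that takes a prompt string and a list of stop strings, and the example schema with `name` and `age` fields is made up.

```python
def complete(prompt: str, stop: list[str]) -> str:
    """Return the model's continuation of `prompt`, truncated at any stop string."""
    raise NotImplementedError("wire this up to your local model server")


def fill_json(question: str) -> str:
    # Target shape (hypothetical): {"name": <llm value>, "age": <llm value>}
    # Our code emits all of the JSON syntax; the LLM only produces the values.
    out = '{"name": "'
    # The model writes the string body; generation stops before the closing quote.
    out += complete(question + "\n" + out, stop=['"']) + '", "age": '
    # A number ends at the next comma or closing brace, so stop on either.
    out += complete(question + "\n" + out, stop=[',', '}']) + '}'
    return out


print(fill_json("Extract the person: 'Ada Lovelace, born 1815, died at 36.'"))
```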
It is super cool, and I am pretty sure there is a way to make a generator that takes in an arbitrary JSON schema and builds a state machine to do the above.
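Something like the sketch below, I imagine: walk a (simplified) JSON schema and yield alternating steps of "literal syntax our code emits" and "slot the LLM fills, plus the stop strings that terminate it". The schema subset and function names are purely illustrative, and `complete` is the same placeholder as above.

```python
def schema_steps(schema: dict):
    """Yield ('emit', text) and ('generate', stop_strings) steps for a schema."""
    if schema["type"] == "object":
        props = list(schema["properties"].items())
        yield ("emit", "{")
        for i, (key, sub) in enumerate(props):
            yield ("emit", f'"{key}": ')
            yield from schema_steps(sub)
            if i < len(props) - 1:
                yield ("emit", ", ")
        yield ("emit", "}")
    elif schema["type"] == "string":
        yield ("emit", '"')
        yield ("generate", ['"'])          # model writes the string body; we close the quote
        yield ("emit", '"')
    else:                                  # number, boolean, etc.
        yield ("generate", [",", "}", "\n"])  # value ends at the next delimiter


def run(schema: dict, prompt: str, complete) -> str:
    """Drive the state machine: our code emits syntax, the LLM fills the value slots."""
    out = ""
    for kind, payload in schema_steps(schema):
        if kind == "emit":
            out += payload
        else:
            out += complete(prompt + "\n" + out, stop=payload)
    return out
```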
The performance should be great on locally hosted models that use context caching, since each call only has to process the handful of tokens appended since the previous one rather than re-reading the whole prompt.
Eh, I should write this up as a blog post and hope someone else implements it; if not, I'll just do it myself.