We used Triton Inference Server (with a Golang sidecar to translate requests) for model serving and a separate Go app that handled receiving the request, fetching features, sending to Triton, doing other stuff with the response, serving. This scaled to 100k QPS with pretty good performance but does require some hops.
In general writing pure Go inference libraries sucks. Not easy to do array/vector manipulation, not easy to do SIMD/CUDA acceleration, cgo is not go, etc. I wrote a fast XGBoost library at least (https://github.com/stillmatic/arboreal) - it's on par with C implementations, but doing anything more complex is going to be tricky.