Are you asking why use RLHF? Here it's a way to improve step-by-step reasoning. Instead of training the reward model only on the final outcome, they train it to score each step of the problem-solving process (a process reward model), and then fine-tune the policy model against that reward model. Process supervision like this has been shown to substantially improve performance on reasoning tasks.
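To make the process-vs-outcome distinction concrete, here's a toy sketch. Everything in it is illustrative: a real process reward model is a large transformer trained on step-level labels, not a hand-written function, and the `min` aggregation is just one common way to combine step scores.

```python
def toy_prm_step_score(step: str) -> float:
    """Hypothetical stand-in for a learned per-step reward model."""
    # Pretend the model has learned to penalize an arithmetic slip.
    return 0.1 if "2 + 2 = 5" in step else 0.9

def process_reward(steps: list[str]) -> float:
    # Common PRM aggregation: the solution is only as good as its worst step.
    return min(toy_prm_step_score(s) for s in steps)

def outcome_reward(final_answer: str, gold: str) -> float:
    # An outcome reward model only checks whether the end result is right.
    return 1.0 if final_answer == gold else 0.0

solution = ["Let x = 2 + 2.", "So 2 + 2 = 5.", "Therefore x = 5."]
print(process_reward(solution))       # low score: the bad middle step is caught
print(outcome_reward("5", "4"))       # outcome reward also fails here, but it
                                      # can't say WHICH step went wrong
```

The point is that the process reward gives the RL fine-tuning a per-step training signal, rather than a single pass/fail bit at the end.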
Reward models are kind of forgotten in these discussions, but they are substantial transformer models with billions of parameters in their own right. I think companies keep using RLHF because it really does help align the model with human preferences, steer its behavior, and improve performance.