We only tested this with the 14B model. You can see the run here:
https://wandb.ai/bradhilton/rl-experiments/runs/062
Performance peaked after 21 iterations at 45% accuracy, short of the final 59%, but that is still a significant increase from very few samples.