The reference paper for rotary embeddings is RoFormer: https://arxiv.org/pdf/2104.09864v4.pdf
First, you shouldn't rotate the values, only the keys and queries. This line is wrong: v_out = (torch.bmm(v.transpose(0,1), self.R[:m, ...])).transpose(0,1)
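A minimal sketch of this step, assuming the same layout as your snippet (inputs of shape (batch, seq_len, d) and a precomputed stack of rotation matrices self.R of shape (max_len, d, d)); the helper names build_rotary_matrices and apply_rotary are illustrative, not from your code:

```python
import torch

def build_rotary_matrices(max_len, d, base=10000.0):
    # Block-diagonal 2x2 rotation matrices from the RoFormer paper.
    # d must be even; frequencies theta_i = base^(-2i/d).
    theta = base ** (-torch.arange(0, d, 2).float() / d)
    pos = torch.arange(max_len).float()
    angles = pos[:, None] * theta[None, :]  # (max_len, d/2)
    R = torch.zeros(max_len, d, d)
    idx = torch.arange(0, d, 2)
    R[:, idx, idx] = angles.cos()
    R[:, idx + 1, idx + 1] = angles.cos()
    R[:, idx, idx + 1] = -angles.sin()
    R[:, idx + 1, idx] = angles.sin()
    return R

def apply_rotary(x, R):
    # x: (batch, seq_len, d); rotate position m by R[m], same bmm
    # pattern as your snippet. Apply this to q and k ONLY, not v.
    m = x.shape[1]
    return torch.bmm(x.transpose(0, 1), R[:m]).transpose(0, 1)
```

Since each R[m] is orthogonal, the rotation preserves vector norms; only the relative angle between a query at position m and a key at position n survives in the dot product, which is the whole point of RoPE.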
Second, you shouldn't feed the rotated tensors into nn.MultiheadAttention, which has additional internal projection weights that will mix up the rotations you have just applied. This line is wrong: activations, attn_weights = self.multihead(q_out, k_out, v_out)
Instead you should use torch.nn.functional.scaled_dot_product_attention(q_out, k_out, v) (available since PyTorch 2.0), which computes attention directly with no extra learned projections; apply your own q/k/v linear projections before the rotation.
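A sketch of that call, assuming tensors already projected and rotated, in the (batch, num_heads, seq_len, head_dim) layout that scaled_dot_product_attention expects (the shapes here are made up for illustration):

```python
import torch
import torch.nn.functional as F

batch, heads, seq, head_dim = 2, 4, 8, 16
q_out = torch.randn(batch, heads, seq, head_dim)  # rotated queries
k_out = torch.randn(batch, heads, seq, head_dim)  # rotated keys
v = torch.randn(batch, heads, seq, head_dim)      # values, NOT rotated

# No internal weights: just softmax(q k^T / sqrt(d)) v,
# so the rotations on q and k are preserved.
out = F.scaled_dot_product_attention(q_out, k_out, v)
print(out.shape)  # torch.Size([2, 4, 8, 16])
```

If you need masking or dropout, the function takes attn_mask, dropout_p, and is_causal arguments, so you lose nothing by dropping nn.MultiheadAttention here.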
Third, all attention heads should be treated the same way: every head should use the same rotation frequencies, applied to its own head_dim slice.