Code - https://github.com/BerriAI/litellm/blob/main/cookbook/Evalua...
Are others seeing similar results?