How are people evaluating the responses of AI agents these days? I'm particularly interested in conversational flows, where the problem seems more complex because judging a response may require keeping the entire conversation in context.
Any help or resources would be much appreciated!