Ask HN: Who is honestly evaluating AI outputs and how?
Evaluating and benchmarking these models, especially for multimodal conversations, is an increasingly complex topic, and a single frustrating AI interaction can leave customers feeling sour about your whole product or service.
For an in-product AI assistant (with grounding, doc retrieval, and tool calling), I'm having a hard time wrapping my head around how to evaluate and monitor its customer interactions: prompt adherence, correctness, appropriateness, and so on.
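To make the question concrete, here's a minimal sketch of the kind of per-interaction checks I mean (all names here are hypothetical; the grounding heuristic is a crude token-overlap stand-in for what a real stack would do with an LLM-as-judge or NLI model):

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    """One logged assistant turn (hypothetical schema)."""
    question: str
    answer: str
    retrieved_docs: list[str]   # snippets the answer was grounded on
    tool_calls: list[dict]      # e.g. {"name": "search", "error": None}

def grounding_overlap(answer: str, docs: list[str]) -> float:
    """Crude grounding proxy: fraction of answer tokens that appear in
    any retrieved doc. A real eval would use a judge model instead."""
    answer_tokens = set(answer.lower().split())
    doc_tokens = set(" ".join(docs).lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & doc_tokens) / len(answer_tokens)

def evaluate(ix: Interaction) -> dict:
    """Score one interaction on a few cheap, deterministic axes."""
    return {
        "grounding": grounding_overlap(ix.answer, ix.retrieved_docs),
        "used_retrieval": bool(ix.retrieved_docs),
        "tool_errors": sum(1 for t in ix.tool_calls if t.get("error")),
    }
```

Running something like this over every logged turn gives cheap online metrics to trend and alert on, with the expensive judge-model or human review reserved for samples that score badly.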
Any tips or resources that have been helpful to folks investigating this challenge? Would love to learn. What does your stack / process look like?