As an example, creating recipes with Claude Opus based on flavor profiles and preferences feels magical, right up until the point at which it can't accurately convert between tablespoons and teaspoons. It's like the point in the movie where a character is acting nearly right but something is a bit off and then it turns out they're a zombie and going to try to eat your brain. This note taking example feels similar. It nearly works in some pretty impressive ways and then fails at the important details in a way that something able to do the things AI can allegedly do really shouldn't.
It's these failures that make me more and more convinced that while current generation AI can do some pretty cool things if you manage it right, we're not actually on the right track to achieve real intelligence. The persistence of these incredibly basic failure modes even as models advance makes it fairly obvious that continued advancement isn't going to actually address those problems.
So instead of an LLM trying to answer a math or reasoning question by finding a statistical match with other similar groups of words it found on 4chan and the All-In podcast and a terrible recipe for soup written by a terrible cook, it can use a calculator when it needs a calculator answer.
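To make the "use a calculator" idea concrete for the tablespoon/teaspoon case, here is a minimal sketch of the kind of deterministic helper a model could call instead of guessing. The function name `convert_volume` and the unit keys are made up for illustration, and it assumes US customary kitchen measures; how it would actually be wired into any particular model's tool-calling setup is left out.

```python
from fractions import Fraction

# US customary kitchen volumes expressed in teaspoons (exact ratios).
# The unit keys are hypothetical names chosen for this sketch.
TEASPOONS_PER_UNIT = {
    "tsp": Fraction(1),
    "tbsp": Fraction(3),    # 1 tablespoon = 3 teaspoons
    "fl_oz": Fraction(6),   # 1 fluid ounce = 2 tablespoons
    "cup": Fraction(48),    # 1 cup = 16 tablespoons
}


def convert_volume(amount, from_unit, to_unit):
    """Convert exactly between kitchen units; refuse unknown units instead of guessing."""
    for unit in (from_unit, to_unit):
        if unit not in TEASPOONS_PER_UNIT:
            raise ValueError(f"unknown unit: {unit}")
    teaspoons = Fraction(str(amount)) * TEASPOONS_PER_UNIT[from_unit]
    return teaspoons / TEASPOONS_PER_UNIT[to_unit]


print(convert_volume(2, "tbsp", "tsp"))    # 6
print(convert_volume(1.5, "cup", "tbsp"))  # 24
```

The point of the sketch is that the helper either returns an exact answer or raises an error; it never produces a plausible-looking wrong number.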
You ask an LLM "What's wrong with your answer?" and you get pretty good results.
Real intelligence means you have to say "I don't know" when you don't know, or ask for help, or even just say you refuse to help, with the subtext being you don't want to appear stupid.
The models could ostensibly do this when they have low confidence in their own results, but they don't. What I don't know is whether that's because it would be very computationally difficult or because it would harm the reputation of the companies charging a good sum to use them.
I think they're getting better at it, but it's likely just the number of parameters getting bigger and bigger in the SOTA models more than anything.
"Give me your answer and rate each part of it for certainty by percentage" or similar.
They don't like hearing "I don't know"
In other cases, I have seen it miss the mark when the discussion is not very linear, for example when I am going back and forth with the SOC team about their response to a recent alert/incident. It'll get the gist of it right, but if you're relying on it for accuracy, holy hell does it miss the mark.
I can see the LLM take great notes for that initial nurse visit when you're at the hospital: summarize your main issue, weight, height, recent changes, etc. I would not trust it when it comes to a detailed and technical back-and-forth with the doctor. I would think for compliance reasons hospitals would not want to alter the records and only go by transcripts, but what do I know...
She called me back later that night and we chatted for a bit and then she paused and sort of uncertainly was like “So… was there something you were needing to tell me?” And I was completely baffled and was like “Uhhhh I don’t think so…?”
She then explained the notification she got about my call and apparently the LLM summary of my voicemail converted a message consisting of 75% well-meaning but insignificant interpersonal human filler (like most voicemails) into this stilted, overly formal business-y speak with a somewhat ominous tone. Assigning way too much significance to each of the individual statements in the message about wanting to talk (to say happy Mother’s Day), inquiring about her availability ASAP (to say happy Mother’s Day) etc. Plus grossly exaggerating the information density of the call making it sound like I left this rambling, detailed message about needing to tell her something that was left completely vague, but possibly important and also time critical.
Added up, it made her a little worried when she read it and made me a bit pissed that this was the end result of my wishing her well. Because apparently everything needs a half-baked LLM summary crammed into it now.
ALWAYS check your summaries immediately, and contact your doctor ASAP. They can generally fix it themselves, and it's best done when everyone still has some memory of the event.
I'm puzzled by this as well. Why not just generate a transcript and be done with it? If it's a particularly long transcript that's being referenced repeatedly for whatever reason let the humans manually mark it up with a side by side summary when and where they feel the need. At least my experience is that usually these sort of interactions don't have a lot of extraneous data that can be casually filtered out to begin with. The details tend to matter quite a lot!
The businesses offering these services want to say "we are using AI" to their stakeholders, and the government committees who approve this shit don't have the skills or knowledge to evaluate its effectiveness. On top of that, they likely don't even use the tools they have approved for use.
Transcription is both too good, and not good enough. The magic generative content only makes it worse.
Too good: a lot of commercial settings forbid persistent transcription because it makes an easily discoverable record of specific details. That's a business risk that can be mitigated simply by having participant notes or summaries where the secretary can omit sensitive discussion or present consensus without specifics. And notes/summaries also introduce an interpretive defense with some “strategic ambiguity.”
Not good enough: if you look at STT, it's still probabilistic. The actual evaluation output will have just as much data about alternate words/phrases as about the selected choice. That leaves lots of room for creating alternate impressions or representing words that weren't actually spoken. The fact that people _think_ an STT transcript is authoritative only makes this worse.
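A toy sketch of what "probabilistic" means here. The per-word candidates and probabilities below are invented for illustration and do not come from any real recognizer, and `flag_uncertain_words` is a hypothetical helper. The flat transcript keeps only the top candidate for each word, which hides how close the runner-up was.

```python
def flag_uncertain_words(words, min_margin=0.2):
    """Return (position, chosen, runner_up) where the top two candidates are close."""
    flagged = []
    for position, candidates in enumerate(words):
        ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
        if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < min_margin:
            flagged.append((position, ranked[0][0], ranked[1][0]))
    return flagged


# Invented example: each dict maps candidate words to a made-up probability.
hypothesis = [
    {"patient": 0.97, "patience": 0.03},
    {"denies": 0.52, "describes": 0.45},
    {"hip": 0.55, "knee": 0.41},
    {"pain": 0.99},
]

# What ends up in the "authoritative" transcript: only the top pick per word.
transcript = " ".join(max(c, key=c.get) for c in hypothesis)
print(transcript)                        # patient denies hip pain
print(flag_uncertain_words(hypothesis))  # [(1, 'denies', 'describes'), (2, 'hip', 'knee')]
```

In this made-up case the written record says "denies hip pain" while "describes knee pain" was nearly as likely, and nothing in the flat text shows that.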
When you add generative inference on top (e.g. summarization) you exacerbate both problems. I suspect that counsel is more accepting of summaries because they're less likely to contain specific discoverable terms and more likely to diffuse responsibility and specificity, and your judge/jury will be more amenable to “the AI summary is wrong” than “the transcription selected the wrong vowels.”
Diagnosed with Runner's Knee.
AI summary said I was diagnosed with osteoporosis, and had hip pain and walking difficulty, though literally none of that was ever said or implied.
CHECK YOUR TRANSCRIPTS. Always, but especially with LLM transcribers, which fairly frequently include common symptoms which don't exist, or claim a diagnosis which is common and fits a few details but not others. Get them fixed, it can very strongly affect your care and costs later if it's wrong.
Anecdotally, I'd say that outside of a couple very simple and very common things, about 50% of the "AI" summaries I've had have been wrong somewhere. Usually claiming I have symptoms that don't exist, occasionally much more serious and major fabrications like this time.
LLMs are NOT normal speech to text software, and they shouldn't be treated like one. They'll often insert entire sentences that never occurred. In some contexts that might be fine, but definitely not in medical records.
Someone else who couldn't attend the meeting later read that summary and it created a major argument because the topic had been a sore subject for this person due to an ongoing debate at the company. Everyone who attended the meeting confirmed it was an error, but the coincidental timing made it hard for him to accept, because the LLM's summary presented things in a way that validated concerns of his that had previously been minimized by some folks in that meeting.
The drama got heated to the point where management produced a policy about not trusting generative output without independent verification. Seems at least it was a lesson learned.
There are some good uses for AI, but I'm not convinced that this (or many other cases where accuracy matters) is one of them.
Freeing up doctor time, for example: lots of patient visits are messy, the patient is scattered, has multiple issues, and the doctor has tight timelines and regulatory challenges to convey to the patient impacting their care… this is architected for everyone to lose, IMO, even with a perfect transcript. And LLMs can’t be perfect, they auto complete.
I picture patients interacting with an intake AI who can listen to hours of demented rambling, or a patient mid anxiety attack, and provide a caregiver-certified summary of needs, with relevant screening information laid out for doctor confirmation. At that point, helpful information about drug access or insurance policies can be presented, for doctor confirmation, to a patient who can clarify and refine their understanding of the system without time pressures.
Elevating the quality of dialogue so the doctor is more focused on the patient, and the patient’s dialogue needs don’t overwhelm treatment. A lot of medicine is filling out forms and checklists, and I think auto-complete could create efficiencies in how we fulfill that.
“Notice: Any comments made by <name> or on behalf of <organization> that are interpreted by AI in this meeting, may not be accurate.”
I do this in every meeting.
She is a great doctor and thankfully does this due diligence. But it gives me the impression this is forced on doctors without them even wanting it.
If we just postulate that the systems have a high error rate, I wonder why they are being adopted. They seem extremely easy to test, so I don't see why doctors or hospitals or governments should be getting tricked into buying them if they suck.
From the article: "While 30 percent of a platform’s evaluation score depended solely on whether they had a domestic presence in Ontario, the accuracy of medical notes contributed only 4 percent to the total score."
Accuracy wasn't really part of the scoring, Ontario doesn't care about it.
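Rough arithmetic on what those weights imply. This is only a sketch: the 30% and 4% figures are from the quoted article, but the linear weighted sum, the lumping of the remaining 66% into one "everything_else" bucket, and both vendors are my own assumptions for illustration.

```python
# Weights: 30% and 4% are from the quoted article; the 66% remainder is lumped
# into one bucket as an assumption, since the other criteria aren't listed here.
WEIGHTS = {"domestic_presence": 0.30, "note_accuracy": 0.04, "everything_else": 0.66}


def total_score(criteria):
    """Weighted sum of per-criterion scores in [0, 1] (scoring formula is assumed)."""
    return sum(WEIGHTS[name] * criteria[name] for name in WEIGHTS)


# Two hypothetical vendors, identical except for presence and accuracy.
local_but_inaccurate = {"domestic_presence": 1.0, "note_accuracy": 0.0, "everything_else": 0.7}
accurate_but_foreign = {"domestic_presence": 0.0, "note_accuracy": 1.0, "everything_else": 0.7}

print(round(total_score(local_but_inaccurate), 3))
print(round(total_score(accurate_but_foreign), 3))
```

Under those assumptions, a vendor with completely wrong notes and an Ontario office comfortably outscores one with perfect notes and no office.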
Makes me wonder what quality of software the ministry would push (probably judged mostly on qualifications like SOC).
This is apparently the list of approved vendors:
https://www.supplyontario.ca/vor/software/tender-20123-artif...
Not mentioned, as far as I can see: the comparative human mistake rate.
Having seen a lot of medical records, 60% sounds about normal lol.
(And if you already see 60% error rates in standard, pre-AI note taking, how does that not translate into many deaths and injuries? At least one country's health system in the world should have caught that.)
Because most of it is just written down and never looked at again until there’s a lawsuit or something.
I do wonder if people would be pushing AI so hard if their organizations were planning to hold them accountable for mistakes the AI made
I bet if that were the case we'd see a lot slower rollout of AI systems
Or do they use traditional voice recognition algorithms to do that part and then just "fix" the result to look plausible? Which with good quality output might not be much, but with bad can be absolutely everything.
If it is the latter, it seems to me that issues will absolutely happen.
I would expect an "AI Note Taker" to faithfully transcribe the entire conversation, with the same quality I see in a lot of automated video subtitles, i.e. they use the wrong word a lot but it's easy to tell what they mean by context.
Are these tools instead immediately summarising the whole thing, and that summary is the artifact? Because that is a beyond insane way to treat human communication.
> I would expect an "AI Note Taker" to faithfully transcribe the entire conversation, with the same quality I see in a lot of automated video subtitles, i.e. they use the wrong word a lot but it's easy to tell what they mean by context.
That's a reasonable expectation, but it would not be a safe one. Not all transcription tools are made the same. First, it depends on what kind of STT/ASR (speech-to-text / automatic speech recognition) model they are using. A lot of tools like to use some flavor of OpenAI's Whisper model. It works well generally, but I would never use it in a critical use case like healthcare, because it can hallucinate. That's specific to its architecture and how it was trained.
There's a fairly large variety of architectures that can be used for STT/ASR. Some of them are designed for "offline" / "batch" / pre-recorded audio. Some are designed for fast real-time streaming transcription.
There are more factors too like training data. And not just demographics of the speakers in the training data but audio environments too. Was the model trained on echo-y doctor offices with two people being recorded from a crappy smartphone mic or desktop mic? (It could've been! But it's an important distinction.)
And there are more factors than that, but you get the picture (e.g. are they trying to "clean up" the transcript afterwards by feeding it to an LLM, or pre-processing the audio before transcription in an attempt to boost accuracy?).
There's a lot of ways to do it, meaning, there's a lot of ways to screw it up.