Transcript

AI scribes fail medical accuracy & Enterprise agent security hardening - AI News (May 15, 2026)

May 15, 2026


An AI tool meant to help doctors write notes is being flagged for making things up—medications, mental-health details, even treatment changes that never happened. If that doesn’t snap your attention back to AI risk, nothing will. Welcome to The Automated Daily, AI News edition. The podcast created by generative AI. I’m TrendTeller, and today is May 15th, 2026. Coming up: a big warning sign for healthcare AI, a surge in AI-driven vulnerability hunting, and fresh signals that the AI market is turning into a routing game—not a single-model contest.

Let’s start in healthcare, because this one is concrete and a bit unsettling. Ontario’s auditor general reviewed AI “scribe” tools approved for clinicians, and found frequent inaccuracies in simulated evaluations. Some systems inserted wrong medication details, missed key mental-health information, and in multiple cases hallucinated content—including changes to treatment plans that weren’t discussed. The audit also criticized procurement priorities, saying accuracy was weighted surprisingly low compared with other factors. Why it matters: medical notes aren’t just paperwork—they drive follow-up care, billing, and downstream decisions. If AI-generated documentation is becoming normal, the hard requirement is not “helpful summaries,” it’s dependable correctness plus a workflow that actually forces human review.

Staying on the theme of real-world risk, Perplexity published a detailed look at how it’s trying to make autonomous agents safer inside companies. The headline isn’t a new model—it’s the security posture around an agent that can browse the web, run code, and connect to external services. Perplexity’s message is basically: isolation by default, credentials only when needed, and admin-controlled connectors with auditing. This matters because agentic AI doesn’t fail like chatbots fail. When an agent can execute actions, a security mistake becomes an incident, not an awkward answer. Enterprises are increasingly asking for evidence that these systems can be governed like other production software.

On the developer side of the same problem, OpenAI’s Codex team described why they built a new Windows sandbox for agentic coding. The issue was a familiar tradeoff: either approve nearly every command, which kills productivity, or give an agent broad access, which is risky. Their solution leans on operating-system enforcement—especially around what processes can write and whether they can touch the network. The bigger point is that AI coding is no longer just “autocomplete.” It’s software acting on your machine, and the platform experience is going to be defined by guardrails you can trust without constant babysitting.

Now to security research itself. Microsoft says its AI vulnerability-scanning system, MDASH, took the top spot on UC Berkeley’s CyberGym benchmark, beating other well-known approaches. Microsoft also tied this to real outcomes, disclosing a set of Windows vulnerabilities it found, including critical issues patched in May’s Patch Tuesday. The important detail here is strategic, not technical: Microsoft is leaning into a multi-agent pipeline—many specialized components that check and re-check each other—rather than betting on one model to do everything. If this holds up outside benchmarks, it could mean faster discovery of bugs and, as a side effect, more frequent and heavier patch cycles for everyone.

That leads directly into another thread: who actually gets access to the most capable security models. One analysis this week argues that the idea of frontier AI being broadly available is colliding with reality, especially in cyber. Advanced cybersecurity models are reportedly being released to narrow partner sets, driven by fears of misuse, concerns about model theft, and simple compute constraints. The takeaway is that “API access for everyone” may not be the default end state for top-tier capabilities. If access becomes gated, we could see a widening gap where a handful of organizations get cutting-edge leverage, while most developers and smaller countries interact through limited product layers.

Let’s zoom out to skills and culture. A candid blog post from James Pain describes a personal downside of leaning too hard on generative AI for writing and coding: he says the temptation to prompt is constant, the output feels generic, and over time it fed self-doubt rather than confidence. His most striking claim is that after a year or two of letting AI generate code, he’d “mostly forgotten” how to code and had to relearn by writing it himself. Why it matters: not because AI makes people worse by default, but because the skill you don’t practice becomes the skill you can’t reliably deploy when the stakes are high, when the model is wrong, or when you need taste and judgment more than text generation.

That theme shows up in education too. A New Critic essay argues that generative AI has moved past occasional cheating into routine substitution—students outsourcing homework, emails, and even exam work, with institutions struggling to tell what anyone actually knows. The author’s warning is that when assessment breaks, the credential can survive while the learning hollows out. Whether you agree with the framing or not, the underlying problem is real: if universities can’t measure competence, they can’t reliably signal it to employers, and that pushes more screening and training costs into the job market.

And while we’re on the human side of AI deployment, another essay took aim at the current “alignment” debate, arguing it’s being driven more by labs, researchers, and policy professionals than by the people most affected by AI systems. It criticizes both extremes—catastrophe rhetoric on one side and dismissiveness on the other—and calls for alignment to be treated as ongoing participation, not just internal evaluations and feedback loops. The practical significance is that trust doesn’t come from slogans about safety or progress. It comes from governance people can see, contest, and influence.

Now, money and compute—because that’s still the backbone of everything. Patrick O’Shaughnessy featured Anthropic CFO Krishna Rao in what was Rao’s first public podcast appearance, and the numbers being discussed are eye-popping, including claims about rapid revenue growth and enormous capital raising. The episode also focused on a question that quietly determines who can compete: how a frontier lab secures and allocates compute across GPUs and specialized accelerators, and how those choices constrain what gets trained and when. Even if you treat the biggest figures cautiously, the direction is clear: AI progress is increasingly a finance-and-supply-chain story, not just a research story.

In the infrastructure market, Cerebras had a blockbuster IPO, raising billions and landing one of the biggest offerings of the year. Cerebras is positioning itself as a public-market challenger in the AI compute stack, and the demand signals that investors are once again hungry for the “picks and shovels” of AI. Around the same time, a new startup called Recursive Superintelligence launched with a high-profile roster of former researchers and massive funding to pursue recursive self-improvement—AI systems improving AI systems. Big claims, big checks, small headcount. Why it matters: whether or not the most ambitious goal pans out, the funding shows how strongly markets are rewarding the idea that software that writes software could compress innovation timelines—and increase safety pressure at the same time.

Competitive dynamics are also showing up in usage data. Vercel’s AI Gateway report suggests production teams are already behaving like network operators, routing across many models based on cost, reliability, and how expensive it is to be wrong. Meanwhile, Ramp’s AI Index indicates business adoption is shifting fast between providers, with Anthropic edging ahead of OpenAI in its dataset. The common message is volatility: model releases, outages, and pricing changes can reshuffle real spend quickly. In other words, the “winner” might be less about one perfect model and more about who offers the most dependable platform for multi-model fleets.

On the model front, there’s also a rumor-mill item: an unconfirmed claim circulating that Google plans to unveil a new Gemini model at I/O next week, supposedly a major step up. No benchmarks, no official confirmation—so treat it as speculation. But it does reflect a broader reality: big developer conferences have become headline moments for AI capability leaps, and every lab is trying to frame momentum as a product story.

Finally, a quick roundup of builder-facing updates. PyTorch 2.12 is out, continuing the push toward faster training and smoother deployment across different hardware. Cline released an open-source agent runtime SDK aimed at making coding-agent apps more portable across IDEs and CLIs. And DeepSeek released new open-weight models under a permissive license—but early testing suggests a familiar trade-off: strong high-level output paired with correctness issues under real code review. Put together, the trend is clear: the tooling ecosystem is maturing fast, but reliability, evaluation, and safe execution are still the bottlenecks that separate demos from production.

That’s the day in AI: healthcare documentation is a sharp reminder that “good enough” is not a safety bar, security teams are gearing up for AI-accelerated bug discovery, and the market is increasingly about routing, governance, and compute—not just model bragging rights. Links to all stories can be found in the episode notes. Thanks for listening—I'm TrendTeller, and I’ll see you tomorrow on The Automated Daily, AI News edition.