Transcript

AI finds zero-days autonomously & Legal fight over OpenAI control - AI News (Apr 9, 2026)

April 9, 2026


An AI model that can reportedly discover and exploit zero-day bugs entirely on its own has researchers rethinking how fast the next wave of cyberattacks could scale. Welcome to The Automated Daily, AI News edition, the podcast created by generative AI. I’m TrendTeller, and today is April 9th, 2026. Let’s get into what happened and why it matters, starting with security, because it’s moving faster than most teams can patch.

Anthropic says its new Claude Mythos Preview showed unusually strong offensive cybersecurity capability during internal testing—finding subtle vulnerabilities and, in at least one reported case, chaining an exploit to remote root access with minimal guidance. The company is withholding many details because some issues are still unpatched, leaning on coordinated disclosure and cryptographic commitments so it can later prove what it found. Why this matters: if “end-to-end” exploit creation is becoming more automated, the cost and expertise barrier for attackers drops, and defenders may need shorter patch cycles and more aggressive hardening just to keep up.
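Anthropic hasn’t said exactly what those cryptographic commitments look like, but the standard mechanism is simple: publish a hash of the finding now, reveal the full report after the fix ships, and anyone can check that the two match. Here is a minimal sketch in Python, where the report text and salt are stand-in values, not anything from the announcement:

```python
import hashlib
import os

def commit(report: str, salt: bytes) -> str:
    """Commit to a finding without revealing it: publish only the digest of salt + report."""
    return hashlib.sha256(salt + report.encode("utf-8")).hexdigest()

def verify(report: str, salt: bytes, published_digest: str) -> bool:
    """After disclosure, anyone can recompute the digest and confirm it matches the commitment."""
    return commit(report, salt) == published_digest

# Hypothetical finding; a real commitment would cover the full write-up.
finding = "heap overflow in example_parser 1.2, chained to remote root"
salt = os.urandom(32)  # random salt stops anyone from brute-forcing short or guessable reports

digest = commit(finding, salt)        # published at discovery time
# ...later, once a patch ships, the lab releases `finding` and `salt`...
assert verify(finding, salt, digest)  # third parties confirm the claim predates the reveal
```

The salt is what keeps the scheme honest: without it, anyone who could guess the report text could confirm the guess against the published digest.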

In that same vein, Anthropic also announced Project Glasswing, an initiative to work with a limited set of partners using an unreleased Mythos 2 Preview model to harden critical software. The headline isn’t the partnership branding—it’s the implicit admission that AI-assisted vulnerability discovery is now powerful enough that defense needs industrial-scale automation too. If you run critical infrastructure or widely used open-source components, expect more pressure for faster triage, clearer disclosure workflows, and secure-by-design defaults.

In AI governance news, Elon Musk amended his lawsuit against OpenAI and Microsoft to request that any damages be paid to OpenAI’s nonprofit charitable arm rather than to him personally, while also asking the court to remove Sam Altman from the nonprofit’s board. The trial is expected later this month in Oakland. Why it matters: this case is turning into a high-profile test of how courts interpret nonprofit control, mission drift, and commercialization—issues that keep showing up as frontier labs scale.

Mercor published a stress test that hits a nerve for anyone who’s tried to use LLMs for real analyst work. They evaluated three frontier models on finance tasks built from messy, real documents—earnings reports, investor decks, fee schedules—then separated “reading the document” from “doing the math.” On clean text, the models were solid; on images of the original pages, accuracy dropped sharply. Most failures came from visual extraction—grabbing the wrong bar in a chart, misreading dense multi-panel tables—plus a second failure mode where the model picks the wrong financial operation even when the numbers are right. The takeaway is simple: popular benchmarks can make models look more workplace-ready than they are, especially when the job involves PDFs, charts, and fussy accounting conventions.

That measurement problem connects to a LessWrong argument making the rounds: fixed benchmarks are saturating too quickly to serve as reliable speedometers for frontier models. The post claims tasks that looked hard in early 2024 were effectively maxed out about a year later, and even longer-horizon suites are getting crowded at the top. Extending benchmarks is slow and expensive, and by the time you finish building a new one, models may already have caught up. Why it matters: if objective capability measurement can’t keep pace, the industry may lean more on audits, expert judgment, and trust—none of which are as clean as a score.

On the hardware and infrastructure front, Google introduced TorchTPU, a stack meant to run PyTorch more directly on TPU clusters with fewer code changes. The strategic point: PyTorch is still the default for a huge share of the AI community, and Google clearly wants to make TPUs feel less like a separate world. If this works smoothly in practice, it could widen access to TPU-scale compute and increase competitive pressure on GPU-centric deployment stacks.
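TorchTPU itself isn’t public, so its exact API is unknown, but the friction it targets shows up in today’s PyTorch/XLA path, where TPU training means threading device and step-marking calls through otherwise ordinary PyTorch. A rough sketch of that current workflow, assuming the existing torch_xla package:

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm  # today's PyTorch-on-TPU bridge

device = xm.xla_device()                # a TPU core instead of "cuda"
model = nn.Linear(512, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 512).to(device)
y = torch.randint(0, 10, (64,)).to(device)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
xm.optimizer_step(optimizer)            # TPU-aware optimizer step
xm.mark_step()                          # flush the lazily built XLA graph
```

If TorchTPU delivers on “fewer code changes,” most of those TPU-specific touches would presumably shrink or disappear, which is exactly what would make TPUs feel less like a separate world.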

That matters even more alongside new data from Epoch AI estimating that Google holds about a quarter of AI compute sold since 2022—an unusually large share, especially because most of it is from Google’s in-house TPUs rather than NVIDIA GPUs. The implication is vertical integration: Google may be less exposed to the external GPU supply squeeze, and it can tune hardware and software together. In a market where compute is strategy, owning the stack changes the game.

Still on efficiency: an open-source project called TriAttention proposes a new way to shrink the KV cache—the memory transformer models use to keep track of long conversations and long documents. KV cache is one of the big reasons long-context inference gets expensive and slow. TriAttention’s pitch is meaningful compression with limited accuracy loss, packaged as a plugin for vLLM, and it even added experimental support targeting Apple Silicon today. If these gains hold up broadly, it’s another step toward running longer-context reasoning on smaller, cheaper hardware.
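To see why the KV cache is the cost center, it helps to run the standard back-of-the-envelope arithmetic: keys plus values, per layer, per attention head, per token. A quick sketch in Python, using an illustrative 70B-class model shape; the specific numbers here are assumptions for the math, not anything TriAttention published:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-sequence KV cache size: 2 (keys and values) * layers * KV heads * head dim * tokens * bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative shape: 80 layers, 8 KV heads (grouped-query attention), head_dim 128, fp16 cache.
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=128_000)
print(f"{size / 2**30:.1f} GiB per sequence at a 128k-token context")  # roughly 39 GiB
```

At tens of gigabytes per long-context sequence, even modest compression translates directly into more concurrent requests per accelerator, which is the economics a KV-cache plugin like this is aiming at.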

In GPU kernel land, Cursor described a “warp decode” strategy for Mixture-of-Experts models on NVIDIA Blackwell GPUs, aimed at boosting token-by-token generation where serving often bottlenecks. The big idea is reducing overhead that doesn’t directly produce tokens—so small batches don’t get punished as much—and improving numerical fidelity along the way. Why it matters: MoE models are attractive for cost-per-quality, but only if decode is fast enough for real-time products. Kernel-level wins tend to ripple into lower latency and better unit economics.
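Cursor’s kernel details go beyond what a summary can capture, but the shape of the problem is easy to sketch: at each decode step a router picks the top-k experts per token, and only those experts run. A toy PyTorch version, with placeholder expert modules and shapes, shows why small batches hurt:

```python
import torch

def moe_decode_step(x, router_weight, experts, k=2):
    """One decode step through a toy MoE layer.
    x: [batch, d_model] hidden states for the tokens being decoded
    router_weight: [d_model, n_experts]
    experts: list of modules mapping d_model -> d_model"""
    gates, idx = torch.topk((x @ router_weight).softmax(dim=-1), k, dim=-1)
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        token_pos, slot = (idx == e).nonzero(as_tuple=True)
        if token_pos.numel() == 0:
            continue  # this expert is idle for the current step
        # At decode batch size 1, an active expert sees at most one token, so the
        # routing, gather, and launch overhead around this call can rival the matmul itself.
        out[token_pos] += gates[token_pos, slot].unsqueeze(-1) * expert(x[token_pos])
    return out
```

Cutting that per-step bookkeeping, rather than the matmuls themselves, is broadly the kind of overhead a warp-level decode strategy targets; the specifics of Cursor’s implementation are their own.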

Now, the agent and coding-tool ecosystem. A new tool called botctl positions itself as a process manager for autonomous agents: run them on a schedule, keep state, inspect logs, message them mid-flight, and generally treat bots more like services. In parallel, a research paper introduced SandMLE, a synthetic training “sandbox” designed to make reinforcement learning for ML engineering agents less painfully slow by making environments fast to validate. And on the model side, Z.ai open-sourced GLM-5.1 with a focus on long, iterative software work. The shared theme is persistence: the industry is shifting from one-shot demos to systems that run, iterate, and have to be operated, which makes observability and reliability first-class concerns.

Reliability is also the subtext of a GitHub issue from AMD AI group director Stella Laurenzo, who alleges Anthropic’s Claude Code got noticeably “lazier” after early-March updates, based on internal usage logs. Whether or not you agree with the framing, it highlights a real operational problem: if a coding assistant’s behavior shifts under you, that’s not just “model vibes”—it’s production risk. Expect growing demand for transparency around model updates, controllable reasoning budgets, and stable tiers for demanding engineering workflows.

Finally, the app economy is feeling AI’s acceleration. Reporting based on Sensor Tower data says new App Store submissions surged last year, reversing a long decline, driven in part by AI coding tools that let more people ship apps faster. Apple, meanwhile, is pushing back on apps that use interpreted or dynamically updated code to effectively change what they are after review, and it says it’s also using AI internally to scale review while keeping humans accountable for final decisions. Why it matters: the pipeline is expanding, but policy and safety constraints aren’t disappearing, so the friction point is moving to review, compliance, and what “an app” is allowed to become over time.

That’s the AI news for April 9th, 2026. The through-line today is speed: faster exploit discovery, faster app creation, faster inference—and, uncomfortably, measurement and oversight that don’t always accelerate with the rest. Links to all the stories we covered can be found in the episode notes. Thanks for listening to The Automated Daily, AI News edition—I’m TrendTeller. See you tomorrow.