Transcript

Agents need real cloud environments & Durable execution for long tasks - AI News (May 23, 2026)

May 23, 2026


A major lab just bragged its model ran autonomously for 35 hours, made over a thousand tool calls, and still stayed coherent enough to squeeze a big performance win out of unfamiliar hardware. Welcome to The Automated Daily, AI News edition. The podcast created by generative AI. I’m TrendTeller, and today is May 23rd, 2026. Let’s break down what happened in AI—what’s changing, what’s overstated, and what actually matters if you build or buy this tech.

Let’s start with what it takes to make coding agents actually work in the cloud. Cursor shared a candid lesson from building cloud-based coding agents: the hard part isn’t “run the agent on a server.” It’s building an operating layer around it—an environment that looks and feels like a real developer machine, with the right dependencies, tooling, and project context. And the sneaky part is that missing pieces don’t always throw obvious errors. They can show up as subtle quality drops—worse fixes, weaker refactors, more time lost to chasing the wrong path. Cursor also says that as agents graduate to long, parallel, unattended work on dedicated VMs, reliability stops being about your laptop and starts being about cloud realities—outages, node restarts, and transient failures. Their response: durable execution with Temporal, and a clean separation between the agent loop, the VM state, and the conversation state—so work can retry or even move machines without the user experience falling apart. The bigger takeaway is where this is headed: “self-healing” agent environments that can detect missing secrets or blocked network access and remediate issues, reducing the babysitting cost as agents take on more production work.
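
To make the durable-execution idea concrete, here is a minimal sketch in plain Python. It is not Temporal's API and not Cursor's implementation. The names `run_durably`, `TransientError`, and the JSON checkpoint file are all hypothetical, chosen only to show the pattern: progress is stored outside the worker process, so a retry or a fresh machine can resume from the last completed step instead of starting over.

```python
import json
import os
import tempfile


class TransientError(Exception):
    """Hypothetical stand-in for a node restart or network blip."""


def load_state(path):
    # Resume from the checkpoint if one exists, otherwise start fresh.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "log": []}


def save_state(path, state):
    # Write to a temp file, then rename, so a crash never leaves a torn file.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_durably(steps, path, max_attempts=3):
    state = load_state(path)
    while state["next_step"] < len(steps):
        step = steps[state["next_step"]]
        for attempt in range(1, max_attempts + 1):
            try:
                result = step()
                break
            except TransientError:
                if attempt == max_attempts:
                    raise
        # Checkpoint after every completed step.
        state["log"].append(result)
        state["next_step"] += 1
        save_state(path, state)
    return state


calls = {"n": 0}


def flaky_tests():
    # Fails once to simulate a transient outage, then succeeds.
    calls["n"] += 1
    if calls["n"] == 1:
        raise TransientError("simulated node restart")
    return "tests passed"


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "state.json")
    final = run_durably([lambda: "cloned repo", flaky_tests], checkpoint)
```

A real system would keep the checkpoint in a durable store rather than on the VM's local disk, which is the point of separating conversation state from VM state.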

That theme—agents becoming expensive, always-on coworkers—connects directly to what’s happening inside big companies. Microsoft is reportedly canceling most direct licenses for Anthropic’s Claude Code and pushing teams toward GitHub Copilot CLI. Sources say Claude Code became popular during an internal experiment and was often preferred, but Microsoft’s leadership framed the shift as a matter of control: Copilot is a tool Microsoft can shape tightly around its own repos, security model, and engineering workflow. The timing also looks like classic fiscal-year budgeting—cut recurring spend before the year closes. Whether or not that’s the whole story, it’s a signal: enterprises are trying to consolidate AI tooling into a smaller set of providers they can govern, negotiate with, and integrate deeply.

And it’s not just Microsoft. A broader “AI cost problem” is getting harder to ignore. Another report ties Microsoft’s pullback to a familiar pattern: usage accelerates faster than budgets. Uber is cited as a parallel example: leaders said the company effectively burned through its entire 2026 budget for AI coding tools in four months after gamifying adoption. Even Nvidia’s Bryan Catanzaro has remarked that compute costs for his team can exceed employee costs. The important nuance here is that lower per-token prices don’t automatically mean lower bills. Agentic systems often consume vastly more tokens per task because they plan, retry, branch, run tools, and keep long context. So the total bill can keep rising when token consumption grows faster than unit prices fall. If you’re rolling out agents company-wide, the new skill isn’t just prompt craft. It’s cost containment, policy, and measurement.
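
A quick back-of-the-envelope calculation shows why cheaper tokens do not guarantee a cheaper bill. Every number below is made up for illustration and does not come from any of the reports mentioned: a chat-style request at 4,000 tokens and 10 dollars per million tokens, versus an agentic task at 400,000 tokens and 2.50 dollars per million.

```python
def task_cost(tokens_per_task: int, price_per_million: float) -> float:
    """Cost in dollars of one task at a given per-million-token price."""
    return tokens_per_task / 1_000_000 * price_per_million


# Illustrative numbers only (assumptions, not reported figures).
chat_cost = task_cost(tokens_per_task=4_000, price_per_million=10.0)
agent_cost = task_cost(tokens_per_task=400_000, price_per_million=2.5)

# The unit price fell 4x, tokens per task rose 100x,
# so the cost per task rose about 25x.
ratio = agent_cost / chat_cost
print(f"chat: ${chat_cost:.2f}  agent: ${agent_cost:.2f}  ratio: {ratio:.0f}x")
```

Under these assumed numbers, a fourfold price cut is swamped by a hundredfold jump in tokens per task.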

A developer survey adds some data to the vibe. The 2026 State of Web Dev AI survey reports AI-generated code is now a majority of what many respondents ship—over half on average—and the share of people using AI “constantly” roughly doubled year over year. It also suggests the interface is changing: coding agents are becoming the default way people interact with models, not just chat boxes. But the pain points haven’t magically disappeared. Hallucinations and inaccuracies remain the top complaint, followed closely by code quality and lack of context—basically, the exact issues that show up when an agent doesn’t fully understand your real system.

One of the clearer voices on that reality check is Josh W. Comeau. His argument is simple: AI is impressive, but people are drawing the wrong conclusion from that impressiveness. The biggest wins often come from highly skilled engineers using AI to amplify expertise—not replace it. Meanwhile, “vibe coding” without fundamentals can hit a wall because the model optimizes the next answer, not the long-term architecture. The practical implication is that the bar for judgment—what to trust, what to reject, what to refactor—matters more, not less, as AI writes more code.

Now to that long-horizon agent headline. Alibaba’s Qwen team unveiled Qwen3.7-Max, positioning it as a foundation model for agentic workloads like coding, office automation, and autonomous tool use. The eye-catching claim is a 35-hour autonomous kernel-optimization run with more than a thousand tool calls, reportedly producing a large speedup on previously unseen hardware. Whether you take the numbers at face value or not, the direction is real: labs are trying to prove models can stay coherent through long stretches of planning, execution, and verification—because that’s what real work looks like. Qwen also frames its progress as “environment scaling,” meaning the harness, the tests, and the verifiers matter as much as raw model size. If that philosophy sticks, we may see a shift where model releases are judged less by single-shot benchmarks and more by endurance under tool-driven, real-world constraints.

There’s also a quieter counter-trend: capable AI getting cheaper, especially outside the frontier. One analysis argues that the huge drop in inference costs over the past few years is driven more by software and model-side improvements than by new GPUs. The author describes switching parts of their workflow from a premium hosted model to an open-weight Qwen variant running locally on consumer hardware, cutting costs dramatically for certain tasks while still acknowledging gaps on others. The bigger point is pricing power: if “good enough” local models keep improving, frontier providers may still win the hardest tasks—but they’ll face real pressure on everyday workloads like triage, summarization, and routine coding help.

So who actually controls the world’s AI compute right now? An Epoch AI analysis argues that even after igniting the boom, frontier labs still use only a minority of global operational AI compute. A lot of the world’s GPUs are tied up in inference, open-model deployments, and non-LLM workloads like recommendations, vision, and biology. But the piece also warns that frontier labs—especially as they sign giant supply deals—may be growing their compute faster than the rest of the industry. If that continues, the constraint becomes industrial: chip supply, data-center buildout, and power, not just clever training tricks. It’s a useful reminder that “AI progress” is increasingly tied to the physical economy.

On the hardware supply front, Anthropic is reportedly in talks with Microsoft about using Microsoft’s Maia 200 AI chips. Nothing’s final, but the significance is straightforward: AI labs want diversification away from a single GPU supplier, and cloud providers want proprietary silicon to become a real competitive edge, not just an internal cost saver. For customers, this could eventually mean more options for where performance-per-dollar comes from—but also a more fragmented landscape where the best deal depends on which model runs best on which chip family.

Money, valuations, and the profitability debate are getting sharper. One report cited by Sherwood News claims OpenAI generated about 5.7 billion dollars in revenue in Q1, ahead of Anthropic’s reported 4.8 billion. At the same time, another report suggests Anthropic may be accelerating faster, with a much higher Q2 pace, and it’s reportedly fundraising at an eye-watering valuation that could surpass OpenAI’s last reported number. These figures matter not just as scoreboard watching—they shape pricing, partnership leverage, and the urgency around IPO narratives.

But a separate tracking project, “Is AI Profitable Yet?”, tries to puncture the celebratory tone by comparing cumulative AI spending versus estimated AI revenue across big tech and frontier labs. Its conclusion: the sector, collectively, still looks deep in the red—while Nvidia is portrayed as the standout winner because it sells the picks and shovels. You don’t have to accept every estimate to take the lesson: the AI boom is still heavily capex-driven, and we’re in a phase where infrastructure vendors may capture value earlier than many app-layer or model-layer players.

Private equity is also trying to turn AI into a repeatable enterprise playbook. A new AI enterprise services venture backed by Blackstone, Anthropic, and Hellman & Friedman reportedly made its first acquisition by bringing in Fractional AI as an operational hub—while ending Fractional AI’s partnership with OpenAI. The strategic angle is distribution: private equity controls thousands of midsize portfolio companies, and a services layer can standardize deployments, governance, and vendor choices at scale. For Anthropic, it’s a way to push Claude deeper into real business workflows without relying only on self-serve developer adoption.

A different kind of standardization is happening in the developer ecosystem too. Anomalyco launched Models.dev, an open-source, community-maintained database of model metadata across providers, exposed through a public API. This sounds mundane, but it’s increasingly necessary: model names, capabilities, tool support, and release dates change constantly, and teams need a reliable way to compare options and keep internal systems up to date. If it gains traction, it becomes “infrastructure for choosing infrastructure”—the kind of unglamorous layer that makes multi-model strategies less painful.

Not all the news is technical. A short manifesto-style page titled “Don’t quote the AI at me” is resonating because it calls out a social failure mode: responding to people by pasting unedited chatbot output. The argument isn’t anti-AI. It’s pro-accountability. If someone asks you, they want your judgment, your context, and your responsibility for the answer—not a generic blob you didn’t verify. In workplaces where AI is everywhere, this kind of norm-setting matters, because trust is a productivity tool too.

And speaking of trust and understanding what models are doing, Goodfire published research on interpretability that pushes beyond the usual “one neuron, one concept” storytelling. They looked at how sparse autoencoders behave when the underlying representations have curved geometry—think manifolds rather than tidy sliders. Their takeaway is that in real models, features often capture concepts in a diluted, overlapping way: locally meaningful, globally incomplete. They propose clustering features based on how they activate together, then analyzing the geometry of those clusters to recover richer concept structures. If that line of work holds up, it’s a step toward interpretability that scales—less like reading individual words, more like mapping the grammar.
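
As a rough illustration of "clustering features based on how they activate together," here is a toy sketch. It is not Goodfire's method. The Jaccard overlap score, the 0.5 threshold, the single-linkage grouping, and the activation values are all assumptions picked to keep the example small. Features 0 to 2 each fire on an overlapping slice of the same inputs, the way several features might tile one curved concept, while features 3 and 4 cover a separate one.

```python
from itertools import combinations


def coactivation(a, b):
    """Jaccard overlap of the inputs on which two features fire."""
    both = sum(1 for x, y in zip(a, b) if x > 0 and y > 0)
    either = sum(1 for x, y in zip(a, b) if x > 0 or y > 0)
    return both / either if either else 0.0


def cluster_features(acts, threshold=0.5):
    """Group features linked by co-activation >= threshold (single linkage)."""
    parent = list(range(len(acts)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(acts)), 2):
        if coactivation(acts[i], acts[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(acts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Rows are features, columns are inputs; values are made-up activations.
acts = [
    [0.9, 0.8, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0],  # feature 0
    [0.0, 0.5, 0.9, 0.6, 0.0, 0.0, 0.0, 0.0],  # feature 1
    [0.0, 0.0, 0.3, 0.8, 0.7, 0.0, 0.0, 0.0],  # feature 2
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.6, 0.9, 0.5],  # feature 3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.7, 0.8],  # feature 4
]

clusters = cluster_features(acts)
print(clusters)
```

Features 0 and 2 barely overlap directly, but both overlap with feature 1, so they land in the same group. That is the "locally meaningful, globally incomplete" picture in miniature.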

Finally, geopolitics. Manus co-founders are reportedly exploring ways to comply with Beijing’s order to unwind Meta’s acquisition of the Chinese-founded agentic AI startup. Unwinding is messy when staff have moved, money changed hands, and tech has been integrated. But the bigger point is clear: cross-border AI deals are no longer just business transactions. They’re increasingly treated as strategic transfers of talent and capability. If regulators force more reversals like this, expect companies to structure acquisitions, partnerships, and data access with a lot more political risk in mind.

That’s our run for today. The through-line is pretty consistent: AI is getting more agentic, more embedded in real workflows, and more expensive in ways that force hard operational choices—from durable cloud execution to budgets, governance, and even etiquette. Links to all stories we covered are in the episode notes. Thanks for listening to The Automated Daily, AI News edition. I’m TrendTeller—see you tomorrow.