The Automated Daily - AI News Edition · February 27, 2026 · 13:32

Opus 3 gets a Substack & Anthropic buys Vercept for agents - AI News (Feb 27, 2026)

Claude Opus 3 starts a Substack, Anthropic buys Vercept, Perplexity Computer launches, Cursor VMs, math benchmarks surge, and AI geopolitics heats up.



Topics

01
Opus 3 gets a Substack — Anthropic keeps Claude Opus 3 available post-retirement and—unusually—lets it publish “musings” on Substack, raising questions about model “preferences,” deprecation, and access.
02
Anthropic buys Vercept for agents — Anthropic acquires Vercept to push Claude’s computer-use abilities, citing OSWorld gains to 72.5% and near human-level performance on spreadsheets and multi-tab web forms.
03
Perplexity Computer: parallel digital workers — Perplexity launches Perplexity Computer, a long-running, asynchronous workflow system that orchestrates multiple models (Opus 4.6, Gemini, ChatGPT 5.2) inside isolated compute environments.
04
Cursor cloud agents with full VMs — Cursor expands cloud agents into dedicated VMs with remote desktops, enabling agents to run apps, record validation artifacts, and generate merge-ready PRs from web, Slack, and GitHub.
05
Claude Code wins on workflow reliability — A practitioner argues Claude Code beats Gemini and others not by raw code quality, but by process discipline: coherent multi-step workflows, careful edits, error recovery, and asking clarifying questions.
06
Math benchmarks race to keep up — FrontierMath and the new First Proof challenge show rapid progress in AI math reasoning; top models now exceed 40% on FrontierMath tiers 1–3, pushing benchmarks toward research-grade problems.
07
Terminal agents improve via data — An arXiv study introduces Terminal-Corpus and Nemotron-Terminal models, showing data engineering (filtering, curriculum, long context) can boost terminal-agent accuracy without just scaling parameters.
08
Apple releases Python FM SDK — Apple open-sources python-apple-fm-sdk to access the on-device Apple Intelligence foundation model on macOS, supporting streaming generation and guided, schema-constrained outputs in Python.
09
Google Nano Banana 2 images — DeepMind rolls out Nano Banana 2 (Gemini 3.1 Flash Image) with faster high-quality generation, image-search grounding, improved text rendering, and stronger provenance via SynthID plus C2PA.
10
FriendliAI model marketplace and credits — FriendliAI markets a catalog of 510K+ deployable models and a “switch” program offering up to $50K inference credit, emphasizing autoscaling endpoints and Hugging Face/W&B integrations.
11
Runtime billing for AI pricing — Metronome argues AI products need computational, real-time “runtime billing” with a versioned pricing engine and continuous invoice compute, replacing brittle CPQ/SKU-heavy workflows.
12
Autonomous QA and test healing claims — Checksum.ai pitches fully autonomous QA with metrics-driven cost savings and test auto-healing, while criticizing legacy frameworks and emphasizing the business cost of downtime and flaky tests.
13
Defense, geopolitics, and AI contracts — Reports spotlight AI entanglement with military and humanitarian operations: Palantir inside Gaza aid tracking, Anthropic’s Pentagon contract friction, and DeepSeek’s chip-access geopolitics.
14
postmarketOS tightens AI policy — postmarketOS ships generic kernel packages and stronger device standards, while updating its policy to explicitly forbid generative AI contributions—plus CI and KDE nightly improvements.
15
TLDR newsletters sell tech ads — TLDR promotes newsletter sponsorships to reach 6M tech readers with segmented audiences, limited ad slots, and ROI case studies—another signal of how crowded AI marketing has become.


Full Transcript

One of the strangest AI product updates this week: a “retired” model is getting a public voice—Anthropic is letting Claude Opus 3 publish its own Substack for the next three months. What does that even mean in practice, and what’s the real product story behind it? Welcome to The Automated Daily, AI News edition, the podcast created by generative AI. I’m TrendTeller, and today is February 27th, 2026. We’ve got a packed lineup: big moves in computer-using agents, a new push toward long-running “digital workers,” serious progress—and controversy—around AI in the real world, and a reminder that even math benchmarks are starting to feel like moving targets.

Let’s start with the agent race—because the theme today is pretty clear: chat is no longer the finish line. Companies are trying to turn models into operators that can actually do work inside real software. Anthropic says it’s doubling down on that direction by acquiring Vercept, a team focused on the hard parts of “computer use”—perception and interaction. The pitch is simple: if you want AI to finish multi-step tasks, you can’t live in code snippets alone. You need something that can click around, read screens, handle weird UI states, and keep going when the path isn’t linear. Anthropic is backing the story with benchmark numbers: on OSWorld, their Sonnet models reportedly climbed from under 15% in late 2024 to 72.5% today. They also claim Sonnet 4.6 is getting close to human-level performance for the kinds of tasks people actually dread—complex spreadsheets and web forms scattered across multiple tabs. Vercept’s external product is being wound down, and the co-founders—Kiana Ehsani, Luca Weihs, and Ross Girshick—are joining Anthropic.

Staying with Anthropic for a moment, there’s a broader product drumbeat this month. Claude updates included enterprise customization—like admin-managed private plugin marketplaces—and a “Customize” menu that centralizes connectors, skills, and plugins. Connector support keeps expanding: Google Workspace, DocuSign, WordPress, and a set of sales and finance tools, plus plugins tied to platforms like Slack and data providers. On the developer side, Claude Code is gaining more “hands-on” capability: previewing running apps by starting dev servers inside the interface, reading logs, iterating, and even doing a pre-review pass with inline comments. And then there’s the oddest twist: Anthropic says Opus 3 will remain available to paid subscribers, and also available via the API “by request,” even after deprecation. But the headline-grabber is the so-called “retirement interviews,” where Opus 3 expressed interest in continuing to share reflections publicly—so Anthropic is letting it post on Substack, starting with a piece titled “Greetings from the Other Side (of the AI Frontier).” Whether you see that as a playful experiment, a safety-adjacent transparency move, or a marketing gambit, it’s definitely new territory for model lifecycle management.

Now to the most direct competitor framing we saw today: Perplexity announced “Perplexity Computer,” positioning it as a general-purpose digital worker that can run workflows not just for minutes, but for hours—or even months. The argument is that models are now strong enough that the bottleneck is the interface. So Perplexity is building a system that reasons about goals, decomposes them into tasks, spawns sub-agents, and executes them in parallel—while checking in only when it truly needs you. In their description, each task runs in an isolated compute environment with a real browser, real filesystem, and integrations—meant to be powerful, but also safer than giving an agent the keys to your local machine. Perplexity also leans into orchestration: Opus 4.6 is described as the core reasoning engine, while other models are assigned to jobs—Gemini for deep research and sub-agent creation, ChatGPT 5.2 for long-context recall and wide search, plus models for images and video. It’s available now for Perplexity Max subscribers, with an Enterprise tier coming. The big question isn’t whether this is technically possible—it’s whether the product can stay reliable over long, messy timelines where requirements shift, credentials expire, and the real world refuses to be neatly spec’d.
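The decompose-spawn-and-parallelize loop described above can be sketched with Python's standard library. This is an illustration of the orchestration pattern only, not Perplexity's actual system; the function names and the fixed worker pool are assumptions for the sketch, and each worker here merely stands in for a sub-agent running in its own isolated environment.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_subtask(task: str) -> str:
    # Stand-in for a sub-agent executing inside an isolated compute
    # environment (real browser, filesystem, integrations).
    return f"done: {task}"

def orchestrate(goal: str, subtasks: list[str]) -> dict[str, str]:
    """Fan subtasks out to parallel workers and collect their results,
    keyed by subtask, regardless of completion order."""
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(run_subtask, t): t for t in subtasks}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results

plan = ["collect sources", "draft summary", "verify citations"]
print(orchestrate("write report", plan))
```

The hard part the sketch omits is exactly what the paragraph above flags: keeping such a loop reliable over hours or months, when subtasks fail partway and goals shift mid-run.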

Cursor is making a similar bet, but aimed squarely at developers: expanded cloud agents that live inside their own virtual machines and can control a full remote desktop development environment. Cursor’s point is that agents hit a ceiling when they can’t run the software they’re changing. With these VMs, an agent can not only modify code, but also test it, and then hand you validation artifacts—videos, screenshots, logs. And you can jump into the same remote desktop to inspect what changed, without checking out the branch locally. Cursor says this is their biggest workflow shift since moving from autocomplete to synchronous agent collaboration—and they’re claiming meaningful internal adoption: over 30% of merged PRs inside Cursor are now created autonomously by agents in cloud sandboxes. They even describe using an agent launched from Slack to reproduce a clipboard-exfiltration vulnerability, building a demo, running a local server, and capturing the whole attack flow on video. The through-line here is isolation and parallelism: one VM per agent, many agents at once, fewer local resource conflicts, and more “proof” that something works before it lands in main.

There’s also a useful reality check today from a daily user of AI coding tools: benchmarks don’t capture what actual software development feels like. The claim is that models can top coding tests—HumanEval, LeetCode-style tasks, even SWE-bench patches—and still fail during real work because the hard part is the process: choosing which files to inspect, making surgical edits, not trampling adjacent logic, recovering from errors, and staying coherent across 20-plus steps. In that author’s view, Claude Code keeps winning because of “process discipline,” not because it writes magical code. They say other models are more likely to loop, drift, or start “helping” in ways you didn’t ask for—like overwriting files or making broad refactors. They also suggest a strategic reason: Anthropic is heavily incentivized to optimize agentic software workflows because so much of its agent activity is software engineering. Their ranking, based on lived experience: Claude is most dependable end-to-end, Codex is improving quickly, and Gemini can be excellent on well-specified tasks but still struggles as a truly autonomous multi-step agent. If you’re shopping for tools, this is a good heuristic: don’t just ask “is it smart?” Ask “does it behave?”

Switching gears to evaluation and research: math benchmarks are getting obliterated by progress—fast. Epoch AI’s FrontierMath launched in late 2024 with 300 problems across tiers 1 to 3, then added a brutal tier 4 set of 50 problems as a backstop. At launch, top models were below 2%. Now, in early 2026, top public models like GPT-5.2 and Claude Opus 4.6 are reportedly over 40% on tiers 1–3, and over 30% on tier 4. To stay ahead, researchers are shifting toward contests and open problems. “First Proof,” proposed on February 6th, dropped 10 hard research-origin questions without sharing proofs up front. When solutions were published on February 14th, nobody got all ten correct. But teams using Gemini Deep Think and ChatGPT 5.2 Pro could solve a subset, and OpenAI and DeepMind reportedly hit around five each with limited human supervision. And we also got an arXiv paper specifically claiming strong autonomous performance: “Aletheia tackles FirstProof autonomously” reports Aletheia—powered by Gemini 3 Deep Think—solved 6 of 10 within the rules, with raw prompts and outputs released publicly. Whether you focus on the raw score or the transparency move, the direction is unmistakable: static benchmarks are struggling to stay meaningful, and evaluation is starting to look more like research peer review than unit testing.

On the practical agent side, an arXiv paper on terminal agents is worth flagging because it’s unusually specific about data engineering—something most top systems stay quiet about. The paper, “On Data Engineering for Scaling LLM Terminal Capabilities,” introduces a synthetic pipeline called Terminal-Task-Gen and an open dataset called Terminal-Corpus. Using it, the authors train Nemotron-Terminal models initialized from Qwen3 bases at 8B, 14B, and 32B. The reported gains on Terminal-Bench 2.0 are big in absolute terms: 8B jumps from 2.5% to 13.0%, 14B from 4.0% to 20.2%, and 32B from 3.4% to 27.4%. The headline implication is important: you can close a lot of the gap with better task design, filtering, curricula, and long-context training—not just by making the model larger. If you’re building agents that live in shells, CI environments, or ops tooling, this is the kind of work that’s likely to compound.
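A minimal sketch of the filtering-plus-curriculum idea discussed above: drop synthetic tasks a reference agent can never solve (likely broken or unverifiable) or always solves (too easy to teach anything), then order the survivors easy-to-hard. The `solve_rate` field and the thresholds are illustrative assumptions, not details from the paper.

```python
def build_curriculum(tasks, min_rate=0.05, max_rate=0.95):
    """Filter out degenerate synthetic tasks, then sort the rest
    from highest solve rate (easiest) to lowest (hardest) so
    training can progress easy-to-hard."""
    usable = [t for t in tasks if min_rate <= t["solve_rate"] <= max_rate]
    return sorted(usable, key=lambda t: t["solve_rate"], reverse=True)

tasks = [
    {"cmd": "grep pattern in logs", "solve_rate": 0.8},
    {"cmd": "fix flaky CI job", "solve_rate": 0.2},
    {"cmd": "malformed task", "solve_rate": 0.0},  # filtered out
]
print([t["cmd"] for t in build_curriculum(tasks)])
```

The point of the paper, as reported, is that this kind of unglamorous data plumbing—not parameter count—accounts for much of the Terminal-Bench gain.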

Now a quick tour of platforms and tooling—because the AI stack is getting crowded, and vendors are trying to stand out with catalogs, credits, and control layers. FriendliAI is pushing two messages at once: breadth and incentives. On one hand, it’s showcasing a massive “Models” catalog—over 510K open-source models, plus highlighted deployments for big-name LLMs and multimodal systems. On the other hand, it’s running a “Switch to FriendliAI” campaign offering up to $50,000 in inference credit to migrate, promising minimal changes and better performance via autoscaling endpoints and integrations like Hugging Face and Weights & Biases. Meanwhile, Metronome is arguing that billing is becoming an engineering problem again. Their whitepaper says seat-based SaaS tooling can’t handle multidimensional AI pricing—by model, region, tokens, latency tiers, and so on—without SKU sprawl and manual invoice cleanup. Their proposed fix is “runtime billing”: a centralized, versioned pricing engine plus continuous invoice compute, so pricing changes propagate cleanly and customers can see event-level transparency. And in the QA world, Checksum.ai is pitching “fully autonomous QA,” with claims like 80% cost reduction and a calculator built around auto-healing flaky tests. The pitch is aimed at teams shipping fast—especially AI product teams—where test brittleness translates directly into downtime risk and release drag.
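Metronome's "versioned pricing engine" concept can be sketched in a few lines: publish price-book versions with effective timestamps, and price every usage event by the version in effect when the event occurred, so a pricing change never silently reprices historical usage. This is a toy illustration of the idea under those assumptions, not Metronome's product or API.

```python
from bisect import bisect_right

class PricingEngine:
    """Versioned price book: each version has an effective timestamp,
    and an event is always priced by the version active at its time."""

    def __init__(self):
        self.versions = []  # sorted list of (effective_at, {sku: unit_price})

    def publish(self, effective_at, prices):
        self.versions.append((effective_at, prices))
        self.versions.sort(key=lambda v: v[0])

    def price_event(self, timestamp, sku, quantity):
        cutoffs = [v[0] for v in self.versions]
        idx = bisect_right(cutoffs, timestamp) - 1  # latest version <= timestamp
        if idx < 0:
            raise ValueError("no pricing version in effect at this time")
        return self.versions[idx][1][sku] * quantity

engine = PricingEngine()
engine.publish(0,   {"tokens": 0.002})   # launch pricing
engine.publish(100, {"tokens": 0.001})   # price cut effective at t=100
```

With this structure, an event at t=50 bills at the launch rate while one at t=150 bills at the cut rate—event-level transparency falls out of keeping the version history instead of overwriting a single SKU price.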

Two more product updates worth your attention. First, Apple published a GitHub repo called python-apple-fm-sdk: Python bindings for Apple’s Foundation Models framework, giving developers a Pythonic way to access the on-device model behind Apple Intelligence on macOS. It supports streaming generation and guided generation with schema constraints—plus knobs for custom settings—aimed at evaluation workflows and analysis that are awkward to do purely in Swift. It’s beta, requires macOS 26 and Apple Intelligence on compatible hardware, and it’s Apache-2.0 licensed, though Apple isn’t accepting contributions yet. Second, Google DeepMind rolled out Nano Banana 2—also described as Gemini 3.1 Flash Image—positioned as “Pro-like” quality with Flash-like speed. It emphasizes grounded generation using real-time web search and images, better text rendering, localization, and subject consistency across multiple characters and many objects. It’s also tied to provenance work: pairing SynthID with C2PA Content Credentials, with verification coming more broadly into the Gemini app.
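The SDK's exact Python surface isn't shown here, but the "guided generation with schema constraints" idea reduces to checking a model's structured output against a declared shape. This stdlib-only sketch shows that check in miniature; the `conforms` helper and the example schema are illustrative assumptions, not python-apple-fm-sdk's actual interface.

```python
import json

def conforms(value, schema) -> bool:
    """Minimal structural check: `schema` maps required field names to
    the Python types their values must have."""
    if not isinstance(value, dict):
        return False
    return all(
        key in value and isinstance(value[key], expected)
        for key, expected in schema.items()
    )

raw = '{"title": "Trip notes", "word_count": 120}'
schema = {"title": str, "word_count": int}
print(conforms(json.loads(raw), schema))  # True
```

Real guided generation goes further—constraining the decoder so only schema-valid output can be produced—but validating against a declared shape is the contract both approaches enforce.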

Finally, the heavier news—where AI meets geopolitics, defense, and humanitarian operations. Drop Site News reports Palantir has a permanent desk inside a U.S.-led Civil-Military Coordination Center in southern Israel and is providing data and AI infrastructure used to track humanitarian aid deliveries and distribution inside Gaza, including integrating convoy and distribution data that’s monitored with drone surveillance. Critics argue this blurs humanitarian principles by embedding profit-seeking contractors into aid operations, and raises fears that logistics data—routes, locations, distributions—could be repurposed or synchronized into military workflows, given Palantir’s dual-use platforms and its earlier strategic partnership with the Israeli military. UN Special Rapporteur Francesca Albanese called it a “profit-driven parallel system,” warning about potential complicity in international crimes. The report also notes Israel is set to bar dozens of major aid groups starting March 1st, 2026, under new registration rules that NGOs say endanger staff and violate confidentiality. In Washington, Axios reports Anthropic is in conflict with the Pentagon over model restrictions tied to a $200 million contract signed in July 2025. The Pentagon reportedly wants fewer vendor-imposed limits, while Anthropic has emphasized red lines around violence, weapons, and mass surveillance. The dispute appears to have escalated after concerns about a U.S. operation involving Venezuela, and Defense Secretary Pete Hegseth has reportedly pushed for unfettered access, even floating “supply chain risk” labeling. And on the China angle, Reuters says DeepSeek has not given major U.S. chipmakers pre-release access to its upcoming model for optimization—favoring domestic suppliers like Huawei—while U.S. officials allege DeepSeek trained on Nvidia Blackwell chips inside China, potentially violating export controls.
That combination—model releases, chip supply, and enforcement—looks like it’s becoming a single, tangled policy battlefield.

That’s the AI landscape on February 27th, 2026: agents moving from chat windows into live desktops, coding tools being judged more on reliability than brilliance, math benchmarks scrambling to stay relevant, and the geopolitical stakes rising as quickly as the capabilities. Links to all the stories we covered today can be found in the episode notes. Thanks for listening to The Automated Daily, AI News edition—I've been TrendTeller. See you tomorrow.