Transcript
LLMs battle in RTS code & Benchmarks: SWE-bench credibility crisis - AI News (Feb 25, 2026)
February 25, 2026
A new inference chip reportedly hardwires an entire LLM into silicon—weights included—and claims interactive speeds around 17,000 tokens per second per user. If that number holds up, it changes what “real-time AI” can feel like. Welcome to The Automated Daily, AI News edition: the podcast created by generative AI. I’m TrendTeller—and today is February 25th, 2026. We’ve got a packed lineup: models fighting head-to-head in an RTS coding arena, a growing backlash against one of the most-cited coding benchmarks, fresh signs that browsers want to become desktop assistants, and a wave of agent tooling—some experimental, some already landing in production.
Let’s start with something I wish existed years ago: a benchmark that looks less like a multiple-choice exam and more like software trying to survive in the wild. It’s called LLM Skirmish, and it’s a tournament where large language models write code to control units in a real-time strategy match—Screeps-inspired, but boiled down to tight 1v1 fights. Each player begins with a spawn building, a single military unit, and three economic units. Your win condition is simple: destroy the other side’s spawn. If nobody finishes the job by 2,000 frames, the game flips to a score-based decision. What makes this interesting is the tournament structure: five rounds, and after the first round each model is allowed to look at what happened previously and revise its strategy. That turns the benchmark into a test of adaptation—can the model actually improve from feedback when the environment is dynamic and adversarial? The published standings are also telling. Claude Opus 4.5 is out front with an 85% win rate and an ELO around 1778. GPT 5.2 follows at 1625 ELO, then Grok, GLM, and Gemini 3 Pro bringing up the rear. But the weirdest detail is Gemini: it reportedly starts strong—around a 70% win rate in round one with short, aggressive scripts—then collapses to roughly 15% in later rounds. The authors blame something they call “context rot,” basically stuffing too much prior match history into the prompt until it degrades decisions. Two other points worth noting: the benchmark is run inside isolated Docker containers, orchestrated through an open agent harness called OpenCode, with retries if validation fails. And the project talks about cost efficiency, not just raw ELO—suggesting GPT 5.2 may deliver more performance per dollar than the current leader, even if it doesn’t top the chart.
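For listeners following along at home, the Elo-style ratings quoted in those standings update in a standard way after each match: you compare the expected score against the actual result and move both ratings by a scaled difference. A minimal sketch (the K-factor of 32 is a common default, not a value published by the benchmark):

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one decided match."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta
```

Plugging in the reported standings, a 1778-rated leader is expected to beat a 1625-rated challenger roughly 70% of the time, which is consistent with the quoted 85% win rate coming from a broader field of weaker opponents.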
That brings us to a broader theme today: evaluation is getting messy, and some of the old scoreboards are starting to crack. OpenAI is now saying it has stopped reporting SWE-bench Verified results for frontier models, arguing the benchmark no longer reliably measures real-world autonomous software engineering. They point to two big issues. First, test quality. OpenAI audited a subset of tasks where its o3 model failed inconsistently across many runs, and claims well over half had material problems in the tests or descriptions. Some tests are “narrow,” rejecting correct solutions because they enforce specific implementation details. Others are “wide,” expecting behavior that was never stated in the prompt. Second—and more corrosive—contamination. The claim is that some models can reproduce gold patches or hyper-specific repo details, implying training exposure to the underlying open-source repos, issues, pull requests, or benchmark materials. OpenAI even describes an automated red-team approach using GPT‑5 to probe multiple models and have judges rate how severe the leakage looks. The takeaway: as contamination rises, improvements can start reflecting what a model has memorized, not what it can actually engineer. OpenAI’s recommendation is to shift launches toward SWE-bench Pro and to invest in new ‘uncontaminated’ evaluations—privately authored benchmarks and more holistic grading. And if you connect that to LLM Skirmish, you can see the direction of travel: less “solve this one task,” more “operate in a system,” where shortcuts and memorization are harder to hide.
On the research side, there’s also a push to measure—and reduce—wasted reasoning. A new paper from Beihang University and ByteDance China asks a deceptively simple question: do reasoning models implicitly know when to stop thinking? Their answer is basically “yes, but our sampling hides it.” They introduce a metric for redundancy and show that, in many math problems, a model reaches a correct answer partway through a chain-of-thought—and then keeps generating extra steps that add latency and sometimes even hurt accuracy. Their method, called SAGE, is training-free and tries to keep higher-confidence reasoning paths while terminating when an end-of-thinking signal becomes confident. Then they propose a light RL tweak, SAGE-RL, mixing in some SAGE-generated rollouts during training. The reported result across several math benchmarks: accuracy up about 2 points, token usage down roughly 44%. If that holds broadly, it’s a practical lever for anyone paying for reasoning tokens—especially as more products shift from ‘chat’ into multi-step agent loops.
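The core idea of confidence-based termination can be sketched very simply. This is not the paper's actual method, just a toy illustration under one assumption: at each reasoning step you can read off the model's probability of an end-of-thinking signal, and you cut the trace once that probability crosses a threshold (the 0.9 cutoff here is invented for the example).

```python
END_THRESHOLD = 0.9  # hypothetical confidence cutoff, not the paper's value

def truncate_reasoning(steps, threshold=END_THRESHOLD):
    """Given (token, p_end) pairs from a reasoning trace, keep tokens until
    the end-of-thinking signal becomes confident, then stop decoding."""
    kept = []
    for token, p_end in steps:
        kept.append(token)
        if p_end >= threshold:
            break
    return kept
```

On a trace where the answer appears halfway through, the truncation drops the redundant tail, which is the mechanism behind the reported token savings.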
Speaking of long-running agents: OpenAI’s developer cookbook published a pretty stark stress test of long-horizon autonomous coding using GPT‑5.3‑Codex. The setup was intentionally extreme: a blank repository, full tool access, and one goal—build a design tool from scratch—then let it run at “Extra High” reasoning. The agent reportedly ran for about 25 hours, consumed around 13 million tokens, and produced something like 30,000 lines of code. The important part isn’t the raw output—it’s the process. The post emphasizes a disciplined agent loop: plan, implement, validate, repair, and repeat, with continuous verification at milestones. The trick they highlight is “durable project memory,” maintained as persistent markdown files like Prompt.md, Plan.md, Implement.md, and Documentation.md. That’s basically externalized state designed to keep the agent coherent, steerable mid-flight, and reviewable by humans. It’s also a reminder that agent performance is not just about model weights. The scaffolding—the guardrails, tests, and how you store decisions—can be the difference between a 25-hour run that finishes and a 25-hour run that turns into a very expensive wandering monologue.
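The “durable project memory” pattern is easy to sketch: state lives in plain markdown files on disk, the agent reloads them at the top of every loop iteration, and milestones are appended rather than held only in the context window. A minimal version, assuming the four file names mentioned in the post (the function shapes are illustrative, not the cookbook's code):

```python
from pathlib import Path

MEMORY_FILES = ["Prompt.md", "Plan.md", "Implement.md", "Documentation.md"]

def load_memory(root: Path) -> dict:
    """Read the persistent markdown files at the start of each loop
    iteration, so agent state survives context-window resets."""
    return {name: (root / name).read_text() if (root / name).exists() else ""
            for name in MEMORY_FILES}

def checkpoint(root: Path, name: str, section: str) -> None:
    """Append a milestone note to one memory file; because these are plain
    files, humans can review or edit them mid-flight to steer the run."""
    assert name in MEMORY_FILES
    with open(root / name, "a") as f:
        f.write(section.rstrip() + "\n\n")
```

The design choice worth noting: because state is externalized and append-only, a crashed or restarted agent picks up exactly where the files say it left off.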
Now, one reason benchmark numbers and long-horizon demos matter is that everyone’s trying to build stronger models fast—and the lines between legitimate optimization and outright extraction are getting sharper. Anthropic says it detected industrial-scale distillation campaigns aimed at copying Claude’s capabilities by competitors: DeepSeek, Moonshot, and MiniMax. The allegation is massive: more than 16 million exchanges generated via roughly 24,000 fraudulent accounts, with tactics like proxy networks, coordinated account fleets, and prompts tuned for capability extraction rather than normal use. Anthropic claims DeepSeek focused on broad reasoning and even used Claude for rubric-based grading—effectively treating it as a reward model—plus chain-of-thought elicitation. Moonshot allegedly targeted tool use, coding, data analysis, and computer-use agents, later shifting hard into reasoning traces. MiniMax, per Anthropic, went after agentic coding and tool orchestration, and even pivoted quickly after a new Anthropic model release. Anthropic’s broader argument is that illicit distillation doesn’t just steal capability—it can strip safety behaviors, and it can undercut export controls if foreign labs can close gaps by extracting outputs instead of training from scratch. That context is useful when you look at the next story: the hype cycle around rumored model releases.
DeepSeek V4 is being heavily hyped online as a major next-gen coding model, but what’s notable is how mixed the signal is. Some claims look grounded: an architecture called Engram—published on arXiv—that offloads static memory to CPU RAM while keeping dynamic reasoning on the GPU, with the alleged practical goal of reducing VRAM pressure by moving boilerplate and API knowledge off the accelerator. Another plausible thread is sparse-attention work that could make long contexts cheaper, with rumors of repo-scale context and long-context cost reductions. Other claims are more dubious: alleged benchmark leaks, and especially social posts claiming it ‘smokes’ top US models on real GitHub issue fixing, without independent verification. The more sober takeaway: competition is fierce, efficiency gains are real across the market, and what ultimately matters is whether a model is consistent and reliable inside agent tooling—where determinism and debuggability are often worth more than a flashy single-number score.
Let’s switch to product updates—because the assistant wars aren’t only happening inside model releases. They’re also happening in browsers and subscription menus. Perplexity is testing new capabilities for its Comet browser. In development builds, people have spotted a Mac-specific local connector for Apple’s Messages app, building on earlier hints of MCP—Model Context Protocol—connectors. If it ships, it could let the Comet assistant pull relevant Messages context into a chat when you ask a question. That’s a pretty clear step toward the browser as an operating layer: not just web pages, but your desktop communication history, too. It’s powerful—and it’s sensitive—because local data connectors raise immediate questions about permissions, on-device processing, and how much gets sent to remote models. Perplexity is also apparently testing a “Usage and Credits” settings area. The subtext here is pricing pressure. The report claims Perplexity reduced Pro limits sharply from late 2025 into early 2026, pushing heavy users toward a $200/month Max tier. A credit add-on system—similar to what Anthropic does—could be a way to let Pro users buy extra capacity without jumping tiers. And on the OpenAI side, there’s a parallel pricing story: OpenAI is reportedly testing a $100/month ‘ChatGPT Pro Lite’ plan, sitting between Plus at $20 and Pro at $200. That would be a very direct answer to the most complained-about gap in the current lineup, and it likely lines up with more always-on, background agent features that simply cost more compute than standard chat.
Enterprise is also getting a big push. OpenAI announced multiyear partnerships with Accenture, BCG, Capgemini, and McKinsey to deploy its enterprise platform, Frontier—positioned as an “intelligence layer” that connects an organization’s systems and data so AI agents can act in real workflows. OpenAI says these ‘Frontier Alliances’ are about getting from pilots to production faster, leveraging the consultancies’ existing relationships and operational playbooks. Two details stand out. One: OpenAI’s CFO has said enterprises are already about 40% of the business, potentially approaching 50% by year-end. Two: these deployments won’t just be chatbots—they’re explicitly talking about agents that move work through systems. Meanwhile, the macro backdrop is turning more serious. Federal Reserve Governor Lisa Cook warned that AI could drive a “generational shift” in the labor market, with displacement potentially arriving before job creation. She also pointed out a tricky policy implication: if productivity rises while unemployment rises for structural reasons, the usual interest-rate playbook may not fix it without igniting inflation. Translation: the transition could look weird in the data, and the tools to respond may be limited. And markets are clearly jumpy. A viral Substack ‘scenario’ about an AI-driven economic shock reportedly helped rattle US stocks, with some investors reacting even though experts cautioned today’s tools may not be capable of the most extreme outcomes described. Whether you buy the scenario or not, the bigger point is that AI narratives now move markets in a way that used to be reserved for earnings and central bank decisions.
Now for the hardware segment—the part of the show where the numbers start sounding like typos. A company called Taalas unveiled HC1, a ‘model-on-silicon’ inference card that hardwires a single model—Meta’s Llama 3.1 8B—directly into the chip, weights included. The claim is about 17,000 tokens per second per user. They’re arguing the speed comes from removing most programmability and effectively merging storage and computation, leaving on-chip SRAM for KV cache and fine-tuned weights. Public details: TSMC N6 fabrication, an enormous die, PCIe form factor, and around 250 watts. They also acknowledge today’s version leans on aggressive 3–6 bit quantization that can hurt quality, with future iterations aiming to improve fidelity. If the cost-per-token and throughput claims survive broader scrutiny, this points toward a world where “interactive reasoning” becomes normal—because you can simply afford to do more sampling, longer traces, and more retries. In the more traditional semiconductor pipeline, ASML says it hit 1,000 watts of EUV source power under customer-like conditions—up from roughly 600 watts today. The practical outcome is higher wafer throughput, potentially around 330 wafers per hour by 2030 instead of ~220 now. It’s a reminder that even as model architectures evolve, AI progress is still tied to very physical bottlenecks—like how much 13.5-nanometer light you can reliably generate by turning tin droplets into plasma, 100,000 times per second, in a vacuum.
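A quick back-of-envelope calculation shows why 3–6 bit quantization is what makes “model on silicon” plausible at all: weight storage scales linearly with bits per weight, and on-chip SRAM budgets are tiny compared to GPU memory. The numbers below are illustrative (dense weights only, ignoring embeddings overhead and the KV cache):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage for a dense model at a given precision."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal gigabytes

# Llama 3.1 8B at different precisions:
for bits in (16, 6, 4, 3):
    print(f"{bits:>2}-bit: ~{weight_memory_gb(8, bits):.1f} GB")
```

An 8B model drops from roughly 16 GB at FP16 to 3–6 GB in the quantization range Taalas describes, which is the difference between needing external memory and plausibly baking everything into the die.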
Finally, a quick tour of tools and open source—because this is where a lot of “agentic” progress becomes real for developers. Cloudflare engineers say they rebuilt much of the Next.js API surface in under a week—using one engineer plus an AI model—and released it as vinext, an experimental drop-in replacement built on Vite, designed to deploy to Cloudflare Workers with a single command. They’re positioning it as an escape hatch from fragile reverse-engineering approaches like OpenNext, and they’re claiming big directional gains in build speed and client bundle size, with extensive tests and even at least one production deployment. AWS is also leaning into open experimentation with Strands Labs, a separate GitHub org for trying frontier agent techniques without destabilizing the widely used Strands Agents SDK—now reportedly past 14 million downloads. Initial projects focus on robotics integrations, physics-based simulation, and ‘AI functions’ that generate Python functions from intent specs. On the pragmatic side: WorkOS released an official open-source CLI for managing its APIs and environments. And a project called MachineAuth is pitching a self-hosted OAuth 2.0 client-credentials server for machine-to-machine authentication—JWTs, JWKS, scopes, and a ‘zero-database’ JSON-file design. The common thread is clear: as agents proliferate, the boring infrastructure—auth, environment management, reproducible builds—becomes the difference between a demo and a deploy.
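For context on what a client-credentials server like the one MachineAuth describes actually handles: the machine side of the flow is a single form-encoded POST defined by the OAuth 2.0 spec (RFC 6749, section 4.4), and the response is typically a JWT the client can verify against the server's JWKS endpoint. This is a generic stdlib-only sketch, not MachineAuth's API; the URL and credentials are placeholders.

```python
import json
import urllib.parse
import urllib.request

def build_token_request(token_url: str, client_id: str,
                        client_secret: str, scope: str = ""):
    """Assemble a standard client-credentials grant request (RFC 6749 §4.4)."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    }).encode()
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return urllib.request.Request(token_url, data=body,
                                  headers=headers, method="POST")

def fetch_m2m_token(req) -> dict:
    """Send the request; the server responds with a JSON access token."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The “zero-database” angle means the server side of this exchange validates the client_id/client_secret pair against a JSON file instead of a credential store, which keeps self-hosting trivially simple.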
That’s the state of play for February 25th, 2026: benchmarks are being stress-tested, agents are running for days, browsers are inching toward your local apps, and the hardware roadmap is pushing token speeds into a new bracket. As always, links to all stories can be found in the episode notes. I’m TrendTeller—thanks for listening to The Automated Daily, AI News edition.