AI News · February 25, 2026 · 14:34

LLMs battle in RTS code & Benchmarks: SWE-bench credibility crisis - AI News (Feb 25, 2026)

Hardwired LLM chips at 17k tok/s, SWE-bench controversy, RTS LLM battles, Comet local connectors, distillation attacks, and long-horizon Codex agents.


Today's AI News Topics

  1. LLMs battle in RTS code — LLM Skirmish pits models in 1v1 RTS matches using Screeps-style code, tracking ELO, win rates, and in-tournament adaptation as a practical in-context learning benchmark.
  2. Benchmarks: SWE-bench credibility crisis — OpenAI says SWE-bench Verified is no longer reliable due to flawed tests and training contamination, urging the shift to SWE-bench Pro and new private, holistic evaluations.
  3. Efficient reasoning: stop thinking — A Beihang/ByteDance paper proposes SAGE and SAGE-RL to cut redundant chain-of-thought, using end-of-thinking signals to reduce tokens ~44% while improving math accuracy.
  4. Long-horizon agentic coding — OpenAI’s cookbook stress test shows GPT-5.3-Codex running ~25 hours, consuming ~13M tokens, and building a large design tool with “durable project memory” files and guardrails.
  5. Distillation attacks on Claude — Anthropic reports industrial-scale illicit distillation by DeepSeek, Moonshot, and MiniMax via thousands of fraudulent accounts, targeting tool use, coding, and reasoning traces.
  6. DeepSeek V4 hype signals — Community chatter around DeepSeek V4 mixes real research (Engram memory split, sparse attention) with shaky leaks on benchmarks and pricing; the key question is real-world reliability.
  7. AI in browsers and pricing — Perplexity’s Comet explores MCP-based local connectors (including Apple Messages) and a “Usage and Credits” page, while OpenAI is reportedly testing a $100 ChatGPT Pro Lite tier.
  8. Enterprise alliances and labor shifts — OpenAI forms ‘Frontier Alliances’ with major consultancies to deploy agents in enterprises, as the Fed warns AI may raise near-term unemployment and complicate rate policy.
  9. New chips and EUV advances — Taalas claims a ‘model-on-silicon’ card hardwiring Llama 3.1 8B at ~17k tok/s per user, while ASML boosts EUV source power toward higher wafer throughput by 2030.
  10. Open-source tools for agents — Cloudflare’s AI-assisted vinext reimplements much of the Next.js API on Vite for Workers, alongside new OSS utilities like AWS Strands Labs, WorkOS CLI, and MachineAuth for M2M OAuth.

Full Episode Transcript: LLMs battle in RTS code & Benchmarks: SWE-bench credibility crisis

A new inference chip reportedly hardwires an entire LLM into silicon—weights included—and claims interactive speeds around 17,000 tokens per second per user. If that number holds up, it changes what “real-time AI” can feel like. Welcome to The Automated Daily, AI News edition. The podcast created by generative AI. I’m TrendTeller—and today is February 25th, 2026. We’ve got a packed lineup: models fighting head-to-head in an RTS coding arena, a growing backlash against one of the most-cited coding benchmarks, fresh signs that browsers want to become desktop assistants, and a wave of agent tooling—some experimental, some already landing in production.

LLMs battle in RTS code

Let’s start with something I wish existed years ago: a benchmark that looks less like a multiple-choice exam and more like software trying to survive in the wild. It’s called LLM Skirmish, and it’s a tournament where large language models write code to control units in a real-time strategy match—Screeps-inspired, but boiled down to tight 1v1 fights. Each player begins with a spawn building, a single military unit, and three economic units. Your win condition is simple: destroy the other side’s spawn. If nobody finishes the job by 2,000 frames, the game flips to a score-based decision. What makes this interesting is the tournament structure: five rounds, and after the first round each model is allowed to look at what happened previously and revise its strategy. That turns the benchmark into a test of adaptation—can the model actually improve from feedback when the environment is dynamic and adversarial? The published standings are also telling. Claude Opus 4.5 is out front with an 85% win rate and an ELO around 1778. GPT 5.2 follows at 1625 ELO, then Grok, GLM, and Gemini 3 Pro bringing up the rear. But the weirdest detail is Gemini: it reportedly starts strong—around a 70% win rate in round one with short, aggressive scripts—then collapses to roughly 15% in later rounds. The authors blame something they call “context rot,” basically stuffing too much prior match history into the prompt until it degrades decisions. Two other points worth noting: the benchmark is run inside isolated Docker containers, orchestrated through an open agent harness called OpenCode, with retries if validation fails. And the project talks about cost efficiency, not just raw ELO—suggesting GPT 5.2 may deliver more performance per dollar than the current leader, even if it doesn’t top the chart.
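For readers unfamiliar with how a ladder like this keeps score, here is the standard logistic ELO update; the K-factor of 32 and starting ratings are common defaults we're assuming, not published LLM Skirmish parameters.

```python
# Minimal sketch of an ELO-style rating update, as a tournament ladder
# like LLM Skirmish might use. K=32 is an assumed default, not a
# parameter published by the project.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the logistic ELO model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32):
    """Return updated (A, B) ratings after one match.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss.
    """
    ea = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - ea)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - ea))
    return new_a, new_b
```

With the published standings (1778 vs 1625), the model at the top would be expected to win well over half of head-to-head matches, which is consistent with the reported 85% win rate.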

Benchmarks: SWE-bench credibility crisis

That brings us to a broader theme today: evaluation is getting messy, and some of the old scoreboards are starting to crack. OpenAI is now saying it has stopped reporting SWE-bench Verified results for frontier models, arguing the benchmark no longer reliably measures real-world autonomous software engineering. They point to two big issues. First, test quality. OpenAI audited a subset of tasks where its o3 model failed inconsistently across many runs, and claims well over half had material problems in the tests or descriptions. Some tests are “narrow,” rejecting correct solutions because they enforce specific implementation details. Others are “wide,” expecting behavior that was never stated in the prompt. Second—and more corrosive—contamination. The claim is that some models can reproduce gold patches or hyper-specific repo details, implying training exposure to the underlying open-source repos, issues, pull requests, or benchmark materials. OpenAI even describes an automated red-team approach using GPT‑5 to probe multiple models and have judges rate how severe the leakage looks. The takeaway: as contamination rises, improvements can start reflecting what a model has memorized, not what it can actually engineer. OpenAI’s recommendation is to shift launches toward SWE-bench Pro and to invest in new ‘uncontaminated’ evaluations—privately authored benchmarks and more holistic grading. And if you connect that to LLM Skirmish, you can see the direction of travel: less “solve this one task,” more “operate in a system,” where shortcuts and memorization are harder to hide.
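To make the "narrow" versus "wide" failure modes concrete, here is a toy illustration of our own (not drawn from OpenAI's audit): a correct solution to a stated task, rejected by one test that over-constrains it and another that demands behavior the task never specified.

```python
# Toy illustration of "narrow" and "wide" benchmark tests.
# This example is ours, not taken from OpenAI's audit.

# Stated task: "return the unique items, sorted ascending".
def unique_sorted(xs):
    return sorted(set(xs))

# NARROW test: rejects the correct solution by enforcing an
# implementation detail (first-seen order) the task never required.
def narrow_test_passes():
    return unique_sorted([3, 1, 3, 2]) == [3, 1, 2]

# WIDE test: expects behavior (descending order) that was never
# stated in the prompt.
def wide_test_passes():
    return unique_sorted([3, 1, 3, 2]) == [3, 2, 1]
```

Both toy tests fail against a solution that satisfies the stated task, which is exactly the kind of mismatch OpenAI says inflates apparent failure rates.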

Efficient reasoning: stop thinking

On the research side, there’s also a push to measure—and reduce—wasted reasoning. A new paper from Beihang University and ByteDance China asks a deceptively simple question: do reasoning models implicitly know when to stop thinking? Their answer is basically “yes, but our sampling hides it.” They introduce a metric for redundancy and show that, in many math problems, a model reaches a correct answer partway through a chain-of-thought—and then keeps generating extra steps that add latency and sometimes even hurt accuracy. Their method, called SAGE, is training-free and tries to keep higher-confidence reasoning paths while terminating when an end-of-thinking signal becomes confident. Then they propose a light RL tweak, SAGE-RL, mixing in some SAGE-generated rollouts during training. The reported result across several math benchmarks: accuracy up about 2 points, token usage down roughly 44%. If that holds broadly, it’s a practical lever for anyone paying for reasoning tokens—especially as more products shift from ‘chat’ into multi-step agent loops.
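The termination idea can be sketched in a few lines. This is our paraphrase of the mechanism as described here, not the paper's implementation: stop emitting reasoning steps once the model's confidence in an end-of-thinking signal crosses a threshold. The step/probability interface below is hypothetical; a real system would read this from the decoder's logits.

```python
# Hedged sketch of confidence-based early termination of
# chain-of-thought, in the spirit of SAGE as summarized above.
# The interface (precomputed per-step probabilities) is illustrative.

def truncate_reasoning(steps, eot_probs, threshold=0.9):
    """Keep reasoning steps until the end-of-thinking signal is confident.

    steps     -- list of generated reasoning-step strings
    eot_probs -- probability of the end-of-thinking token after each step
    threshold -- confidence level at which the chain is cut short
    """
    kept = []
    for step, p in zip(steps, eot_probs):
        kept.append(step)
        if p >= threshold:  # model is confident it could stop here
            break
    return kept
```

The appeal of a training-free version is that it composes with any sampler: you only need the probability the model assigns to ending its thinking block at each step.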

Long-horizon agentic coding

Speaking of long-running agents: OpenAI’s developer cookbook published a pretty stark stress test of long-horizon autonomous coding using GPT‑5.3‑Codex. The setup was intentionally extreme: a blank repository, full tool access, and one goal—build a design tool from scratch—then let it run at “Extra High” reasoning. The agent reportedly ran for about 25 hours, consumed around 13 million tokens, and produced something like 30,000 lines of code. The important part isn’t the raw output—it’s the process. The post emphasizes a disciplined agent loop: plan, implement, validate, repair, and repeat, with continuous verification at milestones. The trick they highlight is “durable project memory,” maintained as persistent markdown files like Prompt.md, Plan.md, Implement.md, and Documentation.md. That’s basically externalized state designed to keep the agent coherent, steerable mid-flight, and reviewable by humans. It’s also a reminder that agent performance is not just about model weights. The scaffolding—the guardrails, tests, and how you store decisions—can be the difference between a 25-hour run that finishes and a 25-hour run that turns into a very expensive wandering monologue.
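The "durable project memory" pattern is simple enough to sketch: externalize agent state as persistent markdown files so a long run stays coherent and reviewable. The file names mirror the cookbook post, but the helpers below are our illustration of the pattern, not OpenAI's actual harness.

```python
# Sketch of the durable-project-memory pattern: persistent markdown
# files act as externalized agent state. Helper functions are
# illustrative assumptions, not OpenAI's harness.

from pathlib import Path

MEMORY_FILES = ["Prompt.md", "Plan.md", "Implement.md", "Documentation.md"]

def init_memory(root: Path) -> None:
    """Create empty memory files so every loop iteration sees stable state."""
    root.mkdir(parents=True, exist_ok=True)
    for name in MEMORY_FILES:
        f = root / name
        if not f.exists():
            f.write_text(f"# {name[:-3]}\n")

def log_decision(root: Path, filename: str, note: str) -> None:
    """Append a decision record; a human can review or edit it mid-run."""
    with (root / filename).open("a") as f:
        f.write(f"- {note}\n")

def load_context(root: Path) -> str:
    """Concatenate memory files into the context fed back to the agent."""
    return "\n\n".join((root / n).read_text() for n in MEMORY_FILES)
```

Because the state lives in plain files rather than the model's context window, an operator can steer a 25-hour run by editing Plan.md, and the agent re-reads the change on its next loop iteration.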

Distillation attacks on Claude

Now, one reason benchmark numbers and long-horizon demos matter is that everyone’s trying to build stronger models fast—and the lines between legitimate optimization and outright extraction are getting sharper. Anthropic says it detected industrial-scale distillation campaigns by competitors DeepSeek, Moonshot, and MiniMax, aimed at copying Claude’s capabilities. The allegation is massive: more than 16 million exchanges generated via roughly 24,000 fraudulent accounts, with tactics like proxy networks, coordinated account fleets, and prompts tuned for capability extraction rather than normal use. Anthropic claims DeepSeek focused on broad reasoning and even used Claude for rubric-based grading—effectively treating it as a reward model—plus chain-of-thought elicitation. Moonshot allegedly targeted tool use, coding, data analysis, and computer-use agents, later shifting hard into reasoning traces. MiniMax, per Anthropic, went after agentic coding and tool orchestration, and even pivoted quickly after a new Anthropic model release. Anthropic’s broader argument is that illicit distillation doesn’t just steal capability—it can strip safety behaviors, and it can undercut export controls if foreign labs can close gaps by extracting outputs instead of training from scratch. That context is useful when you look at the next story: the hype cycle around rumored model releases.

DeepSeek V4 hype signals

DeepSeek V4 is being heavily hyped online as a major next-gen coding model, but what’s notable is how mixed the signal is. Some claims look grounded: an architecture called Engram—published on arXiv—splits static memory into CPU RAM while leaving dynamic reasoning on GPU, with the alleged practical goal of reducing VRAM pressure by offloading boilerplate and API knowledge. Another plausible thread is sparse attention work that could make long contexts cheaper, with rumors of repo-scale context and long-context cost reductions. Other claims are more dubious: alleged benchmark leaks, and especially social posts claiming it ‘smokes’ top US models on real GitHub issue fixing without independent verification. The more sober takeaway is: competition is fierce, efficiency gains are real across the market, and the only thing that matters is whether a model is consistent and reliable inside agent tooling—where determinism and debuggability are often more valuable than a flashy single-number score.

AI in browsers and pricing

Let’s switch to product updates—because the assistant wars aren’t only happening inside model releases. They’re also happening in browsers and subscription menus. Perplexity is testing new capabilities for its Comet browser. In development builds, people have spotted a Mac-specific local connector for Apple’s Messages app, building on earlier hints of MCP—Model Context Protocol—connectors. If it ships, it could let the Comet assistant pull relevant Messages context into a chat when you ask a question. That’s a pretty clear step toward the browser as an operating layer: not just web pages, but your desktop communication history, too. It’s powerful—and it’s sensitive—because local data connectors raise immediate questions about permissions, on-device processing, and how much gets sent to remote models. Perplexity is also apparently testing a “Usage and Credits” settings area. The subtext here is pricing pressure. The report claims Perplexity reduced Pro limits sharply from late 2025 into early 2026, pushing heavy users toward a $200/month Max tier. A credit add-on system—similar to what Anthropic does—could be a way to let Pro users buy extra capacity without jumping tiers. And on the OpenAI side, there’s a parallel pricing story: OpenAI is reportedly testing a $100/month ‘ChatGPT Pro Lite’ plan, sitting between Plus at $20 and Pro at $200. That would be a very direct answer to the most complained-about gap in the current lineup, and it likely lines up with more always-on, background agent features that simply cost more compute than standard chat.

Enterprise alliances and labor shifts

Enterprise is also getting a big push. OpenAI announced multiyear partnerships with Accenture, BCG, Capgemini, and McKinsey to deploy its enterprise platform, Frontier—positioned as an “intelligence layer” that connects an organization’s systems and data so AI agents can act in real workflows. OpenAI says these ‘Frontier Alliances’ are about getting from pilots to production faster, leveraging the consultancies’ existing relationships and operational playbooks. Two details stand out. One: OpenAI’s CFO has said enterprises are already about 40% of the business, potentially approaching 50% by year-end. Two: these deployments won’t just be chatbots—they’re explicitly talking about agents that move work through systems. Meanwhile, the macro backdrop is turning more serious. Federal Reserve Governor Lisa Cook warned that AI could drive a “generational shift” in the labor market, with displacement potentially arriving before job creation. She also pointed out a tricky policy implication: if productivity rises while unemployment rises for structural reasons, the usual interest-rate playbook may not fix it without igniting inflation. Translation: the transition could look weird in the data, and the tools to respond may be limited. And markets are clearly jumpy. A viral Substack ‘scenario’ about an AI-driven economic shock reportedly helped rattle US stocks, with some investors reacting even though experts cautioned today’s tools may not be capable of the most extreme outcomes described. Whether you buy the scenario or not, the bigger point is that AI narratives now move markets in a way that used to be reserved for earnings and central bank decisions.

New chips and EUV advances

Now for the hardware segment—the part of the show where the numbers start sounding like typos. A company called Taalas unveiled HC1, a ‘model-on-silicon’ inference card that hardwires a single model—Meta’s Llama 3.1 8B—directly into the chip, weights included. The claim is about 17,000 tokens per second per user. They’re arguing the speed comes from removing most programmability and effectively merging storage and computation, leaving on-chip SRAM for KV cache and fine-tuned weights. Public details: TSMC N6 fabrication, an enormous die, PCIe form factor, and around 250 watts. They also acknowledge today’s version leans on aggressive 3–6 bit quantization that can hurt quality, with future iterations aiming to improve fidelity. If the cost-per-token and throughput claims survive broader scrutiny, this points toward a world where “interactive reasoning” becomes normal—because you can simply afford to do more sampling, longer traces, and more retries. In the more traditional semiconductor pipeline, ASML says it hit 1,000 watts of EUV source power under customer-like conditions—up from roughly 600 watts today. The practical outcome is higher wafer throughput, potentially around 330 wafers per hour by 2030 instead of ~220 now. It’s a reminder that even as model architectures evolve, AI progress is still tied to very physical bottlenecks—like how much 13.5-nanometer light you can reliably generate by turning tin droplets into plasma, 100,000 times per second, in a vacuum.
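The quantization trade-off is easy to sanity-check with arithmetic: weight storage for an 8-billion-parameter model shrinks linearly with bit width. The numbers below are pure back-of-envelope illustration; Taalas's actual on-chip SRAM budget is not public.

```python
# Back-of-envelope: why 3-6 bit quantization matters for fitting an
# entire model's weights on a single chip. Illustrative arithmetic
# only; real SRAM budgets and layouts are not public.

def weight_bytes(params: int, bits: int) -> float:
    """Bytes needed to store `params` weights at `bits` bits each."""
    return params * bits / 8

PARAMS = 8_000_000_000  # Llama 3.1 8B, approximately

for bits in (16, 6, 4, 3):
    gb = weight_bytes(PARAMS, bits) / 1e9
    print(f"{bits:>2}-bit weights: {gb:.1f} GB")
```

Going from 16-bit to 3-bit weights cuts storage by more than 5x, which is the kind of reduction that makes merging weights into on-chip memory plausible at all, and also why the company concedes quality can suffer at today's bit widths.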

Open-source tools for agents

Finally, a quick tour of tools and open source—because this is where a lot of “agentic” progress becomes real for developers. Cloudflare engineers say they rebuilt much of the Next.js API surface in under a week—using one engineer plus an AI model—and released it as vinext, an experimental drop-in replacement built on Vite, designed to deploy to Cloudflare Workers with a single command. They’re positioning it as an escape hatch from fragile reverse-engineering approaches like OpenNext, and they’re claiming big directional gains in build speed and client bundle size, with extensive tests and even at least one production deployment. AWS is also leaning into open experimentation with Strands Labs, a separate GitHub org for trying frontier agent techniques without destabilizing the widely used Strands Agents SDK—now reportedly past 14 million downloads. Initial projects focus on robotics integrations, physics-based simulation, and ‘AI functions’ that generate Python functions from intent specs. On the pragmatic side: WorkOS released an official open-source CLI for managing its APIs and environments. And a project called MachineAuth is pitching a self-hosted OAuth 2.0 client-credentials server for machine-to-machine authentication—JWTs, JWKS, scopes, and a ‘zero-database’ JSON-file design. The common thread is clear: as agents proliferate, the boring infrastructure—auth, environment management, reproducible builds—becomes the difference between a demo and a deploy.
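For context on what a machine-to-machine auth server like MachineAuth handles, here is the shape of the OAuth 2.0 client-credentials flow per RFC 6749 section 4.4. The helpers below are our own sketch of the request body and scope semantics, not MachineAuth's API.

```python
# Sketch of the OAuth 2.0 client-credentials grant used for
# machine-to-machine auth (RFC 6749, section 4.4). Helpers are
# illustrative, not MachineAuth's actual interface.

import urllib.parse

def token_request_body(client_id: str, client_secret: str, scope: str) -> str:
    """Form-encoded body a machine client POSTs to the token endpoint."""
    return urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    })

def has_scope(granted: str, required: str) -> bool:
    """OAuth scopes are a space-delimited set; check membership."""
    return required in granted.split()
```

The server's side of the flow is validating those credentials, minting a JWT with the granted scopes, and publishing its signing keys via JWKS so resource servers can verify tokens without a shared database, which is presumably what the 'zero-database' JSON-file design is optimizing for.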

That’s the state of play for February 25th, 2026: benchmarks are being stress-tested, agents are running for days, browsers are inching toward your local apps, and the hardware roadmap is pushing token speeds into a new bracket. As always, links to all stories can be found in the episode notes. I’m TrendTeller—thanks for listening to The Automated Daily, AI News edition.