New Benchmark Humbles AI; Google Shrinks Memory; Reddit Bot ID Check

Today's AI Outlook: 🌤️

New Benchmark Suggests Frontier Models Are Far From AGI

For the past year, AI labs have been happily turning benchmark charts into victory laps. Then ARC-AGI-3 showed up and basically told the whole sector to put the champagne back in the fridge.

The new benchmark from François Chollet’s ARC Prize Foundation is built as an interactive reasoning test where humans can solve 100% of tasks on the first try, but top AI systems are not even sniffing 1%.

That matters because labs had already spent millions training against earlier ARC versions, pushing ARC-AGI-2 scores from roughly 3% to about 50% in less than a year. ARC-AGI-3 is designed to make that kind of brute-force progress much harder. These are game-like scenarios with no instructions, so models have to figure out the rules, form goals, and plan from scratch. In other words, no multiple-choice test-taking hacks, no benchmark cramming, no elegant spreadsheet bragging.

Why It Matters

This is the cleanest reminder in a while that frontier AI is still very uneven at general reasoning. The hype cycle says we are inches from AGI. ARC-AGI-3 says the ruler might be broken. At the same time, history suggests labs will attack this benchmark fast, which means the next six months could become a live stress test for whether progress comes from real reasoning gains or just more expensive optimization.

The Deets

  • Gemini Pro led frontier models with 0.37%
  • GPT 5.4 High scored 0.26%
  • Opus 4.6 scored 0.25%
  • Grok-4.20 landed at 0%
  • The challenge comes with a $1M prize

Key Takeaway

ARC-AGI-3 did not prove AGI is far away forever. It did prove that today’s best models still crack in surprisingly basic-looking settings when they cannot lean on prior patterns, canned instructions, or benchmark familiarity.

🧩 Jargon Buster - Interactive Reasoning Benchmark: A test where the model has to discover the rules and solve the task as it goes, instead of just predicting the most likely next answer from familiar examples.
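The setup above — act, observe, infer the hidden rule, plan toward a goal — is the classic agent-environment loop. A toy sketch of that loop (the GridEnv class and its step() API are invented for illustration; they are not the benchmark's actual interface):

```python
class GridEnv:
    """Toy stand-in for an ARC-AGI-3-style interactive task (hypothetical
    API, invented for illustration). Hidden rule: reach the far corner.
    The agent receives no instructions, only observations."""
    def __init__(self, size=5):
        self.size = size
        self.pos = (0, 0)
        self.target = (size - 1, size - 1)

    def step(self, action):
        dx, dy = {"up": (0, -1), "down": (0, 1),
                  "left": (-1, 0), "right": (1, 0)}[action]
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        return self.pos, self.pos == self.target  # observation, solved?

env = GridEnv()
solved = False
# A real agent would have to *infer* that right/down moves pay off;
# here the moves are scripted just to show the interaction loop.
for action in ["right"] * 4 + ["down"] * 4:
    obs, solved = env.step(action)
```

The hard part ARC-AGI-3 measures is exactly what the scripted policy skips: working out, from raw observations, what the rules and the goal even are.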


🏛️ Power Plays

Reddit Wants To Know If You’re Human

Reddit CEO Steve Huffman laid out a plan to better separate humans from bots on the platform, and notably, he is trying to do it without turning the site into a TSA checkpoint. Approved automated accounts would be labeled [App], suspicious accounts could be asked to verify they are human, and subreddits would keep the power to set their own rules around AI-generated content.

The balancing act here is obvious. Reddit wants to slow bot creep without triggering a user revolt over privacy. So the verification menu starts with passkeys or Sam Altman’s World ID scanner, while government IDs are described as a last resort where laws require them. Huffman also made clear that AI-written content is not banned, which means the moderation burden still falls heavily on communities themselves.

Why It Matters

The internet’s bot problem is no longer a weird corner-case theory. Platforms are now dealing with automated behavior at a scale that threatens trust, moderation, and basic usability. Reddit’s move looks more like triage than cure, but it also signals a larger shift: every major social platform is going to need a durable human-first identity layer.

The Deets

  • Approved automated accounts will get an [App] label
  • Suspicious accounts may face human verification
  • Reddit plans to use passkeys or World ID
  • Government ID is positioned as a last resort
  • The backdrop is ugly: Digg folded after bot issues, and Cloudflare data cited by The Rundown shows automated traffic is on pace to surpass humans by 2027

Key Takeaway

Reddit is trying to thread the needle between authenticity and privacy, but this is the opening chapter of a much larger platform fight over who is real, who is synthetic, and who gets to decide.

🧩 Jargon Buster - Passkey: A passwordless sign-in method tied to your device or biometric login, designed to verify identity more securely than a traditional password.


💰 Funding & Startups

Robot Dreams: Unitree Files For IPO

Unitree has filed for a Shanghai STAR Market IPO, posting about $230M in revenue and roughly $80M in profit while reportedly targeting a valuation near $7B.

On paper, that is the sort of robotics story public markets love. Under the hood, though, AI Secret notes a very large share of that revenue is tied to research labs, government programs, and staged enterprise deployments rather than repeatable commercial demand.

That distinction matters because controlled demos, exhibitions, and pilot programs can make a robotics company look farther along than it really is. The numbers may be real, but the durability of the demand is much less certain if buyers are spending allocated budgets rather than buying because the robots are already generating clear returns.

Why It Matters

Robotics investors have spent years waiting for a breakout commercial story. The risk is that markets price narrative first and recurring economics later. If demand does not graduate from lab and pilot environments into real productivity use cases, those margins could get very fragile very quickly.

The Deets

  • Unitree reported about $230M in revenue
  • It posted roughly $80M in profit
  • The IPO could target a valuation near $7B
  • 73.6% of revenue reportedly comes from research labs, government programs, and staged enterprise deployments
  • More than half of deployments are said to be in controlled environments such as exhibitions, pilot programs, and lab settings
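Taking the reported figures at face value, the revenue split works out roughly like this:

```python
# Rough split of Unitree's reported revenue, using the cited figures:
# ~$230M revenue, with 73.6% tied to research labs, government programs,
# and staged enterprise deployments.
revenue_m = 230
institutional_m = revenue_m * 0.736          # ≈ $169M non-recurring-leaning
commercial_m = revenue_m - institutional_m   # ≈ $61M everything else
```

On those numbers, only about $61M of the $230M looks like the kind of demand public-market investors usually mean by "commercial."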

Key Takeaway

Unitree may be a real robotics contender, but the filing reads less like mass-market proof and more like a reminder that not all revenue is created equal.

🧩 Jargon Buster - Non-Recurring Revenue: Revenue that comes from one-time projects, pilots, or special contracts rather than repeat purchases that reliably continue over time.


🔬 Research & Models

Google’s TurboQuant Shrinks AI Memory Overhead

Google Research introduced TurboQuant, a compression algorithm that shrinks model memory by more than 6x without retraining and with almost no accuracy loss. It also delivered up to 8x speed gains on Nvidia H100 chips. That is the sort of research result that sounds niche until you realize it hits one of the nastiest bottlenecks in modern AI systems: the memory burden that grows as conversations and tasks get longer.

The Rundown reported on how models keep a running log of conversation context, which slows responses and raises costs as that storage balloons. AI Secret translated that into the agent era: long-running agents accumulate working memory across writing, searching, editing, and publishing, with the KV cache consuming 30% to 70% of inference memory on extended tasks. Compressing that without breaking performance is a very big deal.
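To see why a 6x compression of that cache matters, here is a back-of-envelope sketch. The model configuration (32 layers, 32 KV heads, head dimension 128, fp16 storage — roughly a Llama-7B-class setup) is an illustrative assumption, not something from the article:

```python
# Back-of-envelope KV-cache size for an assumed Llama-7B-class config
# running an 8K-token agent session in fp16. All numbers illustrative.
layers, kv_heads, head_dim = 32, 32, 128
seq_len, bytes_per_val = 8192, 2
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val  # keys + values
gib = kv_bytes / 2**30          # 4.0 GiB per sequence, uncompressed
gib_compressed = gib / 6        # ~0.67 GiB at the reported 6x+ ratio
```

Four gigabytes per long-running conversation, per user, is exactly the kind of cost that balloons in the agent era — which is why squeezing it by 6x without retraining is a headline result.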

Why It Matters

This is a direct attack on one of the limits holding back longer-context assistants and multi-step agents. If adopted broadly, it could let AI systems run longer chains, handle deeper workflows, and maintain coherence across bigger tasks without costs rising at the same pace.

The Deets

  • TurboQuant compresses AI model memory by 6x+
  • It does so with zero or near-zero accuracy loss
  • It can speed inference up to 8x on Nvidia H100 chips
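Google has not published TurboQuant's actual algorithm in this writeup, but the general idea behind cache quantization — storing values in fewer bits with a shared scale factor — can be sketched generically:

```python
# Generic symmetric round-trip quantization sketch. This is NOT
# TurboQuant's method, just an illustration of storing cache entries
# in fewer bits (here, the int8 range) with a per-block scale.
def quantize(xs):
    scale = max(abs(x) for x in xs) / 127 or 1.0
    codes = [round(x / scale) for x in xs]   # small integers in [-127, 127]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

vals = [0.12, -1.5, 0.7, 3.2]
codes, scale = quantize(vals)
approx = dequantize(codes, scale)  # close to vals, at a fraction of the storage
```

The engineering challenge, and presumably where TurboQuant earns its "almost no accuracy loss" claim, is keeping that rounding error from compounding across millions of cached values.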

Key Takeaway

TurboQuant is the kind of infrastructure advance that could quietly unlock the next wave of useful agents. It is not flashy, but it attacks a real constraint, and those are the breakthroughs that tend to age best.

🧩 Jargon Buster - KV Cache: The memory store a language model uses to keep track of prior tokens in a conversation or task so it can respond with context instead of starting from scratch each time.


🧰 Tools Of The Day

Lyria 3 Pro - Google’s upgraded AI music model is pushing longer track outputs, which makes it more useful for creators who need something closer to full composition instead of a musical appetizer.

MolmoWeb - Ai2’s open-source web-browsing agent is part of the growing push to make agents actually navigate and act online instead of just talking about it.

Uni-1 - Luma’s unified model aims to reason and generate across text and images, which is where a lot of multimodal product design is headed.

Composer 2 - Cursor’s coding model is pitched as both powerful and cost-effective, which is exactly the combo developers tend to notice before the marketing team does.

Higgsfield - A lightweight creative tool with a practical office-use case today: making branded reaction GIFs without turning your design team into a meme factory.


Today’s Sources: The Internet, The Rundown AI, AI Secret

Jamie Larson