RLHF + reasoning models (test-time compute scaling)
Reasoning models that scale test-time compute have shifted a large share of AI capability from training time to inference time. This lets smaller models solve previously intractable problems in mathematics, coding, and science at cost-competitive inference prices.
What to watch next
Watch for three things: scaling laws for reasoning that hold beyond current limits; the emergence of multi-step verifiable reasoning that enables formal proof verification; and the development of interpretable chain-of-thought that distinguishes genuine reasoning from pattern matching in reasoning model outputs.
Key sub-ideas & techniques
- Test-time compute scaling — Pioneered by OpenAI's o1 and o3, this approach scales inference-time tokens (often 10–100x more thinking tokens) and yields steeper accuracy gains than equivalent training-time scaling on math, code, and science benchmarks. [source]
- Pure-RL reasoning (R1-Zero recipe) — DeepSeek-R1-Zero showed that reasoning behaviors emerge from pure reinforcement learning without any supervised fine-tuning, just by rewarding correct final answers, collapsing what was thought to be a complex pipeline. [source]
- RLHF → RLAIF → Constitutional AI — Human preference data, the bottleneck of early RLHF, has been progressively replaced by AI-generated feedback against written principles (Anthropic's Constitutional AI) and verifiable reward functions (RLVR). [source]
- Reasoning distillation — Long chains of thought from large reasoning models can be distilled into much smaller students (DeepSeek-R1 → Qwen-7B), bringing o1-class behavior into the open-weight ecosystem at a tiny fraction of the cost. [source]
- Agentic reasoning + tool use — Reasoning models now interleave planning steps with tool calls (browsers, code execution, retrieval), turning a chat model into a research assistant that runs a multi-step plan against real systems. [source]
- Step-wise rubric rewards (SRaR) — RLVR variant that decomposes rubric-based supervision onto individual reasoning steps and normalizes across rollouts to avoid uniform credit assignment and reward hacking. [source]
- Token-level credit assignment for RLVR — Per-token discriminative credit signal layered on top of outcome-verifiable RL, giving denser gradient signal than response-level rewards. [source]
- Equilibrium Reasoners (latent-attractor TTC) — Iterative latent-attractor reasoning that enables deep test-time-compute scaling without external verifiers; pushes Sudoku-Extreme accuracy from 2.6% to 99%+ when unrolled to the equivalent of ~40k layers. [source]
- Long-horizon autonomous agentic reasoning — Evaluating reasoning models by their ability to run unattended for many hours through plan-execute-diagnose-iterate loops on real research and systems tasks, measured by submissions/tool-calls/commits and end-state performance rather than single-pass accuracy. [source]
- Autonomous scientific hypothesis generation — Frontier reasoning models running long autonomous research loops (protein design, genomics) that propose and validate novel hypotheses at or above expert level — test-time/agentic compute applied to open-ended science. [source]
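The test-time compute idea in the list above can be illustrated with self-consistency (majority voting over several sampled answers). This is only one simple way to spend extra inference compute and is not claimed to be what o1 or o3 do internally. The sketch below uses a hypothetical `noisy_solver` as a stand-in for one sampled reasoning chain; its 40% per-sample accuracy, the five possible wrong answers, and the trial count are all made-up assumptions.

```python
import random
from collections import Counter

def noisy_solver(rng, correct_answer, p_correct=0.4, n_wrong=5):
    """Hypothetical stand-in for one sampled reasoning chain.

    Returns the correct answer with probability p_correct, otherwise one of
    n_wrong distinct wrong answers chosen uniformly.
    """
    if rng.random() < p_correct:
        return correct_answer
    return correct_answer + rng.randint(1, n_wrong)

def majority_vote(answers):
    """Return the most frequent answer among the samples."""
    return Counter(answers).most_common(1)[0][0]

def accuracy_at_budget(n_samples, trials=2000, seed=0):
    """Estimate accuracy when n_samples chains are drawn per problem."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        answers = [noisy_solver(rng, 42) for _ in range(n_samples)]
        hits += majority_vote(answers) == 42
    return hits / trials

# More samples per problem means more inference compute.
for n in (1, 4, 16, 64):
    print(f"samples={n:3d}  accuracy={accuracy_at_budget(n):.3f}")
```

Because wrong answers are spread over several values while correct answers agree, voting accuracy rises with the sample budget even though each individual chain is unreliable.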
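The verifiable-reward and credit-assignment bullets above can be made concrete with a minimal sketch: an outcome reward that only checks the final answer, followed by normalization across a group of rollouts in the style of group-relative methods such as GRPO. The `Answer: <int>` output format, the function names, and the example rollouts are assumptions for illustration, not any lab's actual pipeline.

```python
import re
import statistics

# Assumed output convention: the completion ends with "Answer: <integer>".
ANSWER_RE = re.compile(r"Answer:\s*(-?\d+)\s*$")

def verifiable_reward(completion: str, gold: int) -> float:
    """Return 1.0 if the final 'Answer: <int>' matches gold, else 0.0."""
    match = ANSWER_RE.search(completion.strip())
    return 1.0 if match and int(match.group(1)) == gold else 0.0

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards across rollouts of the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical rollouts for the prompt "What is 3 * 4 + 5?"
rollouts = [
    "3 * 4 = 12, plus 5 is 17.\nAnswer: 17",
    "3 + 4 = 7, times 5 is 35.\nAnswer: 35",
    "12 + 5 = 17.\nAnswer: 17",
    "I am not sure.",
]
rewards = [verifiable_reward(c, gold=17) for c in rollouts]
advantages = group_advantages(rewards)

for reward, adv in zip(rewards, advantages):
    print(f"reward={reward:.0f}  advantage={adv:+.3f}")
```

In this response-level form every token in a rollout inherits the same advantage. That uniform credit assignment is what the step-wise and token-level variants in the list aim to refine.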
Current frontier
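- DeepSeek-R1-Zero demonstrates that pure reinforcement learning without supervised fine-tuning can elicit strong reasoning; the full DeepSeek-R1 pipeline, which adds a small cold-start supervised stage before RL, reaches performance comparable to o1 and is published under the MIT license with visible chain-of-thought reasoning. [source]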
- Test-time compute scaling laws show that allocating 2-10x more inference tokens to reasoning lets a smaller model match or exceed the accuracy of larger models, which can be more compute-efficient than scaling parameters. [source]
- OpenAI o3 achieves 45.1% on the ARC-AGI benchmark using test-time compute scaling and chain-of-thought reasoning. [source]
- As of 2025, RLHF methods employ verifiable rewards via direct comparison scoring (RLVR) and online iterative feedback collection, achieving state-of-the-art results on the AlpacaEval-2 and Arena-Hard benchmarks. [source]
- Anthropic's Constitutional AI systems with classifier-based jailbreak detection withstood 3,000+ hours of expert red-teaming, with minimal universal jailbreaks discovered. [source]
- Sparse policy selection at entropy-gated decision points matches or exceeds full-RL reasoning at ~1000x lower training cost (ReasonMaxxer, arXiv 2605.06241). [source]
- RL compute for long-horizon reasoning follows a power law in reasoning depth; the scaling exponent increases monotonically with logical expressiveness, from 1.04 to 2.60 (ScaleLogic, arXiv 2605.06638). [source]
- Search-driven reward optimization improves reasoning on GSM8K without retraining the base model (arXiv 2605.02073). [source]
- TraceLift multiplies rubric-based Reasoning Reward Model scores by measured uplift on frozen executors, crediting reasoning traces that are useful to downstream consumers (arXiv 2605.03862). [source]
- May 7, 2026: Anthropic publishes Natural Language Autoencoders (NLAs), an RL-trained verbalizer/reconstructor pair that turns activations into readable explanations. NLAs lift hidden-motivation auditing from <3% to 12-15% and expose evaluation-awareness on 16-26% of test transcripts versus <1% of real usage; they were used in the Mythos Preview and Opus 4.6 alignment audits. [source]
- Self-supervised world-model learning now has a machine-checkable (Lean 4) identifiability guarantee: LeJEPA provably recovers a world's true latent variables iff the latents are Gaussian and evolve under stationary additive-noise dynamics (arXiv 2605.26379, May 2026). [source]
- Pairing LLMs with the Lean 4 proof assistant turns AI math/code outputs from plausible-but-unverified into machine-checked-correct; AxiomProver autonomously generated Lean/Mathlib-verified proofs (arXiv, Feb 2026). Axiom Math raised a $200M Series A led by Menlo Ventures. [source]
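The power-law claim in the list above (RL compute growing as a power of reasoning depth, with exponents between 1.04 and 2.60) can be read as C(d) = a * d^b. A minimal sketch of how such an exponent is estimated from measurements follows, using a log-log least-squares fit. The depths and compute values are synthetic and noise-free, and the prefactor a = 1 is an assumption; only the two exponents come from the text.

```python
import math

def fit_power_law(depths, compute):
    """Fit log C = log a + b * log d by least squares; return (a, b)."""
    xs = [math.log(d) for d in depths]
    ys = [math.log(c) for c in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    b = cov_xy / var_x
    a = math.exp(mean_y - b * mean_x)
    return a, b

depths = [2, 4, 8, 16, 32]  # hypothetical reasoning depths
for true_b in (1.04, 2.60):  # exponent endpoints quoted in the text
    compute = [d ** true_b for d in depths]  # synthetic compute, a = 1
    a, b = fit_power_law(depths, compute)
    growth = compute[-1] / compute[0]
    print(f"fitted exponent={b:.2f}  compute growth from depth 2 to 32: {growth:.0f}x")
```

The gap between the two exponents matters in practice: over the same 16x increase in depth, the higher exponent implies a far larger increase in required compute than the lower one.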
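To show what "machine-checked" means in the Lean 4 items above, here is a deliberately tiny example using only the Lean 4 core library (no Mathlib). The theorem names are hypothetical and it is unrelated to the proofs produced by the systems mentioned; it only illustrates that a statement is accepted when the kernel can check its proof and rejected otherwise.

```lean
-- Accepted: the proof term is checked by the Lean kernel.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Accepted: holds by unfolding the definition of addition.
theorem add_zero_example (n : Nat) : n + 0 = n :=
  rfl

-- A false claim fails to compile instead of yielding a plausible-looking answer.
-- Uncommenting the next line produces a type error:
-- theorem bogus (n : Nat) : n + 1 = n := rfl
```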
Key people
- Richard Sutton · Professor, University of Alberta; Senior Scientist, DeepMind · University of Alberta / DeepMind [source]
- Andrew Barto · Emeritus Professor · University of Massachusetts Amherst [source]
- Ilya Sutskever · CEO · Safe Superintelligence Inc. [source]
- Dario Amodei · CEO · Anthropic [source]
- Demis Hassabis · CEO · Google DeepMind [source]
- DeepSeek Research Team · Lead researchers on DeepSeek-R1 · DeepSeek-AI [source]
Startups & labs to watch
- xAI Grok 5 · xAI / SpaceX · STARTUP · Private; xAI acquired by SpaceX in February 2026 — Grok 4 (July 2025) implements reasoning similar to o3-mini and DeepSeek-R1; Grok 5 is under development, with AGI-level claims targeted for the 2025-2026 timeframe. [source]
- DeepSeek Research · DeepSeek-AI · STARTUP · Private; backed by Chinese investors — Leads open-source reasoning model work with DeepSeek-R1 (MIT license); R1-distilled Qwen models achieve near-parity with proprietary reasoning models. [source]
- Anthropic Scaling Team · Anthropic · LAB · Private; $380B company valuation (Feb 2026) — Leads research on constitutional AI, verifiable rewards, and test-time compute for Claude; achieves frontier performance on reasoning benchmarks. [source]
- OpenAI o-series Development · OpenAI · LAB · Private; multi-billion funding rounds — Pioneered test-time compute scaling with o1 and o3; o3 sets ARC-AGI benchmark records and defines the frontier of inference-time reasoning. [source]