AI scaled

Diffusion models (image + video generation)

Diffusion transformers have democratized high-quality image and video generation, powering consumer-accessible creative tools that produce photorealistic content at scale. Emerging world models, meanwhile, show potential for physics-aware simulation applicable to autonomous systems and robotics.

What to watch next

Monitor the development of true world models that accurately simulate physics over extended time horizons; the emergence of controllable video generation with precise spatial and temporal reasoning; and the integration of diffusion-based video generation into multimodal reasoning models for embodied AI applications.

Key sub-ideas & techniques

Latent diffusion: Running diffusion in a compressed latent space rather than pixel space (Stable Diffusion, 2022) cut compute by 50–100x and made image generation runnable on a laptop GPU, kicking off the consumer wave. [source]
Diffusion transformers (DiT): Replacing the U-Net with a transformer backbone (DiT, Peebles & Xie 2023) gave diffusion scaling behavior similar to that of LLMs; the approach underlies Sora, Stable Diffusion 3, and Veo. [source]
Long-form video generation: Sora, Veo, Runway Gen-4, and Kling moved video diffusion from 4-second clips to coherent minute-long shots with stable identities, camera control, and more plausible physics, opening up real production use. [source]
World models: Video diffusion is being repurposed as a learned simulator. Genie, GWM-1, and Cosmos generate playable, physics-aware environments for robotics and embodied AI rather than passive video. [source]
One-step / real-time generation: Distillation and related few-step techniques (consistency models, adversarial diffusion distillation, flow-matching-based methods) collapsed image generation from 50+ steps to 1–4, enabling real-time interactive generation in design and gaming tools. [source]
World-grounded omni video generation: A single model jointly trained for reasoning and multimodal generation, producing video that is grounded in real-world knowledge from text/image/audio/video inputs. [source]
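
To make the latent-diffusion and DiT entries above concrete, here is a minimal arithmetic sketch. It assumes a Stable Diffusion v1-style autoencoder (8x spatial downsampling, 4 latent channels) and a DiT-style patch size of 2; the function names are illustrative and not taken from any library. It counts values and tokens only, so the ratio it prints is a size comparison, not a measurement of the compute savings quoted above.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Latent (channels, height, width) for an image, assuming an SD-style autoencoder."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsample factor")
    return (latent_channels, height // downsample, width // downsample)


def element_ratio(height, width, image_channels=3, downsample=8, latent_channels=4):
    """How many times fewer values the denoiser sees in latent space than in pixel space."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (image_channels * height * width) / (c * h * w)


def dit_token_count(latent_height, latent_width, patch_size=2):
    """Sequence length after a DiT-style patchify of the latent grid."""
    if latent_height % patch_size or latent_width % patch_size:
        raise ValueError("latent size must be divisible by the patch size")
    return (latent_height // patch_size) * (latent_width // patch_size)


channels, lat_h, lat_w = latent_shape(512, 512)
ratio = element_ratio(512, 512)
tokens = dit_token_count(lat_h, lat_w)
print(f"latent: {channels}x{lat_h}x{lat_w}, {ratio:.0f}x fewer values, {tokens} tokens")
# prints: latent: 4x64x64, 48x fewer values, 1024 tokens
```

Under these assumptions a 512x512 RGB image becomes a 4x64x64 latent, and a transformer backbone would attend over 1024 patch tokens instead of hundreds of thousands of pixels.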

Current frontier

Runway Gen-4.5 (December 2025) demonstrates advanced physics understanding including momentum, weight, fluid dynamics, and object interactions, with native audio generation integrated. [source]
Runway's GWM-1 world model enables frame-by-frame physics simulation with user-defined world parameters, representing a shift from video generation to physics-aware simulation. [source]
Sora 2 (September 2025) was released with improved consistency and control, but OpenAI has announced an April 2026 shutdown and a September 2026 API discontinuation, shifting to GPT-native video generation. [source]
Google Imagen 3 (August 2024) surpasses DALL-E 3 and Stable Diffusion in photorealism and text rendering, achieving better prompt-image alignment across automated and human evaluations. [source]
Midjourney V7 (April 2025, default June 2025) features a 12B-parameter multimodal transformer with 20-30% faster rendering and improved hand/object coherence over V6. [source]
NVIDIA Cosmos 3 (May 31, 2026): open Mixture-of-Transformers world-model family unifying physical reasoning (autoregressive VLM tower) and diffusion-based world/action generation; Nano (16B) and Super (64B) checkpoints, open-source SOTA on R-Bench, leading Artificial Analysis text-to-image and image-to-video. [source]
Fused INT8 Triton kernel realizes native integer compute for diffusion transformers, enabling 1024px single-RTX-3090 generation (~9.5% end-to-end vs FP8; consumer-Ampere only). arXiv 2606.14598. [source]
Ideogram 4.0 (Jun 3, 2026): open-weight 9.3B single-stream diffusion-transformer text-to-image foundation model (commercial license), positioned as closing the proprietary-vs-open image-model gap. [source]
Google's DiffusionGemma (26B-A4B MoE, Apache 2.0, 2026-06-10) is an open-weight discrete-diffusion language model that generates 256-token canvases in parallel for up to 4x faster decoding (1000+ tok/s on one H100) with bidirectional self-correction. [source]
ICML 2026 Outstanding Paper: First-Order Rejection Sampling reaches delta-error in polylog(1/delta) steps from only L2 score estimates (Chen/Chewi/Daskalakis/Rakhlin). [source]
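
The fused INT8 kernel item above is about doing the multiply-accumulate work in integers instead of converting back to floating point first. A minimal sketch of that idea in plain Python follows, assuming symmetric per-tensor quantization. The function names and sample values are illustrative, and this is not the paper's Triton kernel or its actual quantization scheme.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: returns (integer codes in [-127, 127], scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def int8_dot(a_codes, a_scale, b_codes, b_scale):
    """Accumulate the products as integers, then apply both scales once at the end."""
    acc = sum(x * y for x, y in zip(a_codes, b_codes))  # integer accumulation
    return acc * a_scale * b_scale


def float_dot(a, b):
    """Reference dot product in floating point."""
    return sum(x * y for x, y in zip(a, b))


# Made-up sample values, standing in for one row of activations and weights.
activations = [0.5, -1.25, 2.0, 0.75]
weights = [1.0, 0.5, -0.25, 2.0]

a_q, a_s = quantize_int8(activations)
w_q, w_s = quantize_int8(weights)
approx = int8_dot(a_q, a_s, w_q, w_s)
exact = float_dot(activations, weights)
print(f"exact={exact:.4f} int8={approx:.4f}")
```

The design point is that the inner loop touches only integers and the floating-point scales are applied once per output, which is what lets integer hardware units do the bulk of the work. The approximation error depends on the value range and on the quantization granularity chosen.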

Key people

David Holz · Founder and CEO · Midjourney [source]
Emad Mostaque · Founder (formerly CEO) · Stability AI [source]
Oriol Vinyals · VP Research, Gemini Co-lead · Google DeepMind [source]
Yann LeCun · Founder, Advanced Machine Intelligence Labs · AMI Labs; NYU Courant Institute [source]
Tim Brooks · Researcher; former Sora co-lead at OpenAI · Google DeepMind [source]
Runway Research Team · Lead researchers on Gen-4.5 and GWM-1 · Runway [source]

Startups & labs to watch

Stability AI Cascade Model · Stability AI · LAB · Venture-backed; facing financial difficulties since 2024: Stable Cascade achieves 2x faster inference than SDXL with a 42x compression factor, enabling efficient fine-tuning (16x cost reduction) and edge deployment. [source]
Open-Sora 2.0 · HPC-AI Tech · LAB · Academic/open-source; community-driven: Commercial-level video generation trained for only $200k; reduced the performance gap with OpenAI Sora from 4.52% to 0.69%, democratizing video generation. [source]
Runway Media Creation Platform · Runway · STARTUP · Series C, $500M+ valuation estimates: Leading commercial video generation with Gen-4.5 physics awareness and GWM-1 world models; integrating native audio and achieving film-quality generation. [source]
Midjourney Next-Gen Development · Midjourney · STARTUP · Profitable; Series D, $1B+ valuation estimates: V7 complete rebuild with improved coherence and speed; pioneering personalization efficiency (5 minutes vs. 200 images); video generation features under development. [source]
Ideogram · Ideogram AI (Toronto) · STARTUP · $96.5M total over 2 rounds (incl. $80M Series A, Feb 2024, Andreessen Horowitz): Leading text-to-image model (best-in-class text rendering); its open-weight 4.0 release is positioned as closing the gap between proprietary and open image models. [source]