AI
scaled
Diffusion models (image + video generation)
Diffusion transformers have made high-quality image and video generation widely accessible, powering consumer creative tools that generate photorealistic content at scale. Emerging world models also show potential for physics-aware simulation in autonomous systems and robotics.
What to watch next
Monitor the development of true world models that accurately simulate physics over extended time horizons, the emergence of controllable video generation with precise spatial and temporal reasoning, and the integration of diffusion-based video generation into multimodal reasoning models for embodied AI applications.
Key sub-ideas & techniques
- Latent diffusion — Running diffusion in a compressed latent space rather than pixel space (Stable Diffusion, 2022) cut compute by roughly 50–100x and made image generation runnable on a laptop GPU, kicking off the consumer wave. [source]
- Diffusion transformers (DiT) — Replacing the U-Net with a transformer backbone (DiT, Peebles & Xie 2023) gave diffusion models scaling behavior similar to that of LLMs, and it is the architecture reported to underlie Sora, Stable Diffusion 3, and Veo. [source]
- Long-form video generation — Sora, Veo, Runway Gen-4, and Kling moved video diffusion from 4-second clips to coherent minute-long shots with stable identities, camera control, and physics, opening up real production use. [source]
- World models — Video diffusion is being repurposed as a learned simulator — Genie, GWM-1, and Cosmos generate playable, physics-aware environments for robotics and embodied AI rather than passive video. [source]
- One-step / real-time generation — Few-step techniques (consistency models, adversarial diffusion distillation, and rectified-flow-style straightening of flow-matching models) cut image sampling from 50+ steps to 1–4, enabling real-time interactive generation in design and gaming tools. [source]
- World-grounded omni video generation — A single model jointly trained for reasoning and multimodal generation, producing video grounded in real-world knowledge from text, image, audio, and video inputs. [source]
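To make the latent diffusion and DiT bullets above concrete, the sketch below traces array shapes only. It assumes a 512×512 RGB image, an autoencoder that shrinks each side by 8 into a 4-channel latent (the configuration used by Stable Diffusion v1), and a DiT-style patch size of 2. The function names are hypothetical, and random numbers stand in for a real latent; no actual autoencoder or transformer is involved.

```python
import numpy as np

def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the latent grid for an image, assuming an autoencoder
    that shrinks each spatial side by `downsample` (assumed value: 8)."""
    return (latent_channels, height // downsample, width // downsample)

def patchify(latent, patch=2):
    """Cut a (C, H, W) latent into non-overlapping patch x patch squares
    and flatten each square into one token, as a DiT-style backbone does
    before its transformer blocks.
    Returns an array of shape (num_tokens, patch * patch * C)."""
    c, h, w = latent.shape
    assert h % patch == 0 and w % patch == 0, "latent must tile evenly"
    x = latent.reshape(c, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 3, 2, 4, 0)  # (H/p, W/p, p, p, C)
    return x.reshape((h // patch) * (w // patch), patch * patch * c)

pixel_values = 512 * 512 * 3                  # values in the RGB image
shape = latent_shape(512, 512)
latent_values = int(np.prod(shape))           # values in the latent
latent = np.random.default_rng(0).standard_normal(shape)  # stand-in latent
tokens = patchify(latent)

print(shape)                          # (4, 64, 64)
print(pixel_values / latent_values)   # 48.0
print(tokens.shape)                   # (1024, 16)
```

The 48x figure counts stored values, not compute; actual savings depend on the networks involved. The token count (1024 here) is what the transformer's attention cost scales with, which is why both the downsampling factor and the patch size matter for speed.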
Current frontier
- Runway Gen-4.5 (December 2025) demonstrates advanced physics understanding including momentum, weight, fluid dynamics, and object interactions, with native audio generation integrated. [source]
- Runway's GWM-1 world model enables frame-by-frame physics simulation with user-defined world parameters, representing a shift from video generation to physics-aware simulation. [source]
- Sora 2 (September 2025) was released with improved consistency and control, but OpenAI announced an April 2026 shutdown and a September 2026 API discontinuation, shifting to GPT-native video generation. [source]
- Google Imagen 3 (August 2024) surpasses DALL-E 3 and Stable Diffusion in photorealism and text rendering, achieving better prompt-image alignment across automated and human evaluations. [source]
- Midjourney V7 (April 2025, default June 2025) features a 12B-parameter multimodal transformer with 20–30% faster rendering and improved hand/object coherence over V6. [source]
- NVIDIA Cosmos 3 (May 31, 2026): open Mixture-of-Transformers world-model family unifying physical reasoning (autoregressive VLM tower) and diffusion-based world/action generation; Nano (16B) and Super (64B) checkpoints, open-source SOTA on R-Bench, leading Artificial Analysis text-to-image and image-to-video. [source]
- A fused INT8 Triton kernel provides native integer compute for diffusion transformers, enabling 1024px generation on a single RTX 3090 (~9.5% end-to-end vs FP8; consumer-Ampere only). arXiv 2606.14598. [source]
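The INT8 item above rests on integer matrix multiplication. The NumPy sketch below is a generic illustration of symmetric per-tensor INT8 quantization with int32 accumulation. It is not the fused Triton kernel from the cited work; the function names and matrix sizes are assumptions chosen for illustration.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: map floats to int8 values in
    [-127, 127] using a single scale. Returns (int8 array, float scale)."""
    scale = float(np.abs(x).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero input; any scale works
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a, b):
    """Quantize both operands, multiply in integer arithmetic with int32
    accumulation, then rescale the result back to float32."""
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc.astype(np.float32) * (sa * sb)

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 128)).astype(np.float32)
b = rng.standard_normal((128, 32)).astype(np.float32)

exact = a @ b
approx = int8_matmul(a, b)
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
print(f"relative error: {rel_err:.4f}")
```

A production kernel would typically use finer-grained scales (per channel or per block) and fuse quantization, the matrix multiply, and rescaling into one GPU pass; the per-tensor version here only shows the arithmetic.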
Key people
- David Holz · Founder and CEO · Midjourney [source]
- Emad Mostaque · Founder (formerly CEO) · Stability AI [source]
- Oriol Vinyals · VP Research, Gemini Co-lead · Google DeepMind [source]
- Yann LeCun · Founder, Advanced Machine Intelligence Labs · AMI Labs; NYU Courant Institute [source]
- Tim Brooks · Researcher (formerly Sora team at OpenAI) · Google DeepMind [source]
- Runway Research Team · Lead researchers on Gen-4.5 and GWM-1 · Runway [source]
Startups & labs to watch
- Stability AI Cascade Model · Stability AI · LAB · Venture-backed; facing financial difficulties since 2024 — Stable Cascade achieves 2x faster inference than SDXL with a 42x compression factor, enabling efficient fine-tuning (16x cost reduction) and edge deployment. [source]
- Open-Sora 2.0 · HPC-AI Tech · LAB · Academic/open-source; community-driven — Commercial-level video generation trained for only $200k; reduced the performance gap with OpenAI Sora from 4.52% to 0.69%, democratizing video generation. [source]
- Runway Media Creation Platform · Runway · STARTUP · Series C, $500M+ valuation estimates — Leading commercial video generation with Gen-4.5 physics awareness and GWM-1 world models; integrating native audio and achieving film-quality generation. [source]
- Midjourney Next-Gen Development · Midjourney · STARTUP · Profitable; self-funded (no venture rounds), $1B+ valuation estimates — V7 complete rebuild with improved coherence and speed; pioneering personalization efficiency (5 minutes vs. 200 images); video generation features under development. [source]