Feed

by Lex
07.08.2025

Genie 3 has taken the AI world by storm in recent days, stunning everyone with its uncanny consistency and real-time interactivity in generating virtual worlds. The Machine Learning Street Talk podcast featured an exclusive interview with DeepMind’s Shlomi Fruchter and Jack Parker-Holder, offering a rare behind-the-scenes look at Genie 3.

Right from the start, the host declared it “the most mind-blowing AI technology” he’d ever seen. Genie 3 can generate explorable and interactive virtual worlds from a text prompt in just three seconds, maintaining persistent objects and events, and even triggering world events on demand (like having a deer suddenly run into the scene). While the team remains tight-lipped about the underlying architecture, they confirmed that Genie 3 relies on frame-by-frame autoregressive generation—each frame depends on the historical state, ensuring world coherence and interactivity.
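
To make that frame-by-frame loop concrete, here is a minimal sketch of what an autoregressive world-generation loop could look like. DeepMind has not published Genie 3’s architecture or API, so `world_model.next_frame`, `get_user_action`, and everything else here are hypothetical stand-ins; the point is only that each new frame conditions on the prompt, the full generation history, and the latest user input.

```python
# Minimal sketch of frame-by-frame autoregressive world generation.
# `world_model` and `get_user_action` are hypothetical stand-ins;
# Genie 3's real interface is not public.

def run_interactive_session(world_model, prompt, get_user_action, num_frames=24 * 60):
    history = []                        # all previously generated frame tokens
    for t in range(num_frames):
        action = get_user_action(t)     # e.g. "move forward", or a promptable world event
        frame = world_model.next_frame(
            prompt=prompt,
            past_frames=history,        # each frame conditions on the full history
            action=action,
        )
        history.append(frame)           # persistence comes from re-reading this history
        yield frame                     # streamed immediately, enabling real-time play
```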

The evolution of the Genie series is clear:

  • Genie 1: Trained on 2D platformer game recordings, the model learned “parallax effects” and could discover high-level action semantics like “jump” and “move left” without supervision.
  • Genie 2: Upgraded to 3D, with resolution bumped to 360p, it could simulate lighting, smoke, water flow, and other physical phenomena, with enhanced memory and object permanence.
  • Genie 3: Leaps to 720p, supports several minutes of real-time interaction, and switches input from images to text prompts, vastly increasing flexibility.

The team particularly emphasized that Genie 3’s “emergent consistency” is not due to explicit 3D modeling (like NeRF), but rather a capability that the neural network acquires spontaneously through large-scale training. With just a single prompt, the model can generate a complete, explorable, interactive environment.

In terms of application prospects, DeepMind positions Genie 3 as the ideal “experiential training ground” for AI robotics. It provides high-fidelity, dynamically controllable simulated environments, dramatically reducing the cost and risk of large-scale training, and accelerating “sim-to-real” transfer. Users can inject rare or extreme events via prompts, making AI more robust and generalizable. As I noted in my earlier report, “Shift Toward the Experience Era,” one of the key weaknesses of large language models is their limited spatiotemporal perception. A world model like Genie 3 both compensates for that gap and offers LLMs an experimental sandbox. The alternating and mutually reinforcing progress of these two technologies could ultimately pave the way for well-rounded general intelligence.
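
As a rough illustration of that idea, the sketch below shows how a promptable world model could sit inside an ordinary data-collection loop for a robot policy, with rare events injected on demand. All of the interfaces here (`create_world`, `trigger_event`, `policy.act`) are hypothetical, not DeepMind’s actual tooling.

```python
# Hedged sketch: a promptable world model as a training ground for a robot
# policy, with rare events injected via text. All interfaces are hypothetical.
import random

RARE_EVENTS = ["a deer runs across the road", "sudden heavy rain", "a pallet tips over"]

def collect_episode(world_model, policy, prompt, steps=500, event_prob=0.05):
    env = world_model.create_world(prompt)            # text prompt -> explorable world
    obs = env.reset()
    trajectory = []
    for _ in range(steps):
        if random.random() < event_prob:              # inject a rare or extreme event
            env.trigger_event(random.choice(RARE_EVENTS))
        action = policy.act(obs)
        next_obs, reward = env.step(action)           # failures are cheap in simulation
        trajectory.append((obs, action, reward, next_obs))
        obs = next_obs
    return trajectory                                 # feeds sim-to-real training later
```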

Of course, the team is candid about Genie 3’s current limitations: it does not yet support multi-agent collaboration, its physics and perception still fall short of fully replicating human embodied experience, it demands enormous compute and system stability, and its creativity is still highly dependent on prompts. In the future, Genie 3 is likely to co-evolve with LLMs and reasoning models, moving ever closer to “physics-grade” agent simulation and behavior generation.

People might wonder what technology Genie shares in common with video generation models like Sora or Veo, and what sets them apart. Here’s a concise technical summary based on current public information:

Shared technical foundations

  • Both compress raw videos into discrete spatiotemporal tokens and condition generation on text prompts.
  • Both are trained with objectives that include reconstruction loss and temporal consistency, ensuring coherent appearance and motion across frames (a toy version is sketched after this list).
  • Both are trained on hundreds of thousands of hours of internet-scale video data, covering diverse scenes and viewpoints.
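
As a toy version of the reconstruction and temporal-consistency objectives mentioned above, a combined loss over decoded frames could look like the following. Neither lab has published its exact losses, so this is only a schematic stand-in.

```python
# Toy stand-in for the shared training signals: reconstruction plus a
# temporal-consistency term that penalizes flicker between adjacent frames.
import numpy as np

def video_loss(decoded, target, w_temporal=0.1):
    """decoded, target: arrays of shape (time, height, width, channels)."""
    recon = np.mean((decoded - target) ** 2)              # frame reconstruction
    d_dec = decoded[1:] - decoded[:-1]                    # predicted frame-to-frame change
    d_tgt = target[1:] - target[:-1]                      # true frame-to-frame change
    temporal = np.mean((d_dec - d_tgt) ** 2)              # motion should match, not flicker
    return recon + w_temporal * temporal
```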

Key technical differences

  • Token design
    • Genie 3: Uses spatiotemporal VQ-VAE for low-bitrate tokens, and also learns 8-dimensional latent-action tokens that explicitly encode control commands like “move forward” or “turn.”
    • Veo: Employs latent diffusion models (VDM/Stable-Video style) with continuous high-dimensional latent patches, and has no explicit action channel.
  • Architecture & training objective
    • Genie 3: Decoder-only spatiotemporal Transformer trained with teacher-forcing to predict the next frame’s tokens (NLL + KL). Every step, it autoregressively reads all past tokens and actions.
    • Veo: UNet + cross-attention diffusion network that denoises and reconstructs the entire latent sequence over a series of steps. Training is not conditioned on actions; the model simply learns to reverse the noising process.
  • Generation loop
    • Genie 3: Frame-by-frame stepping at 24 fps in real time; each frame can incorporate user actions or text events, recalculating attention on the fly.
    • Veo: Clip-level diffusion (typically 8–16 steps) generates an entire video in one batch; no real-time control can be injected during generation.
  • Physical/interactive effects
    • Genie 3: Thanks to action tokens and recursive history, can maintain several minutes of interactive consistency and implicitly learn dynamics like friction and gravity.
    • Veo: Prioritizes 4K-level visual fidelity and long-range cinematic consistency, but lacks controllable physical causality; its output is more like a linear movie segment.

In short, Genie 3 is an interactive world model built on autoregressive Transformers and action tokens, while Veo and similar models represent offline video generation using diffusion UNets and purely visual tokens. They differ fundamentally in token structure, training targets, inference flow, and the degree of interactivity.
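
To make the difference in generation loops concrete, a Veo-style clip-level sampler looks roughly like the sketch below, in contrast with the per-frame loop sketched earlier in this post. The `denoiser` callable is a hypothetical stand-in for a latent video diffusion model; the structural point is that the whole clip is refined together, leaving no step at which a user action could be injected.

```python
# Hedged sketch of a clip-level diffusion sampler (Veo-style). The
# `denoiser(latents, step, text_embedding)` callable is hypothetical.
import numpy as np

def generate_clip(denoiser, text_embedding, latent_shape, num_steps=16):
    latents = np.random.randn(*latent_shape)      # start the whole clip from pure noise
    for step in reversed(range(num_steps)):       # e.g. 8-16 denoising steps
        latents = denoiser(latents, step, text_embedding)   # refine the entire clip at once
    return latents                                # decoded to pixels after the loop ends
```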

These are some good reading materials in case you are interested in more technical details:

by Lex
05.08.2025

We’ve long assumed that LLMs are “frozen” after training—if you want them to learn something new, you need to fine-tune. But Dario Amodei recently hinted that Anthropic is experimenting with models that can “learn” inside 100M-token contexts, without ever updating weights. Sounded mysterious—until Google’s new paper “Learning without training” seemed to pull back the curtain.

Here’s the core trick:

1. Self-attention extracts a “delta.”
Feed a query x alone and you get activation A(x). Feed x with context C and you get A(C,x). The difference ΔA = A(C,x) – A(x) is a compact summary of what the context “teaches” the model.

2. MLP turns this delta into a rank-1 sticky note.
When ΔA passes through the MLP, the math is exactly equivalent to adding a rank-1 update ΔW = (W·ΔA)·A(x)ᵀ / ||A(x)||² to the weight matrix W. For a 4,096×4,096 MLP (≈17M numbers), this update only needs two 4,096-dim vectors (≈8k numbers)—a 2,000× compression.

3. Temporary adaptation, no trace left.
During inference, using {W + ΔW, x} gives the same result as {W, [C,x]}. The “sticky note” ΔW is discarded after the forward pass. The base model never changes, but for one moment, it acts as if fine-tuned for your context.
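
The identity behind steps 1–3 is easy to check numerically. The snippet below uses random vectors as stand-ins for the real attention activations A(x) and A(C,x) (in the paper these come out of a transformer’s self-attention layer) and verifies that the rank-1 “sticky note” reproduces the contextual output exactly.

```python
# Numerical check of the rank-1 "sticky note" identity, with random vectors
# standing in for the attention activations A(x) and A(C,x).
import numpy as np

rng = np.random.default_rng(0)
d = 4096
W = rng.standard_normal((d, d))        # MLP weight matrix
A_x = rng.standard_normal(d)           # activation for the query alone
A_Cx = rng.standard_normal(d)          # activation for context + query

delta_A = A_Cx - A_x                                        # step 1: the context "delta"
delta_W = np.outer(W @ delta_A, A_x) / (A_x @ A_x)          # step 2: rank-1 update

# Step 3: {W + delta_W, x} matches {W, [C, x]} at this layer, to float precision.
assert np.allclose((W + delta_W) @ A_x, W @ A_Cx)

# Storage: delta_W is fully determined by two d-dim vectors (W @ delta_A and A_x),
# i.e. ~8k numbers instead of the ~17M entries of a full 4,096 x 4,096 update.
```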

This looks quite similar to LoRA, but with some key differences: LoRA injects low-rank adapters—learned and stored offline—while Google’s “sticky note” is computed at runtime, costs zero extra storage, and vanishes instantly. LoRA is like a notebook you carry; the sticky note is a Post-it you jot down and toss.

Why does this matter anyway? If context can imprint such efficient updates, scaling context windows may outpace continual fine-tuning. Combine this with retrieval, external memory, or LoRA-style caches, and you get a spectrum of “learning without training.” The open challenge is how to chain these ephemeral updates into true long-term adaptation. As Dario and Altman push toward $100k/month models with 100M-token windows, the future may belong to those who can stack and orchestrate these tricks—unless open research keeps the playing field level.

Paper definitely worth reading: arxiv.org/abs/2507.16003

by Pondering Durian
01.08.2025

I just came across “The Edge of Automation”, a Substack written by Joe Ryu. It’s a publication dedicated to Robotics, and the level of detail is fantastic.

For example, “The Robotics Threshold: China’s Rise, America’s Reckoning” is an absolute bible on the topic.

Top 10 Takeaways courtesy of o3 below, but the entire piece is worth reading for anyone keen on Robotics or US vs China dynamics:

Bullet Summary

  1. Automation as a civilizational threshold
    True robotics (general-purpose mobile manipulation + physical AI) ends the scarcity of human physical labor, resetting productivity, supply chains, and power structures. Nations that cross this threshold gain autonomy; laggards face structural dependency.
  2. Binary future: a few winners, many dependents
    Because the productivity gap from full automation is vast, countries won’t cluster in the middle. Control over automated production becomes the new determinant of sovereignty and geopolitical weight.
  3. China’s head start is systemic, not incidental
    China couples massive deployment with state-guided standardization, deep component supply chains, and a huge bench of robotics firms. Real-world feedback loops and cost-down scale give it accelerating returns across hardware, software, and operations.
  4. State–civil harmony as a force multiplier
    Technocrats, long-horizon policy, resource diplomacy, and coordinated execution (from MIIT guidance to national testbeds) knit hardware, AI, logistics, and maintenance into a cohesive, rapidly compounding ecosystem.
  5. Ecosystem depth beats point excellence
    China’s edge spans the full stack—sensors, actuators, batteries, AI models, standards, factories, service networks, and fleet management—so breakthroughs diffuse quickly and costs fall. The U.S. has world-class labs but a thin, fragmented industrial base.
  6. America’s core weaknesses are structural
    Decades of deindustrialization, brittle logistics, skills shortages, missing post-production support (spares, service, fleet ops), and premature consolidation push U.S. firms toward expensive vertical integration still dependent on foreign components.
  7. Physical AI is the bridge to AGI—and data gravity favors deployers
    Frontier capability will come from learning in the real world. Dense robotic deployment generates the multimodal interaction data needed for generalization; whoever operates the largest fleets compounds AI advantage fastest.
  8. The window is short and the change is non-linear
    A first wave of limited robots will be followed quickly by a second wave with major cost and capability breakouts. Decisions in the next 8–10 years will lock in leadership tiers for decades.
  9. A pragmatic U.S. path: three stages
    (1) Build leverage at home—focus on durable hardware, unified software/agentic ops, standards, shared assets, and specialized services.
    (2) Use that leverage to automate allied supply chains and form a U.S.-anchored regional bloc (Europe + East Asia + parts of Indochina), while keeping core robotics IP and assembly domestic.
    (3) Reshore and integrate at scale—renew infrastructure, achieve Level-5 autonomous systems, and reach sustainable supply-chain sovereignty.
  10. Strategic north star: sovereign automation to preserve Western prosperity
    The goal isn’t incremental competitiveness; it’s an autonomous, self-evolving industrial base that restores productivity leadership and prevents “high-tech vassalage.” This demands technocratic coordination, hard tradeoffs, and urgency equal to the stakes.
