Feed

by Lex
08.08.2025

In his article "How AI Conquered the US Economy—and What Happens Next," Derek Thompson uses a historical yardstick to place the AI investment boom within the panorama of U.S. economic development, which puts these figures in perspective. In Q2 and Q3 of 2025, tech giants' capital expenditure on AI data centers ran at nearly $100 billion per quarter, putting full-year spending on track for roughly $400 billion and accounting for approximately 1.36% of U.S. GDP over the same period. That figure not only exceeds the U.S. government's annual budget for education, employment, and social services; it is comparable to Europe's entire defense spending.

Thompson emphasizes that AI-related capital expenditure approaches 2% of U.S. GDP. While the investment intensity has not yet reached the peak of the 19th-century railroad boom (in terms of GDP proportion, it's currently about 20% of the railroad investment peak), it has far surpassed the historical levels of the internet bubble and telecommunications investment.

More remarkably, in the first half of 2025 AI capital expenditure's contribution to GDP growth surpassed that of consumer spending, the traditional engine, for the first time, making it the biggest driver of economic expansion this year.

These funds come primarily from tech giants' free cash flow, capital-market financing, and investors' high expectations for the AI sector. Microsoft, Amazon, Google, Meta, and their peers have bet almost all of their profits and financing on data centers and AI compute expansion, concentrating capital flows into a handful of sectors. Meanwhile, traditional manufacturing and consumer-oriented startups face a chilly financing environment, and the economy's "diversified engines" are giving way to a "computing power monopoly."

Do the returns justify these investments? Not yet. AI hardware suppliers like NVIDIA have roughly doubled revenue and profits in 2025, sending related companies' market values soaring. Gains in the S&P 500 and NASDAQ are driven almost entirely by the "Magnificent Seven" AI giants, while traditional consumer and manufacturing sectors remain weak. This revenue structure, however, depends heavily on continued growth in demand for AI compute. If application-layer innovation or demand expectations cool, the related sectors could face sharp corrections in both revenue and market value.

The employment landscape shows a similar structural skew. In May and June 2025, U.S. non-farm payrolls grew by only 19K and 14K respectively, both after significant downward revisions, and July added just 73K jobs, of which healthcare contributed 55K, roughly 75%. Outside healthcare and social services, nearly every industry showed flat or negative growth, leaving job gains increasingly dependent on a handful of sectors. Meanwhile, the talent war in AI has reached an "NBA level": Meta has offered compensation packages worth hundreds of millions, even billions, to top researchers, far exceeding the pay scales of historic national technology efforts such as the Manhattan Project and the Apollo program. These sky-high salaries not only make AI researchers worth more than most NBA stars but also underscore Silicon Valley's all-in bet on "superintelligence."

Finally, a bonus observation: since the rise of LLMs, academic writing has changed dramatically. In 2024, usage of the word "delves" exceeded its historical average by 2,700%... and an estimated one in seven abstracts has been run through AI (including this article, of course)...

Regardless of the outcome of this investment boom, this is definitely no longer just a game that only affects the capital markets...

by Lex
07.08.2025

Genie 3 has taken the AI world by storm in recent days, stunning everyone with its uncanny consistency and real-time interactivity in generating virtual worlds. The Machine Learning Street Talk podcast featured an exclusive interview with DeepMind’s Shlomi Fruchter and Jack Parker-Holder, offering a rare behind-the-scenes look at Genie 3.

Right from the start, the host declared it “the most mind-blowing AI technology” he’d ever seen. Genie 3 can generate explorable and interactive virtual worlds from a text prompt in just three seconds, maintaining persistent objects and events, and even triggering world events on demand (like having a deer suddenly run into the scene). While the team remains tight-lipped about the underlying architecture, they confirmed that Genie 3 relies on frame-by-frame autoregressive generation—each frame depends on the historical state, ensuring world coherence and interactivity.

The evolution of the Genie series is clear:

  • Genie 1: Trained on 2D platformer game recordings, the model learned “parallax effects” and discovered high-level action semantics such as “jump” and “move left” without supervision.
  • Genie 2: Upgraded to 3D, with resolution bumped to 360p, it could simulate lighting, smoke, water flow, and other physical phenomena, with enhanced memory and object permanence.
  • Genie 3: Leaps to 720p, supports several minutes of real-time interaction, and switches input from images to text prompts, vastly increasing flexibility.

The team particularly emphasized that Genie 3’s “emergent consistency” is not due to explicit 3D modeling (like NeRF), but rather a capability that the neural network acquires spontaneously through large-scale training. With just a single prompt, the model can generate a complete, explorable, interactive environment.

In terms of application prospects, DeepMind positions Genie 3 as the ideal “experiential training ground” for AI robotics. It provides high-fidelity, dynamically controllable simulated environments, dramatically reducing the cost and risk of large-scale training, and accelerating “sim-to-real” transfer. Users can inject rare or extreme events via prompts, making AI more robust and generalizable. As I noted in my earlier report, “Shift Toward the Experience Era,” one of the key weaknesses of large language models is their limited spatiotemporal perception. A world model like Genie 3 both compensates for that gap and offers LLMs an experimental sandbox. The alternating and mutually reinforcing progress of these two technologies could ultimately pave the way for well-rounded general intelligence.

Of course, the team is candid about Genie 3’s current limitations: it does not yet support multi-agent collaboration, its physics and perception still fall short of fully replicating human embodied experience, it demands enormous compute and system stability, and its creativity is still highly dependent on prompts. In the future, Genie 3 is likely to co-evolve with LLMs and reasoning models, moving ever closer to “physics-grade” agent simulation and behavior generation.

People might wonder what technology Genie shares with video generation models like Sora or Veo, and what sets them apart. Here’s a concise technical summary based on current public information:

Shared technical foundations

  • Both compress raw videos into discrete spatiotemporal tokens and condition generation on text prompts (see the toy sketch just after this list).
  • Both are trained with objectives that include reconstruction loss and temporal consistency, ensuring coherent appearance and motion across frames.
  • Both are trained on hundreds of thousands of hours of internet-scale video data, covering diverse scenes and viewpoints.
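
To make the first shared ingredient concrete, here is a minimal toy sketch of what “compressing raw video into discrete spatiotemporal tokens” means. This is not Genie’s or Veo’s actual tokenizer: the tubelet size, codebook size, and random codebook are illustrative stand-ins for a trained VQ-VAE encoder.

```python
import numpy as np

# Toy stand-in for a spatiotemporal video tokenizer (illustrative only).
# A real system uses a trained VQ-VAE encoder; here we just cut the video
# into space-time "tubelets" and snap each one to its nearest codebook
# entry, which is the core idea behind discrete video tokens.

rng = np.random.default_rng(0)

T, H, W, C = 8, 32, 32, 3        # tiny video: 8 frames of 32x32 RGB
pt, ph, pw = 2, 8, 8             # tubelet size: 2 frames x 8x8 pixels (assumed)
K = 512                          # codebook size (assumed)
D = pt * ph * pw * C             # flattened tubelet dimension

video = rng.random((T, H, W, C), dtype=np.float32)
codebook = rng.random((K, D), dtype=np.float32)  # stands in for learned codes

# 1) Cut the video into non-overlapping spatiotemporal tubelets.
tubelets = (
    video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
         .transpose(0, 2, 4, 1, 3, 5, 6)
         .reshape(-1, D)
)  # shape: (num_tokens, D)

# 2) Quantize: each tubelet becomes the index of its nearest codebook entry.
dists = ((tubelets[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
token_ids = dists.argmin(axis=1)  # shape: (num_tokens,)

print(video.size, "raw values ->", token_ids.size, "discrete tokens")
# A generator (autoregressive Transformer or diffusion model) is then trained
# on these token sequences, conditioned on an embedded text prompt.
```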

Key technical differences

  • Token design
    • Genie 3: Uses a spatiotemporal VQ-VAE to produce low-bitrate tokens, and also learns 8-dimensional latent-action tokens that explicitly encode control commands like “move forward” or “turn.”
    • Veo: Employs latent diffusion models (VDM/Stable-Video style) with continuous high-dimensional latent patches, and has no explicit action channel.
  • Architecture & training objective
    • Genie 3: Decoder-only spatiotemporal Transformer trained with teacher-forcing to predict the next frame’s tokens (NLL + KL). At every step, it autoregressively reads all past tokens and actions.
    • Veo: A UNet + cross-attention diffusion network that denoises and reconstructs the entire latent sequence over multiple steps. Training does not condition on actions; it simply learns to reverse the noising of latent sequences.
  • Generation loop
    • Genie 3: Frame-by-frame stepping at 24 fps in real time; each frame can incorporate user actions or text events, recalculating attention on the fly.
    • Veo: Clip-level diffusion (typically 8–16 steps) generates an entire video in one batch; no real-time control can be injected during generation.
  • Physical/interactive effects
    • Genie 3: Thanks to action tokens and recursive history, can maintain several minutes of interactive consistency and implicitly learn dynamics like friction and gravity.
    • Veo: Prioritizes 4K-level visual fidelity and long-range cinematic consistency, but lacks controllable physical causality; its output is more like a linear movie segment.

In short, Genie 3 is an interactive world model built on autoregressive Transformers and action tokens, while Veo and similar models represent offline video generation using diffusion UNets and purely visual tokens. They differ fundamentally in token structure, training targets, inference flow, and the degree of interactivity.
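
To make the “generation loop” distinction concrete, here is a schematic sketch of the two inference styles. The two “models” are toy placeholder functions rather than real APIs; only the control flow matters here: per-frame autoregression that accepts an action at every step, versus clip-level denoising that runs to completion before anyone can intervene.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Toy stand-ins (illustrative only; the real models are large networks) --

def world_model_step(history_tokens, action):
    """Pretend autoregressive world model: predicts the next frame's token
    from all past tokens plus the current action."""
    h = hash((tuple(history_tokens[-16:]), action))  # "attends" over history
    return h % 512                                   # next frame token id

def denoise_step(latents, prompt_embedding):
    """Pretend diffusion step: nudges the whole latent clip toward the data
    manifold, conditioned only on the text prompt (no actions)."""
    return latents - 0.1 * (latents - prompt_embedding)

# --- Genie-style loop: interactive, one frame at a time ---------------------
tokens = [0]                              # world state so far
for step in range(24):                    # e.g. one second at 24 fps
    action = step % 4                     # user input arrives *during* generation
    tokens.append(world_model_step(tokens, action))

# --- Veo-style loop: offline, the whole clip at once ------------------------
prompt_embedding = rng.standard_normal((16, 8))  # toy text conditioning
latents = rng.standard_normal((16, 8))           # toy latent clip (16 frames)
for _ in range(12):                              # fixed number of denoise steps
    latents = denoise_step(latents, prompt_embedding)
# Only now is the clip decoded to video; no control was injected along the way.
```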

by Lex
05.08.2025

We’ve long assumed that LLMs are “frozen” after training—if you want them to learn something new, you need to fine-tune. But Dario Amodei recently hinted that Anthropic is experimenting with models that can “learn” inside 100M-token contexts, without ever updating weights. Sounded mysterious—until Google’s new paper “Learning without training” seems to pull back the curtain.

Here’s the core trick:

1. Self-attention extracts a “delta.”
Feed a query x alone and you get activation A(x). Feed x with context C and you get A(C,x). The difference ΔA = A(C,x) – A(x) is a compact summary of what the context “teaches” the model.

2. MLP turns this delta into a rank-1 sticky note.
When ΔA passes through the MLP, the math is exactly equivalent to adding a rank-1 update ΔW = (W·ΔA)·A(x)ᵀ / ||A(x)||² to the weight matrix W. For a 4,096×4,096 MLP (≈17M numbers), this update only needs two 4,096-dim vectors (≈8k numbers)—a 2,000× compression.

3. Temporary adaptation, no trace left.
During inference, using {W + ΔW, x} gives the same result as {W, [C,x]} (the small numeric check below spells this out). The “sticky note” ΔW is discarded after the forward pass. The base model never changes, but for one moment, it acts as if fine-tuned for your context.
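
Here is a small NumPy check of the identity behind steps 2 and 3, applied at the MLP’s first weight matrix W. The attention outputs below are random placeholders rather than activations from a real model; only the algebra is being verified.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096                                       # matches the 4,096x4,096 example

W = rng.standard_normal((d, d)) / np.sqrt(d)   # MLP weight matrix (~17M numbers)
A_x = rng.standard_normal(d)                   # attention output for x alone
A_cx = rng.standard_normal(d)                  # attention output for x with context C

delta_A = A_cx - A_x                           # what the context "adds"

# Rank-1 "sticky note": deltaW = (W @ deltaA) @ A(x)^T / ||A(x)||^2
delta_W = np.outer(W @ delta_A, A_x) / (A_x @ A_x)

# Patched weights on the context-free activation reproduce exactly what the
# original weights produce on the in-context activation.
lhs = (W + delta_W) @ A_x
rhs = W @ A_cx
print(np.allclose(lhs, rhs))                   # True

# Storage: the full rank-1 matrix vs the two vectors that define it.
print(delta_W.size, "vs", (W @ delta_A).size + A_x.size)   # 16777216 vs 8192
```

The two printed sizes (about 17M versus about 8k numbers) correspond to the roughly 2,000× compression mentioned in step 2.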

This looks quite similar to LoRA, with some key differences: LoRA injects low-rank adapters that are learned and stored offline, while Google’s “sticky note” is computed at runtime, costs zero extra storage, and vanishes instantly. LoRA is like a notebook you carry; the sticky note is a Post-it you jot down and toss.

Why does this matter anyway? If context can imprint such efficient updates, scaling context windows may outpace continual fine-tuning. Combine this with retrieval, external memory, or LoRA-style caches, and you get a spectrum of “learning without training.” The remaining challenge is how to chain these ephemeral updates into true long-term adaptation. As Dario and Altman push toward $100k/month models with 100M-token windows, the future may belong to those who can stack and orchestrate these tricks—unless open research keeps the playing field level.

Paper definitely worth reading: arxiv.org/abs/2507.16003
