
Latent Space: The AI Engineer Podcast

swyx + Alessio

185 episodes

  • Latent Space: The AI Engineer Podcast

    Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay 2

    23/1/2026 | 1 h 32 min
    From shipping Gemini Deep Think and IMO Gold to launching the Reasoning and AGI team in Singapore, Yi Tay has spent the last 18 months living through the full arc of Google DeepMind's pivot from architecture research to RL-driven reasoning—watching his team go from a dozen researchers to 300+, training models that solve International Math Olympiad problems in a live competition, building the infrastructure to scale deep thinking across every domain, and driving Gemini to the top of the leaderboards in every category. Yi returns to dig into the inside story of the IMO effort and more!

    We discuss:

    Yi's path: Brain → Reka → Google DeepMind → Reasoning and AGI team Singapore, leading model training for Gemini Deep Think and IMO Gold

    The IMO Gold story: four co-captains (Yi in Singapore, Jonathan in London, Jordan in Mountain View, and Tong leading the overall effort), training the checkpoint in ~1 week, live competition in Australia with professors punching in problems as they came out, and the tension of not knowing if they'd hit Gold until the human scores came in (because the Gold threshold is a percentile, not a fixed number)

    Why they threw away AlphaProof: "If one model can't do it, can we get to AGI?" The decision to abandon symbolic systems and bet on end-to-end Gemini with RL was bold and non-consensus

    On-policy vs. off-policy RL: off-policy is imitation learning (copying someone else's trajectory), on-policy is the model generating its own outputs, getting rewarded, and training on its own experience—"humans learn by making mistakes, not by copying" (see the toy sketch at the end of this list)

    Why self-consistency and parallel thinking are fundamental: sampling multiple times, majority voting, LM judges, and internal verification are all forms of self-consistency that unlock reasoning beyond single-shot inference (a minimal majority-voting example also follows this list)

    The data efficiency frontier: humans learn from 8 orders of magnitude less data than models, so where's the bug? Is it the architecture, the learning algorithm, backprop, off-policyness, or something else?

    Three schools of thought on world models: (1) Genie/spatial intelligence (video-based world models), (2) Yann LeCun's JEPA + FAIR's code world models (modeling internal execution state), (3) the amorphous "resolution of possible worlds" paradigm (curve-fitting to find the world model that best explains the data)

    Why AI coding crossed the threshold: Yi now runs a job, gets a bug, pastes it into Gemini, and relaunches without even reading the fix—"the model is better than me at this"

    The Pokémon benchmark: can models complete the Pokédex by searching the web, synthesizing guides, and applying knowledge in a visual game state? "Efficient search of novel idea space is interesting, but we're not even at the point where models can consistently apply knowledge they look up"

    DSI and generative retrieval: re-imagining search as predicting document identifiers with semantic tokens, now deployed at YouTube (semantic IDs for RecSys) and Spotify

    Why RecSys and IR feel like a different universe: "modeling dynamics are strange, like gravity is different—you hit the shuttlecock and hear glass shatter, cause and effect are too far apart"

    The closed lab advantage is increasing: the gap between frontier labs and open source is growing because ideas compound over time, and researchers keep finding new tricks that play well with everything built before

    Why ideas still matter: "the last five years weren't just blind scaling—transformers, pre-training, RL, self-consistency, all had to play well together to get us here"

    Gemini Singapore: hiring for RL and reasoning researchers, looking for track record in RL or exceptional achievement in coding competitions, and building a small, talent-dense team close to the frontier
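    As a companion to the on-policy vs. off-policy point above, here is a minimal, runnable Python toy (a two-action REINFORCE-style policy; entirely illustrative and unrelated to Gemini's actual training stack) contrasting imitation on an expert action with learning from the model's own sampled attempts:

        # Toy contrast: off-policy imitation vs. on-policy RL (REINFORCE),
        # on a 2-action "policy" parameterized by logits. Purely illustrative.
        import numpy as np

        rng = np.random.default_rng(0)
        logits = np.zeros(2)      # policy over two candidate answers
        CORRECT = 1               # the answer a verifier would reward
        LR = 0.5

        def probs(z):
            e = np.exp(z - z.max())
            return e / e.sum()

        def off_policy_step(expert_action):
            # Imitation: push probability toward someone else's trajectory.
            p = probs(logits)
            grad = -p
            grad[expert_action] += 1.0          # gradient of log p(expert_action)
            logits[:] += LR * grad

        def on_policy_step():
            # On-policy: sample your own answer, get a reward, reinforce it.
            p = probs(logits)
            action = rng.choice(2, p=p)         # the model's own attempt, mistakes included
            reward = 1.0 if action == CORRECT else 0.0
            grad = -p
            grad[action] += 1.0
            logits[:] += LR * reward * grad     # learn only from your own experience

        for _ in range(50):
            on_policy_step()
        print("p(correct) after on-policy RL:", probs(logits)[CORRECT])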

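    And a minimal illustration of the self-consistency idea above: sample the same question several times and majority-vote the final answer (sample_answer is a stand-in for a stochastic LLM call, not any real API):

        import random
        from collections import Counter

        def sample_answer(question: str) -> str:
            # Stand-in for a stochastic model: right 60% of the time per sample.
            return "42" if random.random() < 0.6 else random.choice(["41", "43"])

        def self_consistent_answer(question: str, n_samples: int = 11) -> str:
            votes = Counter(sample_answer(question) for _ in range(n_samples))
            return votes.most_common(1)[0][0]   # majority vote beats single-shot accuracy

        print(self_consistent_answer("What is 6 * 7?"))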


    Yi Tay

    Google DeepMind: https://deepmind.google

    X: https://x.com/YiTayML

    Chapters

    00:00:00 Introduction: Returning to Google DeepMind and the Singapore AGI Team
    00:04:52 The Philosophy of On-Policy RL: Learning from Your Own Mistakes
    00:12:00 IMO Gold Medal: The Journey from AlphaProof to End-to-End Gemini
    00:21:33 Training IMO Cat: Four Captains Across Three Time Zones
    00:26:19 Pokemon and Long-Horizon Reasoning: Beyond Academic Benchmarks
    00:32:59 Reasoning, Chain of Thought, and Latent Thinking
    00:36:29 AI Coding Assistants: From Lazy to Actually Useful
    00:44:46 Is Attention All You Need? Architecture, Learning, and the Local Minima
    00:55:04 Data Efficiency and World Models: The Next Frontier
    01:08:12 DSI and Generative Retrieval: Reimagining Search with Semantic IDs
    01:17:59 Building GDM Singapore: Geography, Talent, and the Symposium
    01:24:18 Hiring Philosophy: High Stats, Research Taste, and Student Budgets
    01:28:49 Health, HRV, and Research Performance: The 23kg Journey
  • Latent Space: The AI Engineer Podcast

    Brex’s AI Hail Mary — with CTO James Reggio

    17/1/2026 | 1 h 13 min
    From building internal AI labs to becoming CTO of Brex, James Reggio has helped lead one of the most disciplined AI transformations inside a real financial institution where compliance, auditability, and customer trust actually matter.

    We sat down with Reggio to unpack Brex’s three-pillar AI strategy (corporate, operational, and product AI) [https://www.brex.com/journal/brex-ai-native-operations], how SOP-driven agents beat overengineered RL in ops, why Brex lets employees “build their own AI stack” instead of picking winners [https://www.conductorone.com/customers/brex/], and how a small, founder-heavy AI team is shipping production agents to 40,000+ companies. Reggio also goes deep on Brex’s multi-agent “network” architecture, evals for multi-turn systems, agentic coding’s second-order effects on codebase understanding, and why the future of finance software looks less like dashboards and more like executive assistants coordinating specialist agents behind the scenes.

    We discuss:

    Brex’s three-pillar AI strategy: corporate AI for 10x employee workflows, operational AI for cost and compliance leverage, and product AI that lets customers justify Brex as part of their AI strategy to the board

    Why SOP-driven agents beat overengineered RL in finance ops, and how breaking work into auditable, repeatable steps unlocked faster automation in KYC, underwriting, fraud, and disputes

    Building an internal AI platform early: LLM gateways, prompt/version management, evals, cost observability, and why platform work quietly became the force multiplier behind everything else

    Multi-agent “networks” vs single-agent tools: why Brex’s EA-style assistant coordinates specialist agents (policy, travel, reimbursements) through multi-turn conversations instead of one-shot tool calls

    The audit agent pattern: separating detection, judgment, and follow-up into different agents to reduce false negatives without overwhelming finance teams (a hypothetical sketch of this split appears after this list)

    Centralized AI teams without resentment: how Brex avoided “AI envy” by tying work to business impact and letting anyone transfer in if they cared deeply enough

    Letting employees build their own AI stack: ChatGPT vs Claude vs Gemini, Cursor vs Windsurf, and why Brex refuses to pick winners in fast-moving tool races

    Measuring adoption without vanity metrics: why “% of code written by AI” is the wrong KPI and what second-order effects (slop, drift, code ownership) actually matter

    Evals in the real world: regression tests from ops QA, LLM-as-judge for multi-turn agents, and why integration-style evals break faster than you expect

    Teaching AI fluency at scale: the user → advocate → builder → native framework, ops-led training, spot bonuses, and avoiding fear-based adoption

    Re-interviewing the entire engineering org: using agentic coding interviews internally to force hands-on skill upgrades without formal performance scoring

    Headcount in the age of agents: why Brex grew the business without growing engineering, and why AI amplifies bad architecture as fast as good decisions

    The future of finance software: why dashboards fade, assistants take over, and agent-to-agent collaboration becomes the real UI
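    To make the audit-agent split above concrete, here is a hypothetical Python sketch of the detection / judgment / follow-up separation (the agents are plain stand-in functions and the policy rules are made up; this is not Brex's actual system):

        from dataclasses import dataclass

        @dataclass
        class Expense:
            id: str
            amount: float
            memo: str

        def detection_agent(expenses):
            # Tuned for recall: flag anything that might violate policy.
            return [e for e in expenses if e.amount > 500 or not e.memo]

        def judgment_agent(e: Expense) -> bool:
            # A narrower check (in practice an LLM with policy context) confirms or clears.
            return e.amount > 500 and "client" not in e.memo.lower()

        def followup_agent(e: Expense) -> str:
            return f"Requesting receipt and justification for expense {e.id}."

        def audit(expenses):
            flagged = detection_agent(expenses)                     # high-recall pass
            confirmed = [e for e in flagged if judgment_agent(e)]   # precision layer
            return [followup_agent(e) for e in confirmed]           # only confirmed issues reach the team

        print(audit([Expense("e1", 620.0, "team dinner"), Expense("e2", 90.0, "")]))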



    James Reggio

    X: https://x.com/jamesreggio

    LinkedIn: https://www.linkedin.com/in/jamesreggio/

    Where to find Latent Space

    X: https://x.com/latentspacepod

    Substack: https://www.latent.space/

    Chapters

    00:00:00 Introduction
    00:01:24 From Mobile Engineer to CTO: The Founder's Path
    00:03:00 Quitters Welcome: Building a Founder-Friendly Culture
    00:05:13 The AI Team Structure: 10-Person Startup Within Brex
    00:11:55 Building the Brex Agent Platform: Multi-Agent Networks
    00:13:45 Tech Stack Decisions: TypeScript, Mastra, and MCP
    00:16:40 The Brex Assistant: Executive Assistant for Every Employee
    00:24:32 Operational AI: Automating Underwriting, KYC, and Fraud
    00:37:11 Agentic Coding Adoption: Cursor, Windsurf, and the Engineering Interview
    00:40:26 Evaluation Strategy: From Simple SOPs to Multi-Turn Evals
    00:58:51 AI Fluency Levels: From User to Native
    01:03:33 The Future of Engineering Headcount and AI Leverage
    01:09:14 The Audit Agent Network: Finance Team Agents in Action
  • Latent Space: The AI Engineer Podcast

    Artificial Analysis: The Independent LLM Analysis House — with George Cameron and Micah Hill-Smith

    09/1/2026 | 1 h 18 min
    Don’t miss George’s AIE talk: https://www.youtube.com/watch?v=sRpqPgKeXNk

    —-

    From launching a side project in a Sydney basement to becoming the independent gold standard for AI benchmarking—trusted by developers, enterprises, and every major lab to navigate the exploding landscape of models, providers, and capabilities—George Cameron and Micah Hill-Smith have spent two years building Artificial Analysis into the platform that answers the questions no one else will: Which model is actually best for your use case? What are the real speed-cost trade-offs? And how open is "open" really?

    We discuss:

    The origin story: built as a side project in 2023 while Micah was building a legal AI assistant, launched publicly in January 2024, and went viral after Swyx's retweet

    Why they run evals themselves: labs prompt models differently, cherry-pick chain-of-thought examples (Google Gemini 1.0 Ultra used 32-shot prompts to beat GPT-4 on MMLU), and self-report inflated numbers

    The mystery shopper policy: they register accounts not on their own domain and run intelligence + performance benchmarks incognito to prevent labs from serving different models on private endpoints

    How they make money: enterprise benchmarking insights subscription (standardized reports on model deployment, serverless vs. managed vs. leasing chips) and private custom benchmarking for AI companies (no one pays to be on the public leaderboard)

    The Intelligence Index (V3): synthesizes 10 eval datasets (MMLU, GPQA, agentic benchmarks, long-context reasoning) into a single score, with 95% confidence intervals via repeated runs (see the small confidence-interval sketch at the end of this list)

    Omissions Index (hallucination rate): scores models from -100 to +100 (penalizing incorrect answers, rewarding "I don't know"), and Claude models lead with the lowest hallucination rates despite not always being the smartest (an illustrative scoring rule also follows this list)

    GDP Val AA: their version of OpenAI's GDPval (44 white-collar tasks with spreadsheets, PDFs, PowerPoints), run through their Stirrup agent harness (up to 100 turns, code execution, web search, file system), graded by Gemini 3 Pro as an LLM judge (tested extensively, no self-preference bias)

    The Openness Index: scores models 0-18 on transparency of pre-training data, post-training data, methodology, training code, and licensing (AI2 OLMo 2 leads, followed by Nous Hermes and NVIDIA Nemotron)

    The smiling curve of AI costs: GPT-4-level intelligence is 100-1000x cheaper than at launch (thanks to smaller models like Amazon Nova), but frontier reasoning models in agentic workflows cost more than ever (sparsity, long context, multi-turn agents)

    Why sparsity might go way lower than 5%: GPT-4.5 is ~5% active, Gemini models might be ~3%, and Omissions Index accuracy correlates with total parameters (not active), suggesting massive sparse models are the future

    Token efficiency vs. turn efficiency: GPT-5 costs more per token but solves Tau-bench in fewer turns (cheaper overall), and models are getting better at using more tokens only when needed (GPT-5.1 Codex has tighter token distributions); a back-of-envelope cost comparison follows this list

    V4 of the Intelligence Index coming soon: adding GDP Val AA, Critical Point, hallucination rate, and dropping some saturated benchmarks (human-eval-style coding is now trivial for small models)
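    To illustrate the repeated-runs point in the Intelligence Index item above, here is a small normal-approximation sketch in Python (the run scores are made up, and this is not Artificial Analysis' exact methodology):

        import statistics as stats

        runs = [62.1, 63.4, 61.8, 62.9, 63.0]        # hypothetical index scores from repeated runs
        mean = stats.mean(runs)
        sem = stats.stdev(runs) / len(runs) ** 0.5   # standard error of the mean
        lo, hi = mean - 1.96 * sem, mean + 1.96 * sem
        print(f"Intelligence Index: {mean:.1f} (95% CI {lo:.1f}-{hi:.1f})")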

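    For the Omissions Index item above, one scoring rule consistent with the description (wrong answers penalized, honest abstentions rewarded, scaled to the -100..+100 range); the exact weights here are assumptions, not Artificial Analysis' published formula:

        def omissions_score(results):
            # results: list of 'correct' | 'incorrect' | 'abstain' outcomes.
            per_item = {"correct": 1.0, "abstain": 0.5, "incorrect": -1.0}  # assumed weights
            raw = sum(per_item[r] for r in results) / len(results)
            return 100 * raw   # -100 (all wrong) .. +100 (all correct)

        print(omissions_score(["correct", "abstain", "incorrect", "correct"]))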

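    And a back-of-envelope illustration of the token-vs-turn trade-off above: a model that is pricier per token but finishes in fewer turns can still be cheaper end to end (all numbers are made up):

        def task_cost(price_per_mtok, tokens_per_turn, turns):
            return price_per_mtok * tokens_per_turn * turns / 1_000_000

        cheap_but_chatty = task_cost(price_per_mtok=3.0, tokens_per_turn=4_000, turns=30)
        pricey_but_terse = task_cost(price_per_mtok=10.0, tokens_per_turn=3_000, turns=8)
        print(f"cheaper per token: ${cheap_but_chatty:.2f} vs pricier per token: ${pricey_but_terse:.2f}")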

    Artificial Analysis

    Website: https://artificialanalysis.ai

    George Cameron on X: https://x.com/grmcameron

    Micah Hill-Smith on X: https://x.com/_micah_h

    Chapters

    00:00:00 Introduction: Full Circle Moment and Artificial Analysis Origins
    00:01:08 Business Model: Independence and Revenue Streams
    00:04:00 The Origin Story: From Legal AI to Benchmarking
    00:07:00 Early Challenges: Cost, Methodology, and Independence
    00:16:13 AI Grant and Moving to San Francisco
    00:18:58 Evolution of the Intelligence Index: V1 to V3
    00:27:55 New Benchmarks: Hallucination Rate and Omissions Index
    00:33:19 Critical Point and Frontier Physics Problems
    00:35:56 GDPVAL AA: Agentic Evaluation and Stirrup Harness
    00:51:47 The Openness Index: Measuring Model Transparency
    00:57:57 The Smiling Curve: Cost of Intelligence Paradox
    01:04:00 Hardware Efficiency and Sparsity Trends
    01:07:43 Reasoning vs Non-Reasoning: Token Efficiency Matters
    01:10:47 Multimodal Benchmarking and Community Requests
    01:14:50 Looking Ahead: V4 Intelligence Index and Beyond


About Latent Space: The AI Engineer Podcast

The podcast by and for AI Engineers! In 2024, over 2 million readers and listeners came to Latent Space to hear about news, papers and interviews in Software 3.0. We cover Foundation Models changing every domain in Code Generation, Multimodality, AI Agents, GPU Infra and more, directly from the founders, builders, and thinkers involved in pushing the cutting edge. Striving to give you everything from the definitive take on the Current Thing to the first introduction to the tech you'll be using in the next 3 months! We break news and exclusive interviews from OpenAI, Anthropic, Gemini, Meta (Soumith Chintala), Sierra (Bret Taylor), tiny (George Hotz), Databricks/MosaicML (Jon Frankle), Modular (Chris Lattner), Answer.ai (Jeremy Howard), et al. Full show notes always on https://latent.space
Podcast website
