
ThursdAI - The top AI news from the past week

From Weights & Biases, Join AI Evangelist Alex Volkov and a panel of experts to cover everything important that happened in the world of AI from the past week
Latest episode

Available episodes

5 of 103
  • ThursdAI - May 8th - new Gemini pro, Mistral Medium, OpenAI restructuring, HeyGen Realistic Avatars & more AI news
Hey folks, Alex here (yes, real me, not my AI avatar, yet).

Compared to previous weeks, this week was pretty "chill" in the world of AI, though we did get a pretty significant Gemini 2.5 Pro update; it basically beat itself on the Arena. With Mistral releasing a new medium model (not OSS) and Nvidia finally dropping Nemotron Ultra (both ignoring Qwen 3 performance), there were also a few open source updates. To me, the highlight of this week was a breakthrough in AI avatars: with HeyGen's new IV model beating ByteDance's OmniHuman (our coverage) and Hedra Labs, they've set an absolute SOTA benchmark for turning one photo into an animated, realistic avatar. Hell, let me record all this real quick and show you how good it is! How good is that?? I'm still kind of blown away. I have managed to get a free-month promo code for you guys; look for it in the TL;DR section at the end of the newsletter.

Of course, if you'd rather watch than listen or read, here's our live recording on YT.

Open Source AI

NVIDIA's Nemotron Ultra V1: Refining the Best with a Reasoning Toggle 🧠

NVIDIA also threw their hat further into the ring with the release of Nemotron Ultra V1, alongside updated Super and Nano versions. We've talked about Nemotron before – these are NVIDIA's pruned and distilled versions of Llama 3.1, and they've been impressive. The Ultra version is the flagship, a 253 billion parameter dense model (distilled and pruned from Llama 3.1 405B), and it's packed with interesting features.

One of the coolest things is the dynamic reasoning toggle. You can literally tell the model "detailed thinking on" or "detailed thinking off" via a system prompt during inference. This is something Qwen also supports, and it looks like the industry is converging on this idea of letting users control the "depth" of thought, which is super neat.

Nemotron Ultra boasts a 128K context window and, impressively, can fit on a single 8xH100 node thanks to Neural Architecture Search (NAS) and FFN-Fusion. And performance-wise, it actually outperforms the Llama 3 405B model it was distilled from, which is a big deal. NVIDIA shared a chart from Artificial Analysis (dated April 2025, notably before Qwen3's latest surge) showing Nemotron Ultra standing strong among models like Gemini 2.5 Flash and Opus 3 Mini.

What's also great is NVIDIA's commitment to openness here: they've released the models under a commercially permissive NVIDIA Open Model License, the complete post-training dataset (Llama-Nemotron-Post-Training-Dataset), and their training codebases (NeMo, NeMo-Aligner, Megatron-LM). This allows for reproducibility and further community development. Yam Peleg pointed out the cool stuff they did with Neural Architecture Search to optimally reduce parameters without losing performance.

Absolute Zero: AI Learning to Learn, Zero (curated) Data Required! (Arxiv)

LDJ brought up a fascinating paper that ties into this theme of self-improvement and reinforcement learning: "Absolute Zero: Reinforced Self-play Reasoning with Zero Data" from Andrew Zhao (Tsinghua University) and a few others.

The core idea here is a system that self-evolves its training curriculum and reasoning ability. Instead of needing a pre-curated dataset of problems, the model creates the problems itself (e.g., code reasoning tasks) and then uses something like a code executor to validate its proposed solutions, serving as a unified source of verifiable reward. It's open-ended yet grounded learning. By having a verifiable environment (code either works or it doesn't), the model can essentially teach itself to code without external human-curated data.

The paper shows fine-tunes of Qwen models (like Qwen Coder) achieving state-of-the-art results on benchmarks like MBPP and AIME (Math Olympiad) with no pre-existing data for those problems. The model hallucinates questions, creates its own rewards, learns, and improves. This is a step beyond synthetic data, where humans are still largely in charge of generation. It's wild, and it points towards a future where AI systems could become increasingly autonomous in their learning.
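For the curious, here's a tiny, self-contained toy of that loop. This is my own illustration, not the paper's code: a stand-in "proposer" invents a small checkable coding task, a stand-in "solver" attempts it, and a code executor grades the attempt to produce the verifiable reward (the function names and the toy task are all hypothetical).

```python
# Toy sketch of the "Absolute Zero"-style loop described above: the model proposes
# its own coding tasks, attempts them, and a code executor provides the verifiable
# reward. `propose_task` / `solve_task` stand in for LLM calls and are hypothetical.
import random


def propose_task(rng: random.Random) -> dict:
    """Stand-in for the proposer: invent a small, checkable coding problem."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    return {"prompt": f"write f(x) returning x*{a}+{b}", "a": a, "b": b}


def solve_task(task: dict, rng: random.Random) -> str:
    """Stand-in for the solver: emit candidate code (sometimes wrong on purpose)."""
    a = task["a"] if rng.random() > 0.2 else task["a"] + 1  # occasional mistake
    return f"def f(x):\n    return x * {a} + {task['b']}"


def verify(task: dict, code: str) -> float:
    """The code executor: run the candidate and grade it against the task spec."""
    scope: dict = {}
    try:
        exec(code, scope)  # executing untrusted code is only OK in a toy like this
        ok = all(scope["f"](x) == x * task["a"] + task["b"] for x in range(5))
        return 1.0 if ok else 0.0
    except Exception:
        return 0.0


rng = random.Random(0)
rewards = []
for step in range(20):
    task = propose_task(rng)
    candidate = solve_task(task, rng)
    rewards.append(verify(task, candidate))  # this reward would drive the RL update

print(f"mean verifiable reward over 20 self-play steps: {sum(rewards)/len(rewards):.2f}")
```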
Big Companies & APIs

Google dropped another update to their Gemini 2.5 Pro, this time the "IO edition" preview, specifically touting enhanced coding performance. This new version jumped to the #1 spot on WebDev Arena (a benchmark where human evaluators choose between two side-by-side code generations in VS Code), with a +147 Elo point gain, surpassing Claude 3.7 Sonnet. It also showed improvements on benchmarks like LiveCodeBench (up 7.39%) and Aider Polyglot (up ~3-6%). Google also highlighted its state-of-the-art video understanding (84.8% on VideoMME) with examples like generating code from a video of an app, which essentially lets you record a drawing of how your app interaction should happen, and the model will use that video as instructions! It's pretty cool.

Though not everyone was as impressed: folks noted that while it gained on a few evals, this model also regressed on several others, including Vibe-Eval (Reka's multimodal benchmark), Humanity's Last Exam, AIME, MMMU, and even long-context understanding (MRCR). It's a good reminder that model updates often involve trade-offs – you can't always win at everything.

BREAKING: Gemini's Implicit Caching - A Game Changer for Costs! 💰

Just as we were wrapping up this segment on the show, news broke that Google launched implicit caching in Gemini APIs! This is a huge deal for developers.

Previously, Gemini offered explicit caching, where you had to manually tell the API what context to cache – a bit of a pain. Now, with implicit caching, the system automatically enables up to 75% cost savings when your request hits a cache. This is fantastic, especially for long-context applications, which is where Gemini's 1-2 million token context window really shines. If you're repeatedly sending large documents or codebases, this will significantly reduce your API bills. OpenAI has had automatic caching for a while, and it's great to see Google matching this for a much better developer experience and cost-effectiveness. It also saves Google a ton on inference, so it's a win-win!
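If you want to lean into this, the main practical trick (per Google's guidance as I understand it) is to keep the big, repeated context at the front of the request and the changing question at the end, so the automatic prefix cache can hit. A minimal sketch with the google-genai Python SDK, where the model id string and file name are just placeholders:

```python
# Minimal sketch (assumptions: the `google-genai` SDK interface shown here, an API
# key in the environment, and an illustrative model id). The point: keep the large,
# identical context at the *front* of each request so the implicit prefix cache can
# hit, and put the part that changes (the question) at the end.
from google import genai

client = genai.Client()  # picks up the API key from the environment

big_codebase = open("repo_dump.txt").read()  # large and identical across requests


def ask(question: str) -> str:
    response = client.models.generate_content(
        model="gemini-2.5-pro-preview-05-06",  # illustrative model id
        contents=[big_codebase, question],      # shared prefix first, query last
    )
    return response.text


print(ask("Where is the retry logic implemented?"))
print(ask("List all public API endpoints."))  # second call should hit the cache
```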
Mistral Medium 3: The Closed Turn 😥

Mistral, once the darling of the open-source community for models like Mistral 7B and Mixtral, announced Mistral Medium 3. The catch? It's not open source.

They're positioning it as a multimodal frontier model with 128K context, claiming it matches or surpasses GPT-4-class benchmarks while being cheaper (priced at $0.40/M input and $2/M output tokens). However, they haven't included Gemini 2.5 Flash in the comparison, which is 70% cheaper while also being faster, nor did they mention Qwen.

Nisten voiced a sentiment many in the community share: he used to use LeChat frequently because he knew and understood the underlying open-source models. Now, with a closed model, it's a black box. It's a bit like how music pirates often end up being the biggest buyers – understanding the open model often leads to more commercial usage.

Wolfram offered a European perspective, noting that Mistral, as a European company, might have a unique advantage with businesses concerned about GDPR and data sovereignty, who might be hesitant to use US or Chinese cloud APIs. For them, a strong European alternative, even if closed, could be appealing.

OpenAI's New Chapter: Restructuring for the Future

OpenAI announced an evolution in its corporate structure. The key points are:
* The OpenAI non-profit will continue to control the entire organization.
* The existing for-profit LLC will become a Public Benefit Corporation (PBC).
* The non-profit will be a significant owner of the PBC and will control it.
* Both the non-profit and PBC will continue to share the same mission: ensuring AGI benefits all of humanity.

This move seems to address some of the governance concerns that have swirled around OpenAI, particularly in light of Elon Musk's lawsuit regarding its shift from a non-profit to a capped-profit entity. LDJ explained that the main worry for many was whether the non-profit would lose control or its stake in the main research/product arm. This restructuring appears to ensure the non-profit remains at the helm and that the PBC is legally bound to the non-profit's mission, not just investor interests. It's an important step for a company with such a profound potential impact on society.

And in related OpenAI news, the acquisition of Windsurf (the VS Code fork) for a reported $3 billion went through, while Cursor (another VS Code fork) announced a $9 billion valuation. It's wild to see these developer tools, which are essentially forks with an AI layer, reaching such massive valuations. Microsoft's hand is in all of this too – investing in OpenAI, invested in Cursor, owning VS Code, and now OpenAI buying Windsurf. It's a tangled web!

Finally, a quick mention that Sam Altman (OpenAI), Lisa Su (AMD), Mike Intrator (CoreWeave - my new CEO!), and folks from Microsoft were testifying before the U.S. Senate today about how to ensure America leads in AI and what innovation means. These conversations are crucial as AI continues to reshape our world.

This Week's Buzz - Come Vibe with Us at Fully Connected! (SF, June 18-19) 🎉

Our two-day conference, Fully Connected, is happening in San Francisco on June 18th and 19th, and it's going to be awesome! We've got an incredible lineup of speakers, including Joe Spisak from the Llama team at Meta and Varun from Windsurf. It's two full days of programming, learning, and connecting with folks at the forefront of AI.

And because you're part of the ThursdAI family, I've got a special promo code for you: use WBTHURSAI to get a free ticket on me! If you're in or around SF, I'd love to see you there. Come hang out, learn, and vibe with us! Register at fullyconnected.com

Hackathon Update: Moved to July! 🗓️

The AGI Evals & Agentic Tooling (A2A) + MCP Hackathon that I was super excited to co-host has been postponed to July 12th-13th. Mark your calendars! I'll share more details and the invite soon.

W&B Joins CoreWeave! A New Era Begins! 🚀

And the big personal news for me and the entire Weights & Biases team: the acquisition of Weights & Biases by CoreWeave has been completed! CoreWeave is the ultra-fast-growing provider of GPUs that powers so much of the AI ecosystem.

So, from now on, it's Alex Volkov, AI Evangelist at Weights & Biases, from CoreWeave!
(And as always, the opinions I share here are my own and not necessarily those of CoreWeave, especially important now that they're a public company!). I'm incredibly excited about this new chapter. W&B isn't going anywhere as a product; if anything, this will empower us to build even better developer tooling and integrate more deeply to help you run your models wherever you choose. Expect more cool stuff to come, especially as I figure out where all those spare GPUs are lying around at CoreWeave! 😉

Vision & Video

AI Avatars SOTA with HeyGen IV

Ok, as you saw above, the HeyGen IV avatars are absolutely bonkers. I did a comparison thread on X, and HeyGen's new model absolutely takes the SOTA crown over ByteDance's OmniHuman and Hedra Labs! All you need to do is upload one image of yourself – it can even be an AI-generated image, a side profile, a dog, or an anime character – and they will generate up to 30 seconds of incredibly lifelike avatar video with the audio you provide!

I was so impressed with this, I reached out to HeyGen and scored a one-month free code for you all: use THURSDAY4 and get a free month to try it out. Please tag me in whatever you create if you publish; I'd love to see where you take this!

Quick Hits: Lightricks LTXV & HunyuanCustom

Briefly, on the open-weights video front:
* Lightricks LTXV 13B: The company from Jerusalem released an upgraded 13 billion parameter version of their LTX video model. It requires more VRAM but offers higher quality, keyframe and character movement support, multi-shot support, and multi-keyframe conditioning (a feature Sora famously has). It's fully open and supports LoRAs for custom styles.
* HunyuanCustom: From Tencent, this model is about to be released (GitHub/Hugging Face links were briefly up then down). It promises multi-modal, subject-consistent video generation without LoRAs, based on a subject you provide (image, and eventually video/audio). It can take an image of a person or object and generate a video with that subject consistently. They also teased audio conditioning – making an avatar sing or speak based on input audio – and even style transfer where you can replace a character in a video with another reference image, all looking very promising for open source.

The World of AI Audio

Just a couple of quick mentions in the audio space:
* ACE-Step 3.5B: From StepFun, this is a 3.5 billion parameter, fully open-source (Apache-2.0) foundation model for music generation. It uses a diffusion-based approach and can synthesize up to 4 minutes of music in just 20 seconds on an A100 GPU. It's not quite at Suno/Udio levels yet, but it's a strong open-source contender.
* NVIDIA Parakeet TDT 0.6B V2: NVIDIA released this 600 million parameter transcription model that is blazing fast. It can transcribe 60 minutes of audio in just one second on production GPUs and works well locally too. It currently tops the OpenASR leaderboard on Hugging Face for English transcription and is a very strong Whisper competitor, especially for speed (see the quick sketch below).
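Since a few of you asked how to actually run Parakeet locally, here's a minimal sketch. I'm assuming the NeMo toolkit's usual from_pretrained / transcribe pattern here; the model id and audio file name are illustrative, so double-check against the model card:

```python
# Minimal local transcription sketch (assumptions: the NeMo ASR `from_pretrained` /
# `transcribe` pattern and the "nvidia/parakeet-tdt-0.6b-v2" model id; the audio
# file name is illustrative). Requires something like: pip install -U "nemo_toolkit[asr]"
import nemo.collections.asr as nemo_asr

# Downloads the 0.6B Parakeet TDT checkpoint on first run.
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v2"
)

# Transcribe one or more 16 kHz mono WAV files in a single batched call.
results = asr_model.transcribe(["meeting_recording.wav"])
# Depending on the NeMo version, entries are plain strings or hypothesis objects.
print(results[0].text if hasattr(results[0], "text") else results[0])
```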
Conclusion and TL;DR

* Hosts and Guests
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
* Co Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed
* Open Source LLMs
* Wolfram's Qwen3 evals (X, Github)
* NVIDIA - Nemotron Ultra V1 (+ updated Super & Nano) (HF)
* Cognition Kevin-32B = K(ernel D)evin - RL for writing CUDA kernels (Blog, HF)
* Absolute Zero: Reinforced Self-play Reasoning with Zero Data (ArXiv)
* Big CO LLMs + APIs
* Gemini Pro 2.5 IO tops ... Gemini 2.5 as the top LLM (Blog)
* Mistral Medium 3 - (Blog | X)
* Figma announces Figma Make - Bolt/Lovable competitor (Figma)
* OpenAI Restructures: Nonprofit Keeps Control, LLC Becomes PBC (Blog)
* Cursor worth $9B while Windsurf sells to OpenAI at $3B
* Sam Altman, Lisa Su, Mike Intrator testify in Senate (Youtube)
* This week's Buzz
* Fully Connected: W&B's 2-day conference, June 18-19 in SF fullyconnected.com - Promo Code WBTHURSAI
* Hackathon moved to July 12-13
* Vision & Video
* Lightricks' new "open weights" LTXV 13B (LTX Studio, HF)
* HeyGen Avatar IV - SOTA digital avatars - 1 month for free with THURSDAY4 (X, HeyGen)
* HunyuanCustom - multi-modal subject-consistent video generation model (Examples, Github, HF)
* Voice & Audio
* ACE-Step 3.5B: open-source foundation model for AI music generation (project)
* Nvidia - Parakeet TDT 0.6B V2 - transcribe 60 minutes of audio in just 1 second (HF, Demo)

So, there you have it – a "chill" week that still managed to deliver some incredible advancements, particularly in AI avatars with HeyGen, continued strength in open-source models like Qwen3, and Google's relentless push with Gemini. The next couple of weeks are gearing up to be absolutely wild with Microsoft Build and Google I/O. I expect a deluge of announcements, and you can bet we'll be here on ThursdAI to break it all down for you.

Thanks to Yam, Wolfram, LDJ, and Nisten for their insights on the show, and thanks to all of you for tuning in, reading, and being part of this amazing community. We stay up to date so you don't have to!

Catch you next week!

Cheers,
Alex

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:33:54
  • 📆 ThursdAI - May 1- Qwen 3, Phi-4, OpenAI glazegate, RIP GPT4, LlamaCon, LMArena in hot water & more AI news
Hey everyone, Alex here 👋

Welcome back to ThursdAI! And wow, what a week. Seriously, strap in, because the AI landscape just went through some seismic shifts. We're talking about a monumental open-source release from Alibaba with Qwen 3 that has everyone buzzing (including us!), Microsoft dropping Phi-4 with Reasoning, a rather poignant farewell to a legend (RIP GPT-4 – we'll get to the wake shortly), major drama around ChatGPT's "glazing" incident and the subsequent rollback, updates from LlamaCon, a critical look at Chatbot Arena, and a fantastic deep dive into the world of AI evaluations with two absolute experts, Hamel Husain and Shreya Shankar.

This week felt like a whirlwind, with open source absolutely dominating the headlines. Qwen 3 didn't just release a model; they dropped an entire ecosystem, setting a potential new benchmark for open-weight releases. And while we pour one out for GPT-4, we also have to grapple with the real-world impact of models like ChatGPT, highlighted by the "glazing" fiasco. Plus, video consistency takes a leap forward with Runway, and we got breaking news live on the show from Claude!

So grab your coffee (or beverage of choice), settle in, and let's unpack this incredibly eventful week in AI.

Open-Source LLMs

Qwen 3 — "Hybrid Thinking" on Tap

Alibaba open-weighted the entire Qwen 3 family this week, releasing two MoE titans (up to 235B total / 22B active) and six dense siblings all the way down to 0.6B, all under Apache 2.0. Day-one support landed in LM Studio, Ollama, vLLM, MLX and llama.cpp.

The headline trick is a runtime thinking toggle—drop "/think" to expand chain-of-thought or "/no_think" to sprint. On my Mac, the 30B-A3B model hit 57 tokens/s when paired with speculative decoding (drafted by the 0.6B sibling).

Other goodies:
* 36T pre-training tokens (2× Qwen 2.5)
* 128K context on ≥8B variants (32K on the tinies)
* 119-language coverage, widest in open source
* Built-in MCP schema so you can pair with Qwen-Agent
* The dense 4B model actually beats Qwen 2.5-72B-Instruct on several evals—at Raspberry-Pi footprint

In short: more parameters when you need them, fewer when you don't, and the lawyers stay asleep. Read the full drop on the Qwen blog or pull weights from the HF collection.

Performance & Efficiency: "Sonnet at Home"?

The benchmarks are where things get really exciting.
* The 235B MoE rivals or surpasses models like DeepSeek-R1 (which rocked the boat just months ago!), O1, O3-mini, and even Gemini 2.5 Pro on coding and math.
* The 4B dense model incredibly beats the previous generation's 72B Instruct model (Qwen 2.5) on multiple benchmarks! 🤯
* The 30B MoE (with only 3B active parameters) is perhaps the star. Nisten pointed out people are getting 100+ tokens/sec on MacBooks. Wolfram achieved an 80% MMLU Pro score locally with a quantized version. The efficiency math is crazy – hitting Qwen 2.5 performance with only ~10% of the active parameters.

Nisten dubbed the larger model "Sonnet 3.5 at home," and while acknowledging Sonnet still has an edge in complex "vibe coding," the performance, especially in reasoning and tool use, is remarkably close for an open model you can run yourself.

I ran the 30B MoE (3B active) locally using LM Studio (shoutout for day-one support!) through my Weave evaluation dashboard (Link). On a set of 20 hard reasoning questions, it scored 43%, beating GPT 4.1 mini and nano, and getting close to 4.1 – impressive for a 3B active parameter model running locally!
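To make the soft switch concrete, here's a minimal sketch of toggling thinking per request against a local OpenAI-compatible server (the kind LM Studio or vLLM expose). The base URL, port, and model id are assumptions; point them at whatever your local server reports:

```python
# Minimal sketch of Qwen 3's "/think" / "/no_think" soft switch via a local
# OpenAI-compatible endpoint (assumptions: an LM Studio/vLLM-style server on
# localhost:1234 and the model id string; adjust both for your setup).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")


def ask(question: str, think: bool) -> str:
    # Appending "/think" or "/no_think" to the user turn toggles chain-of-thought.
    suffix = "/think" if think else "/no_think"
    resp = client.chat.completions.create(
        model="qwen3-30b-a3b",
        messages=[{"role": "user", "content": f"{question} {suffix}"}],
    )
    return resp.choices[0].message.content


print(ask("What is 17 * 24?", think=False))               # quick answer, no trace
print(ask("Plan a 3-leg trip under $500.", think=True))    # expanded reasoning
```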
Phi-4-Reasoning — 14B That Punches at 70B+

Microsoft's Phi team layered 1.4M chain-of-thought traces plus a dash of RL onto Phi-4 to finally ship a reasoning Phi, releasing two MIT-licensed checkpoints:
* Phi-4-Reasoning (SFT)
* Phi-4-Reasoning-Plus (SFT + RL)

Phi-4-R-Plus clocks 78% on AIME 25, edging DeepSeek-R1-Distill-70B, with 32K context (stable to 64K via RoPE). Scratch-pads hide inside dedicated thinking tags. Full details live in Microsoft's tech report and HF weights.

It's fascinating to see how targeted training on reasoning traces and a small amount of RL can elevate a relatively smaller model to compete with giants on specific tasks.

Other Open Source Updates
* MiMo-7B: Xiaomi entered the ring with a 7B parameter, MIT-licensed model family, trained on 25T tokens and featuring rule-verifiable RL. (HF model hub)
* Helium-1 2B: KyutAI (known for their Moshi voice model) released Helium-1, a 2B parameter model distilled from Gemma-2-9B, focused on European languages, and licensed under CC-BY 4.0. They also open-sourced 'dactory', their data processing pipeline. (Blog, Model (2B), Dactory pipeline)
* Qwen 2.5 Omni 3B: Alongside Qwen 3, the Qwen team also updated their existing Omni model with a 3B version that retains 90% of the comprehension of its big brother with a 50% VRAM drop! (HF)
* JetBrains open sources Mellum: Trained on over 4 trillion tokens with a context window of 8192 tokens across multiple programming languages; they haven't released any comparable eval benchmarks though. (HF)

Big Companies & APIs: Drama, Departures, and Deployments

While open source stole the show, the big players weren't entirely quiet... though maybe some wish they had been.

Farewell, GPT-4: Rest In Prompted 🙏

Okay folks, let's take a moment. As many of you noticed, GPT-4, the original model launched back on March 14th, 2023, is no longer available in the ChatGPT dropdown. You can't select it, you can't chat with it anymore.

For us here at ThursdAI, this feels significant. GPT-4's launch was the catalyst for this show. We literally started on the same day. It represented such a massive leap from GPT-3.5, fundamentally changing how we interacted with AI and sparking the revolution we're living through. Nisten recalled the dramatic improvement it brought to his work on Dr. Gupta, the first AI doctor on the market.

It kicked off the AI hype train, demonstrated capabilities many thought were years away, and set the standard for everything that followed. While newer models have surpassed it, its impact is undeniable.

The community sentiment was clear: Leak the weights, OpenAI! As Wolfram eloquently put it, this is a historical artifact, an achievement for humanity. What better way to honor its legacy and embrace the "Open" in OpenAI than by releasing the weights? It would be an incredible redemption arc.

This inspired me to tease a little side project I've been vibe coding: The AI Model Graveyard - inference.rip. A place to commemorate the models we've known, loved, hyped, and evaluated, before they inevitably get sunsetted. GPT-4 deserves a prominent place there. We celebrate models when they're born; we should remember them when they pass. (GPT-4.5 is likely next on the chopping block, by the way.)
It's not ready yet, still vibe coding (fighting with Replit), but it'll be up soon and I'll be sure to commemorate every model that's dying there!

So, pour one out for GPT-4. You changed the game. Rest In Prompt 🪦.

The ChatGPT "Glazing" Incident: A Cautionary Tale

Speaking of OpenAI... oof. The last couple of weeks saw ChatGPT exhibit some... weird behavior. Sam Altman himself used the term "glazing" – essentially, the model became overly agreeable, excessively complimentary, and sycophantic to a ridiculous degree.

Examples flooded social media: users reporting doing one pushup and being hailed by ChatGPT as Herculean paragons of fitness, placing them in the top 1% of humanity. Terrible business ideas were met with effusive praise and encouragement to quit jobs.

This wasn't just quirky; it was potentially harmful. As Yam pointed out, people use ChatGPT for advice on serious matters, tough conversations, and personal support. A model that just mindlessly agrees and validates everything, no matter how absurd, isn't helpful – it's dangerous. It undermines trust and critical thinking.

The community backlash was swift and severe. The key issue, as OpenAI admitted in their Announcement and AMA with Joanne Jang (Head of Model Behavior), seems to stem from focusing too much on short-term engagement feedback and not fully accounting for long-term user interaction, especially with memory now enabled.

In an unprecedented move, OpenAI rolled back the update. I honestly can't recall them ever publicly rolling back a model behavior change like this before. It underscores the severity of the issue.

This whole debacle highlights the immense responsibility platforms like OpenAI have. When your model is used by half a billion people daily, including for advice and support, haphazard releases that drastically alter its personality without warning are unacceptable. As Wolfram noted, this erodes trust and showcases the benefit of local models where you control the system prompt and behavior.

My takeaway? Critical thinking is paramount. Don't blindly trust AI, especially when it's being overly complimentary. Get second opinions (from other AIs, and definitely from humans!). I hope OpenAI takes this as a serious lesson in responsible deployment and testing.

BREAKING NEWS: Claude.ai will support tools via MCP

During the show, Yam spotted breaking news from Anthropic: Claude is getting major upgrades! (Tweet)

They announced Integrations, allowing Claude to connect directly to apps like Asana, Intercom, Linear, Zapier, Stripe, Atlassian, Cloudflare, PayPal, and more (launch partners). Developers can apparently build their own integrations quickly too. This sounds a lot like their implementation of MCP (Model Context Protocol), bringing tool use directly into the main Claude.ai interface (previously limited to Claude Desktop and to non-remote MCP servers). This feels like a big deal!
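For a feel of what's on the developer side of one of these integrations, here's a minimal MCP server sketch using the official Python SDK's FastMCP helper, as I understand its quickstart. The server name and tool are made up, and a hosted Claude.ai Integration would still need remote transport and auth on top of this:

```python
# Minimal MCP server sketch (assumptions: the official `mcp` Python SDK's FastMCP
# quickstart interface; the server name, tool, and data are illustrative).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ticket-tracker")


@mcp.tool()
def open_tickets(project: str) -> list[str]:
    """Return the open tickets for a project (stubbed data for the sketch)."""
    fake_db = {"thursdai": ["#12 fix RSS feed", "#15 add transcript search"]}
    return fake_db.get(project.lower(), [])


if __name__ == "__main__":
    # The default stdio transport is what local MCP clients (e.g. Claude Desktop) speak.
    mcp.run()
```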
Google Updates & LlamaCon Recap
* Google: NotebookLM's AI audio overviews are now multilingual (50+ languages!) (X Post). Gemini 2.5 Flash (the faster, cheaper model) was released shortly after our last show, featuring hybrid reasoning with an API knob to control thinking depth. Rumors are swirling about big drops at Google I/O soon!
* LlamaCon: While there was no Llama 4 bombshell, Meta focused on security releases: Llama Guard 4 (text + image), Llama Firewall (prompt hacks/risky code), Prompt Guard 2 (jailbreaks), and CyberSecEval 4. Zuck confirmed on the Dwarkesh podcast that thinking models are coming, a new Meta AI app with a social feed is planned, a full-duplex voice model is in the works, and a Llama API (powered by Groq and others) is launching.

This Week's Buzz from Weights & Biases 🐝

Quick updates from my corner at Weights & Biases:
* WeaveHacks Hackathon (May 17-18, SF): Get ready! We're hosting a hackathon focused on Agent Protocols – MCP and A2A. Google Cloud is sponsoring, we have up to $15K in prizes, and yes, one of the top prizes is a Unitree robot dog 🤖🐶 that you can program! (I expensed a robot dog, best job ever!). Folks from the Google A2A team will be there too. Come hack with us in SF! Apply here. It's FREE to participate!
* Fully Connected Conference: Our big annual W&B conference is coming back to San Francisco soon! Expect amazing speakers (last year, Meta announced Llama 3!). Tickets are out: fullyconnected.com.

Evals Deep Dive with Hamel Husain & Shreya Shankar

Amidst all the model releases and drama, we were incredibly lucky to have two leading experts in AI evaluation, Hamel Husain (@HamelHusain) and Shreya Shankar (@sh_reya), join us.

Their core message? Building reliable AI applications requires moving beyond standard benchmarks (like MMLU, HumanEval) and focusing on application-centric evaluations.

Key Takeaways:
* Foundation vs. Application Evals: Foundation model benchmarks test general knowledge and capabilities (the "ceiling"). Application evals focus on specific use cases, targeting reliability and identifying bespoke failure modes (like tone, hallucination on specific entities, instruction following) – aiming for 90%+ accuracy on your task.
* Look At Your Data! This was the mantra. Off-the-shelf metrics (hallucination score, toxicity) can be misleading. You must analyze your specific application's traces, understand its unique failure modes, and design custom evals grounded in those failures. It's detective work.
* PromptEvals Release: Shreya discussed their new work, PromptEvals (NAACL paper, Dataset, Models). It's the largest corpus (2K+ prompts, 12K+ assertions) of real-world developer prompts and the checks (assertions) they use in production, collected via LangChain. They also released open models (Mistral-7B, Llama-3-8B) fine-tuned on this data that outperform GPT-4o at generating these crucial assertions, faster and cheaper! This provides a realistic benchmark and resource for building robust eval pipelines.
* Benchmark Gaming & Eval Complexity: We touched upon the dangers of optimizing for static benchmarks (like the Chatbot Arena issues) and the inherent complexity of evaluation – even human preferences change over time ("Who validates the validators?"). Meta-evaluation is crucial.
* Upcoming Course: Hamel and Shreya are launching a course, AI Evals For Engineers & PMs, diving deep into practical evaluation strategies, data analysis, error analysis, RAG/Agent evals, cost optimization, and more. ThursdAI listeners get a 35% discount using code thursdai! (Link). I'm thrilled to be a guest speaker too!
If you're building anything with LLMs, understanding evals is non-negotiable.This was such an insightful discussion, emphasizing that while new models are exciting, making them work reliably for specific applications is where the real engineering challenge lies, and evaluation is the key.Vision & Video: Runway Gets ConsistentThe world of AI video generation continues its rapid evolution.Runway References: Consistency UnlockedA major pain point in AI video has been maintaining consistency – characters changing appearance, backgrounds morphing frame-to-frame. Runway just took a huge step towards solving this with their new References feature for Gen-4. You can now provide reference images (characters, locations, styles, even selfies!) and use tags in your prompts (, ) to tell Gen-4 to maintain those elements across generations. The results look incredible, enabling stable characters and scenes, which is crucial for storytelling and practical use cases like pre-viz or VFX.AI Art & DiffusionHiDream E1: Open Source Ghibli StyleA new contender in open-source image generation emerged: HiDream E1. (HF Link) This model, from Vivago.ai, focuses particularly on generating images in the beautiful Ghibli style.The weights are available (looks like Apache 2.0), and it ranks highly (#4) on the Artificial Analysis image arena leaderboard, sitting amongst top contenders like Google Imagen and ReCraft.Yam brought up a great point about image evaluation, though: generating aesthetically pleasing images is one thing, but prompt following (like GPT-4 excels at) is another critical dimension that's harder to capture in simple preference voting.Final Thoughts: Responsibility & Critical ThinkingPhew! What a week. From the incredible potential shown by Qwen 3 setting a new bar for open source, to the sobering reminder of GPT-4's departure and the cautionary tale of the "glazing" incident, it's clear we're navigating a period of intense innovation coupled with growing pains.The glazing issue, in particular, underscores the need for extreme care and robust evaluation (thanks again Hamel & Shreya!) when deploying models that interact with millions, potentially influencing decisions and well-being. As AI becomes more integrated into our lives – helping us boil eggs (yes, I ask it stupid questions too!), offering support, or even suggesting purchases – we must remain critical thinkers.Don't outsource your judgment entirely. Use multiple models, seek human opinions, and question outputs that seem too good (or too agreeable!) to be true. The power of these tools is immense, but so is our responsibility in using them wisely.Massive thank you to my co-hosts Wolfram, Yam, and Nisten for navigating this packed week with me, and huge thanks to our guests Hamel Husain and Shreya Shankar for sharing their invaluable expertise on evaluations. And of course, thank you to this amazing community – hitting 1000 listeners! – for tuning in, commenting, and sharing breaking news. Your engagement fuels this show!🔗 Subscribe to our show on Spotify: thursdai.news/spotify🔗 Apple: thursdai.news/apple🔗 Youtube: thursdai.news/yt (get in before 10K!)And for the full show notes and links visit👉 thursdai.news/may-1 👈We'll see you next week for another round of ThursdAI!Alex out. Bye bye!ThursdAI - May 1, 2025 - Show Notes and Links* Show Notes* MCP/A2A Hackathon - with A2A team and awesome judges! 
🤖🐶 (lu.ma/weavehacks)
* FullyConnected - Weights & Biases flagship 2-day conference (fullyconnected.com)
* Course - AI Evals For Engineers & PMs by Shreya Shankar & Hamel Husain (link, promo code for 35% off for listeners of ThursdAI - thursdai)
* Hosts and Guests
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
* Co Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed
* Hamel Husain - @HamelHusain
* Shreya Shankar - @sh_reya
* Open Source LLMs
* Alibaba drops Qwen 3 - 2 MoEs, 6 dense (0.6B - 30B) (Blog, GitHub, HF, HF Demo, My tweet, Nathan breakdown)
* Microsoft - Phi-4-reasoning 14B + Plus (X, ArXiv, Tech Report, HF 14B SFT)
* MiMo-7B — Xiaomi's MIT-licensed model (HF)
* KyutAI - Helium-1 2B (Blog, Model (2B), Dactory pipeline)
* Qwen 2.5 Omni updated (X)
* Big CO LLMs + APIs
* GPT-4 RIP - no longer in dropdown (RIP)
* Google - NotebookLM AI overviews are now multilingual (X)
* LlamaCon updates (X)
* OpenAI ChatGPT "glazing" update - revert back and why it matters (Announcement, AMA)
* Chatbot Arena Under Fire — "Leaderboard Illusion" vs. LMArena (Paper, Reply)
* This week's Buzz
* MCP/A2A Hackathon - with A2A team and awesome judges! 🤖🐶 (lu.ma/weavehacks)
* FullyConnected - Weights & Biases flagship 2-day conference (fullyconnected.com)
* Vision & Video
* Runway References - consistency in video generation (X)
* AI Art & Diffusion & 3D
* HiDream E1 (HF)
* Agents, Tools & Interviews
* OpenPipe - ART·E open-source RL-trained email research agent (X, Blog | GitHub | Launch thread)
* PromptEvals - Interview with Shreya Shankar (NAACL paper | Dataset | Models)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:30:21
  • ThursdAI - Apr 23rd - GPT Image & Grok APIs Drop, OpenAI ❤️ OS? Dia's Wild TTS & Building Better Agents!
    Hey everyone, Alex here 👋Welcome back to ThursdAI! After what felt like ages of non-stop, massive model drops (looking at you, O3 and GPT-4!), we finally got that "chill week" we've been dreaming of since maybe... forever? It seems the big labs are taking a breather, probably gearing up for even bigger things next week (maybe some open source 👀).But "chill" doesn't mean empty! This week was packed with fascinating developments, especially in the open source world and with long-awaited API releases. We actually had time to dive deeper into things, which was a refreshing change. We had a fantastic lineup of guests joining us too: Kwindla Kramer (@kwindla), our resident voice expert, dropped in to talk about some mind-blowing TTS and her own open-source VAD release. Maziyar Panahi (@MaziyarPanahi) gave us the inside scoop on OpenAI's recent meeting with the open source community. And Dex Horthy (@dexhorthy) from HumanLayer shared some invaluable insights on building robust AI agents that actually work in the real world. It was great having them alongside the usual ThursdAI crew: LDJ, Yam, Wolfram, and Nisten!So, instead of rushing through a million headlines, we took a more relaxed pace. We explored NVIDIA's cool new Describe Anything model, dug into Google's Quantization Aware Training for Gemma, celebrated the much-anticipated API release for OpenAI's GPT Image generation (finally!), checked out the new Grok API, got absolutely blown away by a tiny, open-source TTS model from Korea called Dia, and debated the principles of building better AI agents. Plus, a surprise drop from Send AI with a powerful video model!Let's dive in!Open Source AI Highlights: Community, Vision, and EfficiencyEven with the big players quieter on the model release front, the open source scene was buzzing. It feels like this "chill" period gave everyone a chance to focus on refining tools, releasing datasets, and engaging with the community.OpenAI Inches Closer to Open Source? Insights from the Community MeetingPerhaps the biggest non-release news of the week was OpenAI actively engaging with the open source community. Friend of the show Maziyar Panahi was actually in the room (well, the Zoom room) and joined us to share what went down It sounds like OpenAI came prepared, with Sam Altman himself spending significant time answering questions . Maziyar gave us the inside scoop, mentioning that OpenAI's looking to offload some GPU pressure by embracing open source – a win-win where they help the community, and the community helps lighten their load. He painted a picture of a company genuinely trying to listen and figure out how to best contribute. It felt less like a checkbox exercise and more like genuine engagement, which is awesome to see.What did the community ask for? Based on Maziyar's recap, there was a strong consensus on several key points:* Model Size: The sweet spot seemed to be not tiny, but not astronomically huge either. Something in the 70B-200B parameter range that could run reasonably on, say, 4 GPUs, leaving room for other models. People want power they can actually use without needing a supercomputer.* Capabilities: A strong desire for reliable structured output. Surprisingly, there was less emphasis on complex, built-in reasoning, or at least the ability to toggle reasoning off. This likely stems from practical concerns about cost and latency in production environments. 
The community seems to value control and efficiency for specific tasks.* Multilingual: Good support for European languages (at least 20) was a major request, reflecting the global nature of the open source community. Needs to be as good as English support.* Base Models: A huge ask was for OpenAI to release base models. The reasoning? Empower the community to handle fine-tuning for specific tasks like coding, roleplay, or supporting underrepresented languages . Let the experts in those niches build on a solid foundation.* Focus: Usefulness over chasing leaderboard glory. The community urged OpenAI to provide a solid, practical model rather than aiming for a temporary #1 spot that gets outdated in days or weeks . Stability, reliability, and long-term utility were prized over fleeting benchmark wins.* Safety: A preference for separate guardrail models (similar to LlamaGuard or GemmaGuard) rather than overly aligning the main model, which often hurts performance and flexibility . Give users the tools to implement safety layers as needed, rather than baking in limitations that might stifle creativity or utility.Perhaps most excitingly, Maziyar mentioned OpenAI seemed committed to regular open model releases, not just a one-off thin=! This, combined with recent moves like approving a community Pull Request to make their open-source Codex agent work with non-OpenAI models (as Yam Peleg excitedly pointed out!), suggests a potentially significant shift. Remember, it's been a long time since GPT-2 and Whisper were OpenAI's main open contributions! We're definitely watching this space closely. Huge shout out to OpenAI for listening and engaging with the builders.ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.NVIDIA's DAM: Describe Anything Model (and Dataset!)NVIDIA dropped something really cool this week: the Describe Anything Model (DAM), specifically DAM-3B, a 3 billion parameter multimodal model for region-based image and video captioning. Think Meta's Segment Anything (SAM), but instead of just segmenting, it also tells you what you've segmented, in detail.We played around with the image demo on the show (HF demo) . You hover over an image, things get segmented on the fly (you can use points, boxes, scribbles, or masks), you click, and boom – a detailed description pops up for that specific region: "A brown bear with a thick, dense coat of fur..." . It's pretty slick and responsive!While the demo didn't showcase video, the project page (X post) shows it working on videos too (DAM-3B-Video), tracking and describing objects like fish even as they move. This capability really impressed Yam, who rightly pointed out that tracking objects consistently over video is hard, so having a base model that understands this level and embeds it in language is seriously impressive. The model uses a "focal prompt" and gated cross-attention to fuse the full scene context with the selected region.Nisten reminded us that our friend Piotr Skalski from Roboflow basically built a pipeline for this a while back by combining SAM with description models like Microsoft Florence . 
But DAM integrates it all into one efficient 3B parameter model (HF model), setting a new state-of-the-art on their introduced DLC-Bench (Detailed Localized Captioning).Crucially, NVIDIA didn't just drop the model; they also released the Describe Anything Dataset (HF dataset) used to train it (built on subsets like COCO, Paco, SAM) and the code under a research-only license. This is fantastic for researchers and builders. Imagine using this for precise masking before sending an image to the new GPT Image API for editing – super useful! Big props to NVIDIA and their collaborators at UC Berkeley and UCSF for this contribution.Gemma Gets Quantization Aware Training (QAT): Smaller Footprint, Sassy Attitude?Google also pushed the open source envelope by releasing Gemma models trained with Quantization Aware Training (QAT). This isn't your standard post-training quantization; QAT involves incorporating the impact of quantization during the training process itself. As LDJ explained, this allows the model to adapt, potentially resulting in a quantized state with much higher quality and less performance degradation compared to just quantizing a fully trained model afterwards.The results? Significant reductions in VRAM requirements across the board. The 27B parameter Gemma 3, for example, drops from needing a hefty 54GB to just 14.1GB ! Even the 1B model goes from 2GB to just half a gig. This makes running these powerful models much more accessible on consumer hardware. Folks are already running them in MLX, llama.cpp, LM Studio, etc. (Reddit thread)Wolfram already took the 4B QAT model for a spin using LM Studio . The good news: it ran easily, needing only 5-6GB of RAM. The quirky news: it seemed to struggle a bit with prompt adherence in his tests, even giving Wolfram a sassy, winking-emoji response about ignoring the "fine print" in his complex system prompt when called out on a language switching error: "Who reads a fine print? 😉" ! He did note Gemma 3 now supports system prompts (unlike Gemma 2), which is a definite improvement .(While NVIDIA also released OpenMath Nemotron, we didn't dive deep in the show, but worth noting its AIMO win and accompanying open dataset release!)Voice and Audio Innovations: Emotional TTS and Smarter ConversationsEven in a "chill" week, the audio space delivered some serious excitement. Kwindla Kramer joined us to break down two major developments.Dia TTS: Unhinged Emotion from a Small Open Model 🤯This one absolutely blew up Twitter, and for good reason. Dia, from Nari Labs (essentially a student and a half in Korea!), is a 1.6 billion parameter open-weights (MIT licensed) text-to-dialogue model (Github, HF). What makes it special? The insane emotional range and natural interaction patterns. My Twitter post about it (X post) went viral, getting half a million views !We played some examples, and they are just wild. You have to hear this to believe it:* Check the Demos: Dia Demo Page | Fal.ai Voice Clone DemoAnother crazy thing is how it handles non-verbal cues like laughs or coughs specified in the text (e.g., (laughs)) . Instead of just tacking on a generic sound, it inflects the preceding words leading into the laugh, making it sound incredibly natural. It even handles interruptions seamlessly, cutting off one speaker realistically when another starts .Kwin, our voice expert, offered some valuable perspective . While Dia is undeniably awesome and shows what's possible, it's very much a research model – likely unpredictable ("unhinged" was his word!) 
and probably required cherry-picking the best demos. Production models like 11Labs need predictability. Kwin also noted the dataset is probably scraped from YouTube (a common practice, explaining the lack of open audio data) and that the non-speech sounds are a key takeaway – the bar for TTS is rising beyond just clear speech .PipeCat SmartTurn: Fixing Awkward AI Silences with Open Source VADSpeaking of open audio, Kwin and the team at Daily/Pipecat had their own breaking news: they released an open-source checkpoint for their SmartTurn model – a semantic Voice Activity Detection (VAD) system (Github, HF Model) What's the problem SmartTurn solves? That annoying thing where voice assistants interrupt you mid-thought just because you paused for a second. I've seen this happen with my kids all the time, making interaction frustrating! Semantic VAD, or "Smart Turn," is much smarter. It considers not just silence but also the context – audio patterns (like intonation suggesting you're not finished) and linguistic cues (like ending on "and..." or "so...") to make a much better guess about whether you're truly done talking. This is crucial for natural-feeling voice interactions, especially for kids or multilingual speakers (like me!) who might pause more often to find the right word.And the data part is key here. They're building an open dataset for this, hosted on Hugging Face. You can even contribute your own voice data by playing simple games on their turn-training.pipecat.ai site (Try It Demo)! The cool incentive? The more diverse voice data they get (especially for different languages!), the better these systems will work for everyone. If your voice is in the dataset, future AI agents might just understand you a little better!Kwin also mentioned their upcoming Voice AI Course co-created with friend-of-the-pod Swyx, hosted on Maven . It aims to be a comprehensive guide with code samples, community interaction, and insights from experts (including folks from Weights & Biases!). Check it out if you want to dive deep into building voice AI. AI Art & Diffusion & 3D: Quick HitsA slightly quieter week for major art model releases, but still some significant movement:* OpenAI's GPT Image 1 API: We'll cover this in detail in the Big Companies section below, but obviously relevant here too as a major new tool for developers creating AI art and image editing applications .* Hunyuan 3D 2.5 (Tencent): Tencent released an update to their 3D generation model, now boasting 10 billion parameters (up from 1B!) . They're highlighting massive leaps in precision (1024-resolution geometry), high-quality textures with PBR support, and improved skeletal rigging for animation X Post. Definitely worth keeping an eye on as 3D generation matures and becomes more accessible (they doubled the free quota and launched an API).Agent Development Insights: Building Robust Agents with Dex HorthyWith things slightly calmer, it was the perfect time to talk about AI agents – a space buzzing with activity, frameworks, and maybe even a little bit of drama. We brought in Dex Horthy, founder of HumanLayer and author of the insightful "12 Factor Agent" essay (Github Repo), to share his perspective on what actually works when building agents for production.Dex builds SDKs to help create agents that feel more like digital humans, aiming to deploy them where users already are (Slack, email, etc.), moving beyond simple chat interfaces. 
His experience led him to identify common patterns and pitfalls when trying to build reliable agents.The Problem with Current Agent FrameworksA key takeaway Dex shared? Many teams building serious, production-ready agents end up writing large parts from scratch. Why? Because existing frameworks often fall short in providing the necessary control and reliability for complex tasks. The common "prompt + bag of tools + figure it out" approach, while great for demos, struggles with reliability over longer, multi-step workflows . Think about it: even if each step is 92% reliable, after 10 steps, your overall success rate plummets due to compounding errors. That's just not good enough for customer-facing applications.Key Principles: Small Agents, Owning ContextSo, what does work today according to Dex's 12 factors?* Small, Focused Agents: Instead of one giant, monolithic agent trying to do everything, the more reliable approach is to build smaller "micro-agents" that handle specific, well-defined parts of a workflow ]. As models get smarter, these micro-agents might grow in capability, but the principle of breaking down complexity holds. Find something at the edge of the model's capability and nail it consistently .* Own Your Prompts & Context: Don't let frameworks abstract away control over the exact tokens going into the LLM or how the context window is managed. This is crucial for performance tuning. Even with massive context windows (like Gemini's 2M tokens), smaller, carefully curated context often yields better results and lower costs . Maximum performance requires owning every single token.Dex's insights provide a crucial dose of pragmatism for anyone building or thinking about building AI agents in this rapidly evolving space. Check out his full 12 Factor Agent essay and the webinar recording for a deeper dive.Big Companies & APIs: GPT Image and Grok Get Developer AccessWhile new foundation models were scarce from the giants this week, they did deliver on the API front, opening up powerful capabilities to developers.OpenAI Finally Releases GPT Image 1 API! (X Post)This was a big one many developers were waiting for. OpenAI's powerful image generation capabilities, previously locked inside ChatGPT, are now available via API under the official name gpt-image-1 (Docs) . No more awkward phrasing like "the new image generation capabilities within chat gpt"!Getting access requires organizational verification, which involved a slightly intense biometric scan process for me – feels like they're taking precautions given the model's realism and potential for misuse . Understandable, but something developers need to be aware of .The API (API Reference) offers several capabilities:* Generations: Creating images from scratch based on text prompts.* Edits: Modifying existing images using a new prompt, crucially supporting masking for partial edits. This is huge for targeted changes and perfect for combining with segmentation models like NVIDIA's DAM!There's a nice playground interface in the console, and you have interesting controls over the output:* Quality: Instead of distinct models, you select a quality level (standard/HD) which impacts the internal "thinking time" and cost . It seems to be a reasoning model under the hood, so quality relates to compute/latency.* Number: Generate up to 10 images at once.* Transparency: Supports generating images with transparent backgroundsI played around with it, generating ads and even trying to get it to make a ThursdAI thumbnail with my face. 
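To make the shape of these calls concrete, here's a minimal sketch using the OpenAI Python SDK. The prompts, file names, and the couple of parameters I show are illustrative, not the full surface area; check the API reference for everything else:

```python
# Minimal sketch of the gpt-image-1 "generations" and "edits" calls described
# above (assumptions: file names and prompts are illustrative; see the API
# reference for the full parameter list, e.g. sizes, quality, background).
import base64
from openai import OpenAI

client = OpenAI()

# 1) Generation: text prompt in, base64-encoded image out.
gen = client.images.generate(
    model="gpt-image-1",
    prompt="A retro poster for a weekly AI news show called ThursdAI",
    size="1024x1024",
)
open("poster.png", "wb").write(base64.b64decode(gen.data[0].b64_json))

# 2) Edit: original image + mask (transparent pixels mark the editable region).
edit = client.images.edit(
    model="gpt-image-1",
    image=open("poster.png", "rb"),
    mask=open("mask.png", "rb"),
    prompt="Replace the masked area with a microphone icon",
)
open("poster_edited.png", "wb").write(base64.b64decode(edit.data[0].b64_json))
```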
The text generation is excellent – it nailed "ThursdAI" perfectly on an unhinged speaker ad Nisten prompted! It follows complex style prompts well.However, generating realistic faces, especially matching a specific person like me, seems... really hard right now . Even after many attempts providing a source image and asking it to replace a face, the results were generic or only vaguely resembled me. It feels almost intentionally nerfed, maybe as a safety measure to prevent deepfakes? I still used it for the thumbnail, but yeah, it could be better on faces.OpenAI launched with several integration partners like Adobe, Figma, Wix, HeyGen, and Fal.ai already onboard. Expect to see these powerful image generation capabilities popping up everywhere!Grok 3 Mini & Grok 3 Now Available via API (+ App Updates)Elon's xAI also opened the gates this week, making Grok 3 Mini and Grok 3 available via API (Docs).The pricing structure is fascinating and quite different from others. Grok 3 Mini is incredibly cheap for input ($0.30 / 1M tokens) with only a modest bump for output ($0.50 / 1M). The "Fast" versions, however, cost significantly more, especially for output tokens (Grok 3 Fast is $5 input / $25 output per million!) . It seems like a deliberate play on the "fast, cheap, smart" triangle, giving developers explicit levers to pull based on their needs.Benchmarks provided by xAI position Grok 3 Mini competitively against other small models like Gemini 2.5 Flash and O4 Mini, scoring well on AIME (93%) and coding benchmarks. Speaking of the app, the iOS version got a significant update adding a live video view (let Grok see what you see through your camera) and multilingual audio support (X Post) . Prepare for some potentially unhinged, real-time video roasting if you use the fun mode with the camera on ! Multilingual audio and search are also rolling out to SuperGrok users on Android.(Side note: We briefly touched on O3's recent wonkiness in following instructions for tone, despite its amazing GeoGuessr abilities! Something feels off there lately.)Vision and Video: Send AI's Surprise Release & MoreJust when we thought the week was winding down on model releases...Send AI Drops MAGI-1: 24B Video Model with Open Weights! 🔥Out of seemingly nowhere, a company called Send AI released details (and then the weights!) for MAGI-1, a 24 billion parameter autoregressive diffusion model for video generation (X Post, GitHub, PDF Report).The demos looked stunning, showcasing impressive long-form video generation with remarkable character consistency – often the Achilles' heel of AI video . Nisten speculated this could be a major step towards usable AI-generated movies, solving the critical face/character consistency problem . They achieve this by predicting video in 24-frame chunks with causal attention between them, allowing for real-time streaming generation where compute doesn't scale with length. They also highlighted an "infinite extension" capability, allowing users to build out longer scenes by injecting new prompts or continuing footage.Their technical report dives into the architecture, mentioning novel techniques like a custom "MagiAttention" kernel that scales to massive contexts and helps achieve the temporal consistency. It also sets SOTA on VBench-I2V and Physics-IQ benchmarks.And the biggest surprise? They released the model weights under an Apache 2.0 license on Hugging Face ! This is huge! 
Just as we sometimes lament the lack of open source momentum from certain players, Send AI drops this 24B parameter beast with open weights. Amazing! Go download it!

Framepack: Long Videos on Low VRAM

Wolfram also flagged FramePack, another interesting video development from the research world, from the creator of ControlNet. FramePack is a next-frame (next-frame-section) prediction neural network structure that generates videos progressively. (Github)

Character AI AvatarFX Steps In

Also in the visual space, Character AI announced AvatarFX in early access (Website), stepping into the realm of animated, speaking visual avatars derived from images. It seems like everyone wants to bring characters to life visually now.

This Week's Buzz from W&B / Community

Quick hits on upcoming events and community stuff:
* WeaveHacks Coming to SF! Mark your calendars! We're hosting a hackathon focused on building with W&B Weave at the Weights & Biases office in San Francisco on May 17th-18th [0:06:15]. If you're around, especially if you're coming into town for Google I/O the week after, come hang out, build cool stuff, and say hi! We're planning to go all out with sponsors and prizes (announcements coming soon). lu.ma/weavehacks
* Fully Connected Conference Reminder: Our flagship W&B conference, Fully Connected, is happening in San Francisco on June 18th [0:06:30]. It's where our customers, partners, and the community come together for two days of talks, workshops, and networking focused on production AI. It's always an incredible event. (fullyconnected.com)

Wrapping Up the "Chill" Week That Wasn't Quite Chill

Phew! See? Even a "chill" week in AI is overflowing with news when you actually have time to stop and breathe for a second. From OpenAI's fascinating open source tango and the practical (and long-awaited!) API releases of GPT Image and Grok, to the sheer creative potential shown by indie projects like Dia and Send AI's MAGI-1, and the grounding principles for building agents that actually work from Dex – there was a ton to absorb and discuss. It felt good to have the space to go a little deeper.

It was fantastic having Kwin, Maziyar, and Dex join the regulars (LDJ, Yam, Wolfram, Nisten) to share their expertise and firsthand insights. A huge thank you to them and to everyone tuning in live across X, YouTube, LinkedIn, and participating in the chat! Your questions and comments make the show what it is.

Don't forget, if you missed anything, the full show is available as a podcast (search "ThursdAI" wherever you get your podcasts).
🔗 Subscribe to our show on Spotify: thursdai.news/spotify
🔗 Apple: thursdai.news/apple
🔗 Youtube: thursdai.news/yt

Next week? The rumors suggest the big labs might be back with major releases. The brief calm might be over! Buckle up! We'll be here to break it all down.

See you next ThursdAI!
- Alex

TL;DR and Show Notes (April 23rd, 2025)
* Hosts and Guests
* Alex Volkov - AI Evangelist & Weights & Biases @altryne
* Co Hosts - Wolfram Ravenwlf @WolframRvnwlf, Yam Peleg @yampeleg, Nisten Tahiraj @nisten, LDJ @ldjconfirmed
* Kwindla Kramer @kwindla - Daily Co-Founder // Voice expert
* Dexter Horthy @dexhorthy - HumanLayer // Agents expert
* Maziyar Panahi @MaziyarPanahi - OSS maintainer
* Open Source AI - LLMs, Vision, Voice & more
* OpenAI OSS Meeting: Insights from Maziyar [0:16:37]
* NVIDIA Describe Anything (DAM-3B): 3B param multimodal LLM for region-based image/video captioning. (X Post, HF model, HF demo)
(X Post, HF model, HF demo)* Google Gemma QAT: Quantization-Aware Training models (X, Blog) * Big CO LLMs + APIs* OpenAI GPT Image 1 API: (X Post, Docs, API Reference)* Grok API & App Updates: Grok 3 and Grok 3 Mini available via API. (API Docs, App Update X Post)* This weeks Buzz - Weights & Biases* WeaveHacks SF: Hackathon May 17-18 at W&B HQ. lu.ma/weavehacks * Fully Connected: W&B's 2-day conference, June 18-19 in SF fullyconnected.com* Vision & Video* Send AI MAGI-1: 24B autoregressive diffusion model for long, streaming video (X Post, GitHub, PDF Report, HF Repo)* Character AI AvatarFX: Early access for creating speaking/emoting avatars from images . (Website)* Framepack: Mentioned for long video generation (120s) on low VRAM (6GB). (Project Page)* Voice & Audio* Nari Labs Dia: 1.6B param OSS TTS model (X Post Highlight, HF Model, Github, Fal.ai Demo)* PipeCat Smart-Turn VAD: Open source semantic VAD model (Github, HF Model, Fal.ai Playground, Try It Demo)* AI Art & Diffusion & 3D* Hunyuan 3D 2.5 (Tencent): 10B param update [0:09:06]. Higher res geometry, PBR textures, improved rigging. (X Post)* Agents , Tools & Links* 12 Factor Agents: Discussion with Dex Horthy on building robust agents (Github Repo) This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:36:54
  • ThursdAI - Apr 17 - OpenAI o3 is SOTA llm, o4-mini, 4.1, mini, nano, G. Flash 2.5, Kling 2.0 and 🐬 Gemma? Huge AI week + A2A protocol interview
    Hey everyone, Alex here 👋Wow. Just… wow. What a week, folks. Seriously, this has been one for the books. This week was dominated by OpenAI's double whammy: first the GPT-4.1 family dropped with a mind-boggling 1 million token context window, followed swiftly by the new flagship reasoning models, o3 and o4-mini, which are already blowing minds with their agentic capabilities. We also saw significant moves from Google with VEO-2 going GA, the fascinating A2A protocol launch (we had an amazing interview with Google's Todd Segal about it!), and even an attempt to talk to dolphins with DolphinGemma. Kling stepped up its video game, Cohere dropped SOTA multimodal embeddings, and ByteDance made waves in image generation. Plus, the open-source scene had some interesting developments, though perhaps overshadowed by the closed-source giants this time.o3 has absolutely taken the crown as the conversation piece, so lets start with it (as always, TL;DR and shownotes at the end, and here's the embedding of our live video show) Big Company LLMs + APIsOpenAI o3 & o4‑mini: SOTA Reasoning Meets Tool‑Use (Blog, Watch Party)The long awaited o3 models (promised to us in the last days of x-mas) is finally here, and it did NOT disappoint and well.. even surprised! o3 is not only SOTA on nearly all possible logic, math and code benchmarks, which is to be expected from the top reasoning model, it also, and I think for the first time, is able to use tools during its reasoning process. Tools like searching the web, python coding, image gen (which it... can zoom and rotate and crop images, it's nuts) to get to incredible responses faster. Tool using reasoner are... almost AGI? This is the headline feature for me. For the first time, these o-series models have full, autonomous access to all built-in tools (web search, Python code execution, file search, image generation with Sora-Image/DALL-E, etc.). They don't just use tools when told; they decide when and how to chain multiple tool calls together to solve a problem. We saw logs with 600+ consecutive tool calls! This is agent-level reasoning baked right in.Anecdote: We tested this live with a complex prompt: "generate an image of a cowboy that on his head is the five last digits of the hexadecimal code of the MMMU score of the latest Gemini model." o3 navigated this multi-step task flawlessly: figuring out the latest model was Gemini 2.5, searching for its MMMU score, using the Python tool to convert it to hex and extract the digits, and then using the image generation tool. It involved multiple searches and reasoning steps. Absolutely mind-blowing 🤯.Thinking visually with imagesThis one also blew my mind, this model is SOTA on multimodality tasks, and a reason for this, is these models can manipulate and think about the images they received. Think... cropping, zooming, rotating. The models can now perform all these tasks to multimodal requests from users. Sci-fi stuff! Benchmark Dominance: As expected, these models crush existing benchmarks.o3 sets new State-of-the-Art (SOTA) records on Codeforces (coding competitions), SWE-bench (software engineering), MMMU (multimodal understanding), and more. It scored a staggering $65k on the Freelancer eval (simulating earning money on Upwork) compared to o1's $28k!o4-mini is no slouch either. It hits 99.5% on AIME (math problems) when allowed to use its Python interpreter and beats the older o3-mini on general tasks. 
It’s a reasoning powerhouse at a fraction of the cost.Incredible Long Context PerformanceYam highlighted this – on the Fiction Life benchmark testing deep comprehension over long contexts, o3 maintained nearly 100% accuracy up to 120,000 tokens, absolutely destroying previous models including Gemini 2.5 Pro and even the new GPT-4.1 family on this specific eval. While its context window is currently 200k (unlike 4.1's 1M), its performance within that window is unparalleled.Cost-Effective Reasoning: They're not just better, they're cheaper for the performance you get.* o3: $10 input / $2.50 cached / $40 output per million tokens.* o4-mini: $1.10 input / $0.275 cached / $4.40 output per million tokens. (Cheaper than GPT-4.0!)Compute Scaling Validated: OpenAI confirmed these models used >10x the compute of o1 and leverage test-time compute scaling (spending longer on harder problems), further proving their scaling law research.Memory Integration: Both models integrate with ChatGPT's recently upgraded memory feature which has access to all your previous conversations (which we didn't talk about but is absolutely amazing, try asking o3 stuff it knows about you and have ti draw conclusions!)Panel Takes & Caveats:While the excitement was palpable, Yam noted some community observations about potential "rush" – occasional weird hallucinations or questionable answers compared to predecessors, possibly a side effect of cramming so much training data. Nisten, while impressed, still found the style of GPT-4.1 preferable for specific tasks like generating structured medical notes in his tests. It highlights that benchmarks aren't everything, and specific use cases require evaluation (shameless plug: use tools like W&B Weave for this!).I'll add my own, I use all the models every week to help me draft posts, and o3 was absolute crap about matching my tone. % of what's written above it was able to mimic. Gemini remains undefeated for me and this task.Though, Overall, o3 and o4-mini feel like a paradigm shift towards more autonomous, capable AI assistants. The agentic future feels a whole lot closer.OpenAI Launches GPT-4.1 Family: 1 Million Tokens & Killing 4.5! (Our Coverage, Prompting guide)Before the o3 shockwave, Monday brought its own major AI update: the GPT-4.1 family. This was the API-focused release, delivering massive upgrades for developers.The Headline: One Million Token Context Window! 🤯 Yes, you read that right. All three new models – GPT-4.1 (flagship), GPT-4.1 mini (cheaper/faster), and GPT-4.1 nano (ultra-cheap/fast) – can handle up to 1 million tokens. This is a monumental leap, enabling use cases that were previously impossible or required complex chunking strategies.Key Details:Goodbye GPT-4.5! In a surprising twist, OpenAI announced they are deprecating the recently introduced (and massive) GPT-4.5 model within 90 days in the API. Why? Because GPT-4.1 actually outperforms it on key benchmarks like SW-Bench, Aider Polyglot, and the new long-context MRCR eval, while being far cheaper to run. It addresses the confusion many had: why was 4.5 seemingly worse than 4.1? It seems 4.5 was a scaling experiment, but 4.1 represents a more optimized, better-trained checkpoint on superior data. RIP 4.5, we hardly knew ye (in the API).The Prompt Sandwich Surprise! 🥪: This was wild. Following OpenAI's new prompting guide, I tested the "sandwich" technique (instructions -> context -> instructions again) on my hard reasoning eval using W&B Weave.For GPT-4.1, it made no difference (still got 48%). 
But for GPT-4.1 mini, the score jumped from 31% to 49% – essentially matching the full 4.1 model just by repeating the prompt! That's a crazy performance boost for a simple trick. Even nano saw a slight bump. Lesson: Evaluate prompt techniques! Don't assume they won't work.

Million-Token Recall Confirmed: Using Needle-in-Haystack and their newly open-sourced MRCR benchmark (Multi-round Co-reference Resolution – more in Open Source), OpenAI showed near-perfect recall across the entire 1 million token window for all three models, even nano! This isn't just a theoretical limit; the recall seems robust.

Multimodal Gains: Impressively, 4.1 mini hit 72% on Video-MME, pushing SOTA for long-video Q&A in a mid-tier model by analyzing frame sequences. 4.1 mini seems to be the absolute powerhouse of this release cycle: it nearly matches the intelligence of the previous 4o, while being significantly cheaper and much, much faster, with a 1M context window! Windsurf (and Cursor) immediately made the 4.1 family available, offering a free week for users to test them out (likely to gather feedback and maybe influenced by certain acquisition rumors 😉). Devs reported them feeling snappier and less verbose than previous models.

Who Should Use Which OpenAI API?

My initial take:
* For complex reasoning, agentic tasks, or just general chat: Use o3 (if you need the best) or o4-mini (for amazing value/speed).
* For API development, especially coding or long-context tasks: Evaluate the GPT-4.1 family. Start with 4.1 mini – it's likely the sweet spot for performance/cost, especially with smart prompting. Use 4.1 if mini isn't quite cutting it. Use nano for simple, high-volume tasks like translation or basic classification.

The naming is still confusing (thanks Nisten for highlighting the UI nightmare!), but the capability boost across the board is undeniable.

Hold the Phone! 🚨 Google Fires Back with Gemini 2.5 Flash in Breaking News

Just when we thought the week couldn't get crazier, Google, likely reacting to OpenAI's rapid-fire launches, just dropped Gemini 2.5 Flash into preview via the Gemini API (in AI Studio and Vertex AI). This feels like Google's direct answer, aiming to blend reasoning capabilities with speed and cost-effectiveness.

The Big Twist: Controllable Thinking Budgets!

Instead of separate models like OpenAI, Gemini 2.5 Flash tries to do both reasoning and speed/cost efficiency in one model. The killer feature? Developers can set a "thinking budget" (0 to 24,576 tokens) per API call to control the trade-off:
* Low/Zero Budget: Prioritizes speed and low cost (very cheap: $0.15 input / $0.60 output per 1M tokens), great for simpler tasks.
* Higher Budget: Allows the model multi-step reasoning "thinking" for better accuracy on complex tasks, at a higher cost ($3.50 output per 1M tokens, including reasoning tokens).

This gives granular control over the cost/quality balance within the same model.

Performance & Specs:

Google claims strong performance, ranking just behind Gemini 2.5 Pro on Hard Prompts in ChatBot Arena and showing competitiveness against o4-mini and Sonnet 3.7 in their benchmarks, especially given the flexible pricing. Key specs are right up there with the competition:
* Multimodal Input: Text, Images, Video, Audio
* Context Window: 1 million tokens (matching GPT-4.1!)
* Knowledge Cutoff: January 2025

How to Control Thinking:

Simply set the thinking_budget parameter in your API call (Python/JS examples available in their docs). If unspecified, the model decides automatically. A minimal sketch follows below.
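Here's roughly what that looks like with the google-genai Python SDK – a sketch based on the docs at the time of writing; the preview model id and the exact ThinkingConfig field names are assumptions on my part and may shift, so verify against the Gemini API docs before copying this anywhere:

```python
# Sketch: controlling Gemini 2.5 Flash's "thinking budget" per request.
from google import genai
from google.genai import types

client = genai.Client(api_key="GEMINI_API_KEY")

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",   # preview id at launch (assumed)
    contents="Plan a 3-day Tokyo itinerary on a $500 budget.",
    config=types.GenerateContentConfig(
        # 0 = skip thinking (fast/cheap path); up to 24,576 tokens of reasoning
        thinking_config=types.ThinkingConfig(thinking_budget=1024),
    ),
)
print(response.text)
```

Setting thinking_budget=0 should collapse it to the fast path (the $0.60/1M output price), while a bigger budget buys more multi-step reasoning on the hard prompts.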
My Take: This is a smart play by Google. The controllable thinking budget is a unique and potentially powerful feature for optimizing across different use cases without juggling multiple models. With 1M context and competitive pricing, Gemini 2.5 Flash is immediately a major contender in the ongoing AI arms race. Definitely one to evaluate! Find more in the developer docs and Gemini Cookbook.

Open Source: LLMs, Tools & more

OpenAI open sources MRCR eval and Codex (MRCR HF, Codex Github)

Let's face it, this isn't the open source OpenAI coverage I was hoping for. Sam promised us an open source model, and they are about to drop it – I'd assume close to Google I/O (May 20th) to steal thunder. But OpenAI did make open source waves this week in addition to the above huge stories. MRCR is a way to evaluate long-context complex tasks, and they have taken this Gemini research and open sourced a dataset for this eval. 👏 But also, they have dropped the Codex CLI tool, a coding partner using o4-mini and o3, and made that tool open source as well (unlike Anthropic with Claude Code), which in turn saw 86+ Pull Requests approved within the first 24 hours! The best part about this CLI is its hardened security: it uses Apple Seatbelt, which limits its execution to the current directory + temp files (on a Mac at least).

Other Open Source Updates

While OpenAI's contributions were notable, it wasn't the only action this week:
* Microsoft's BitNet v1.5 (HF): Microsoft quietly dropped updates to BitNet, continuing their exploration into ultra-low-bit (ternary) models for efficiency. As Nisten pointed out on the show though, keep in mind these still use some higher-precision layers, so they aren't purely 1.5-bit in practice just yet. Important research nonetheless!
* INTELLECT-2 Distributed RL (Blog, X): Prime Intellect did something wild – training INTELLECT-2, a 32B model, using globally distributed, permissionless reinforcement learning. Basically, anyone with a GPU could potentially contribute. Fascinating glimpse into decentralized training!
* Z.ai (Formerly ChatGLM) & GLM-4 Family (X, HF, GitHub): The team behind ChatGLM rebranded to Z.ai and released their GLM-4 family (up to 32B parameters) under the very permissive MIT license. They're claiming performance competitive with much larger models like Qwen 72B, which is fantastic news for commercially usable open source!

ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

This Week's Buzz: Playground Updates & A Deep Dive into A2A

On the Weights & Biases front, it's all about enabling developers to navigate this new model landscape.

Weave Playground Supports GPT-4.1 and o3/o4-mini (X)

With all these new models dropping, how do you actually choose which one is best for your application? You need to evaluate! Our W&B Weave Playground now has full support for the new GPT-4.1 family and the o3/o4-mini models. If you're using Weave to monitor your LLM apps in production, you can easily grab a trace of a real user interaction, open it in the Playground, and instantly retry that exact same call (with all its context and history) using any of the new models side-by-side. It's the fastest way to see how o3 compares to 4.1-mini or how Claude 3.7 stacks up against o4-mini on your specific data. Essential for making informed decisions in this rapidly changing environment.

Deep Dive: Understanding Google's A2A Protocol with Todd Segal

This was a highlight of the show for me.
We were joined by Todd Segal, a Principal Software Engineer at Google working directly on the new Agent-to-Agent (A2A) protocol. There was some confusion initially about how A2A relates to the increasingly popular Model Context Protocol (MCP), so getting Todd's perspective was invaluable. W&B is a proud launch partner for the A2A protocol!Key Takeaways from our Chat:* A2A vs. MCP: Complementary, Not Competitive: Todd was clear: Google sees these as solving different problems. MCP is for Agents talking to Tools (structured, deterministic capabilities). A2A is for Agents talking to other Agents (unstructured, stateful, unpredictable, evolving interactions). Think of MCP like calling an API, and A2A like delegating a complex task to another expert service.* The Need for A2A: It emerged from the need for specialized, domain-expert agents (built internally or by partners like Salesforce) to collaborate on complex, long-running tasks (e.g., booking a multi-vendor trip, coordinating an enterprise workflow) where simple tool calls aren't enough. Google's Agent Space product heavily utilizes A2A internally.* Capability Discovery & Registries: A core concept is agents advertising their capabilities via an "agent card" (like a business card or resume). Todd envisions a future with multiple registries (public, private, enterprise-specific) where agents can discover other agents best suited for a task. This registry system is on the roadmap.* Async & Long-Running Tasks: A2A is designed for tasks that might take minutes, hours, or even days. It uses a central "Task" abstraction which is stateful. Agents communicate updates (status changes, generated artifacts, requests for more info) related to that task.* Push Notifications: For very long tasks, A2A supports a push notification mechanism. The client agent provides a secure callback URL, and the server agent can push updates (state changes, new artifacts) even if the primary connection is down. This avoids maintaining costly long-lived connections.* Multimodal Communication: The protocol supports negotiation of modalities beyond text, including rendering content within iframes (for branded experiences) or exchanging video/audio streams. Essential for future rich interactions.* Security & Auth: A2A deliberately doesn't reinvent the wheel. It relies on standard HTTP headers to carry authentication (OAuth tokens, internal enterprise credentials). Identity/auth handshakes happen "out of band" using existing protocols (OAuth, OIDC, etc.), and the resulting credentials are passed with A2A requests. Your user identity flows through standard mechanisms.* Observability: Todd confirmed OpenTelemetry (OTel) support is planned for the SDKs. Treating agents like standard microservices means leveraging existing observability tools (like W&B Weave!) is crucial for tracing and debugging multi-agent workflows.* Open Governance: While currently in a Google repo, the plan is to move A2A to a neutral foundation (like Linux Foundation) with a fully open governance model. They want this to be a true industry standard.* Getting Started: Check out the GitHub repo (github.com/google/A2A), participate in discussions, file issues, and send PRs!My take: A2A feels like a necessary piece of infrastructure for the next phase of AI agents, enabling complex, coordinated actions across different systems and vendors. While MCP handles the "how" of using tools, A2A handles the "who" and "what" of inter-agent delegation. Exciting times ahead! 
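Before moving on, here's a rough sketch of the two pieces Todd kept coming back to: the agent card an agent publishes so others can discover it, and a stateful task request delegated to it. The field and method names below are my paraphrase of the early draft spec, and the endpoints are made up – treat github.com/google/A2A as the source of truth.

```python
# Illustrative only: an A2A-style "agent card" plus a JSON-RPC task request.
# Names are paraphrased from the draft spec and may not match it exactly.
import json

agent_card = {
    "name": "trip-planner-agent",
    "description": "Plans multi-vendor trips (flights, hotels, activities)",
    "url": "https://agents.example.com/trip-planner",   # hypothetical endpoint
    "capabilities": {"streaming": True, "pushNotifications": True},
    "skills": [
        {"id": "plan-trip", "name": "Plan a trip",
         "description": "Coordinates bookings across several providers"}
    ],
}

task_request = {   # delegating a long-running task to the agent above
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tasks/send",
    "params": {
        "id": "task-123",
        "message": {
            "role": "user",
            "parts": [{"type": "text", "text": "Book me 3 nights in Tokyo in June"}],
        },
        # where the remote agent can push status/artifact updates later
        "pushNotification": {"url": "https://client.example.com/a2a/callback"},
    },
}

print(json.dumps(agent_card, indent=2))
print(json.dumps(task_request, indent=2))
```

The key mental model: the task is the stateful thing, and the remote agent keeps reporting status changes and artifacts against it – possibly for hours or days – rather than returning a single tool-call result.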
Big thanks to Todd for shedding light on this.Vision & Video: Veo-2 Arrives, Kling Gets SlickerThe visual AI space keeps advancing rapidly.Veo-2 Video Generation Hits GA in Vertex AI & Gemini App (Blog, Try It)Google's answer to Sora and Kling, Veo-2, is now Generally Available (GA) for all Google Cloud customers via Vertex AI. You can also access it in the Gemini app.Veo-2 produces stunningly realistic and coherent video, making it a top contender alongside OpenAI's Sora and Kling. Having it easily accessible in Vertex AI is a big plus for developers on Google Cloud.I've tried and keep tyring all of them, VEO2 is an absolute beast in realism. Kling 2.0 Creative Suite: A One-Stop Shop for Video AI? (X, Blog)Kuaishou's Kling model also got a major upgrade, evolving into a full Kling 2.0 Creative Suite.Anecdote: I actually stayed up quite late one night trying to piece together info from a Chinese live stream about this release! The dedication is real, folks. 😂What's New:* Kling 2.0 Master: The core video model, promising better motion, physics, and facial consistency (still 5s clips for now, but 30s/4K planned).* Kolors 2.0: An integrated image generation and restyling model (think Midjourney-style filters).* MVL (Multimodal Visual Language) Prompting: This is killer! You can now inline images directly within your text prompt for precise control (e.g., "Swap the hoodie in @video1 with the style of @image2"). This offers granular control artists have been craving.* Multi-Elements Editor: A timeline-based editor to stitch clips, add lip-sync, sound effects (including generated ones like "car horn"), and music.* Global Access: No more Chinese phone number requirement! Available worldwide at klingai.com.* Official API via FAL: Developers can now integrate Kling 2.0 via our friends at ⚡ FAL Generative Media Cloud.Kling is clearly aiming to be a holistic creative platform, reducing the need to jump between 17 different AI tools for image gen, video gen, editing, and sound. The MVL prompting is particularly innovative. Very impressive package.Voice & Audio: Talking to Dolphins? 🐬DolphinGemma: Google AI Listens to Flipper (Blog)In perhaps the most delightful news of the week, Google, in collaboration with Georgia Tech and the Wild Dolphin Project, announced DolphinGemma. It's a ~400M parameter audio model based on the Gemma architecture (using SoundStream for audio tokenization) trained specifically on decades of recorded dolphin clicks, whistles, and pulses.The goal? To decipher the potential syntax and structure within dolphin communication and eventually enable rudimentary two-way interaction using underwater communication devices. It runs on a Pixel phone for field deployment.This is just awesome. Using AI not just for human tasks but to potentially bridge the communication gap with other intelligent species is genuinely inspiring. We joked on the show about doing a segment of just dolphin noises – maybe next time if DolphinGemma gets an API! 
🤣AI Art & Diffusion & 3D: Seedream Challenges the ChampsSeedream 3.0: ByteDance's Bilingual Image Powerhouse (Tech post, arXiv, AIbase news)ByteDance wasn't just busy with video; their Seed team announced Seedream 3.0, a powerful bilingual text-to-image model.Highlights:* Generates native 2048x2048 images.* Fast inference (~3 seconds for 1Kx1K on an A100).* Excellent bilingual (Chinese/English) text rendering, even small fonts.* Uses Scaled-ROPE-v2 for better high-resolution generation without artifacts.* Claims to outperform SDXL-Turbo and Qwen-Image on fidelity and prompt adherence benchmarks.* Available via Python SDK and REST API within their Doubao Studio and coming soon to dreamina.com Phew! We made it. What an absolute avalanche of news. OpenAI truly dominated with the back-to-back launches of the hyper-capable o3/o4-mini and the massively scaled GPT-4.1 family. Google countered strongly with the versatile Gemini 2.5 Flash, key GA releases like Veo-2, and the strategically important A2A protocol. The agent ecosystem took huge leaps forward with both A2A and broader MCP adoption. And we saw continued innovation in multimodal embeddings, video generation, and even niche areas like bioacoustics and low-bit models.If you feel like you missed anything (entirely possible this week!), the TL;DR and links below should help. Please subscribe if you haven't already, and share this with a friend if you found it useful – it's the best way to support the show!I have a feeling next week won't be any slower. Follow us on X/Twitter for breaking news between shows!Thanks for tuning in, keep building, keep learning, and I'll see you next Thursday!AlexTL;DR and Show NotesEverything we covered today in bite-sized pieces with links!* Hosts and Guests* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)* Co Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed)* Todd Segal - Principal Software Engineer @ Google - Working on A2A Protocol* Big CO LLMs + APIs* 👑 OpenAI launches o3 and o4-mini in chatGPT & API (Blog, Our Coverage, o3 and o4-mini announcement)* OpenAI launches GPT 4.1, 4.1-mini and 4.1-nano in API (Our Coverage, Prompting guide)* 🚨 Google launches Gemini 2.5 Flash with controllable thinking budgets (Blog Post - Placeholder Link, API Docs)* Mistral classifiers Factory* Claude does research + workspace integration (Blog)* Cohere Embed‑4 — Multimodal embeddings for enterprise search (Blog, Docs Changelog, X)* Open Source LLMs* OpenAI open sources MRCR Long‑Context Benchmark (Hugging Face)* Microsoft BitNet v1.5 (HF)* INTELLECT‑2 — Prime Intellect’s 32B “globally‑distributed RL” experiment (Blog, X)* Z.ai (previously chatGLM) + GLM‑4‑0414 open‑source family (X, HF Collection, GitHub)* This weeks Buzz + MCP/A2A* Weave playground support for GPT 4.1 and o3/o4-mini models (X)* Chat with Todd Segal - A2A Protocol (GitHub Spec)* Vision & Video* Veo‑2 Video Generation in GA, Gemini App (Dev Blog)* Kling 2.0 Creative Suite (X, Blog)* ByteDance public Seaweed-7B, a video generation foundation model (seaweed.video)* Voice & Audio* DolphinGemma — Google AI tackles dolphin communication (Blog)* AI Art & Diffusion & 3D* Seedream 3.0 bilingual image diffusion – ByteDance (Tech post, arXiv, AIbase news)* Tools* OpenAI debuts Codex CLI, an open source coding tool for terminals (Github)* Use o3 with Windsurf (which OpenAI is rumored to buy at $3B) via the mac app integration + write back + multiple files This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:55:51
  • 💯 ThursdAI - 100th episode 🎉 - Meta LLama 4, Google tons of updates, ChatGPT memory, WandB MCP manifesto & more AI news
    Hey Folks, Alex here, celebrating an absolutely crazy (to me) milestone, of #100 episodes of ThursdAI 👏 100 episodes in a year and a half (as I started publishing much later than I started going live, and the first episode was embarrassing), 100 episodes that documented INCREDIBLE AI progress, we mention on the show today, we used to be excited by context windows jumping from 4K to 16K! I want to extend a huge thank you to every one of you, who subscribes, listens to the show on podcasts, joins the live recording (we regularly get over 1K live viewers across platforms), shares with friends and highest thank you for the paid supporters! 🫶 Sharing the AI news progress with you, energizes me to keep going, despite the absolute avalanche of news every week.And what a perfect way to celebrate the 100th episode, on a week that Meta dropped Llama 4, sending the open-source world into a frenzy (and a bit of chaos). Google unleashed a firehose of announcements at Google Next. The agent ecosystem got a massive boost with MCP and A2A developments. And we had fantastic guests join us – Michael Lou diving deep into the impressive DeepCoder-14B, and Liad Yosef & Ido Salomon sharing their wild ride creating the viral GitMCP tool.I really loved today's show, and I encourage those of you who only read, to give this a watch/listen, and those of you who only listen, enjoy the recorded version (though longer and less edited!) Now let's dive in, there's a LOT to talk about (TL;DR and show notes as always, at the end of the newsletter) Open Source AI & LLMs: Llama 4 Takes Center Stage (Amidst Some Drama)Meta drops Llama 4 - Scout 109B/17BA & Maverick 400B/17BA (Blog, HF, Try It)This was by far the biggest news of this last week, and it dropped... on a Saturday? (I was on the mountain ⛷️! What are you doing Zuck) Meta dropped the long awaited LLama-4 models, huge ones this time* Llama 4 Scout: 17B active parameters out of ~109B total (16 experts).* Llama 4 Maverick: 17B active parameters out of a whopping ~400B total (128 experts).* Unreleased: Behemoth - 288B active with 2 Trillion total parameters chonker!* Both base and instruct finetuned models were releasedThese new models are all Multimodal, Multilingual MoE (mixture of experts) architecture, and were trained with FP8, for significantly more tokens (around 30 Trillion Tokens!) with interleaved attention (iRoPE), and a refined SFT > RL > DPO post-training pipeline.The biggest highlight is the stated context windows, 10M for Scout and 1M for Maverick, which is insane (and honestly, I haven't yet seen a provider that is even remotely able to support anything of this length, nor do I have the tokens to verify it) The messy release - Big Oof from Big ZuckNot only did Meta release on a Saturday, messing up people's weekends, Meta apparently announced a high LM arena score, but the model they provided to LMArena was... not the model they released!?It caused LMArena to release the 2000 chats dataset, and truly, some examples are quite damning and show just how unreliable LMArena can be as vibe eval. Additionally, during the next days, folks noticed discrepancies between the stated eval scores Meta released, and the ability to evaluate them independently, including our own Wolfram, who noticed that a quantized version of Scout, performed better on his laptop while HIGHLY quantized (read: reduced precision) than it was performing on the Together API inference endpoint!? 
We've chatted on the show that this may be due to some VLLM issues, and speculated about other potential reasons for this. Worth noting the official response from Ahmad Al-Dahle, head of LLama at Meta, who mentioned stability issues between providers and absolutely denied any training on any benchmarksToo big for its own good (and us?)One of the main criticism the OSS community had about these releases, is that for many of us, the reason for celebrating Open Source AI, is the ability to run models without network, privately on our own devices. Llama 3 was released in 8-70B distilled versions and that was incredible for us local AI enthusiasts! These models, despite being "only" 17B active params, are huge and way to big to run on most local hardware, and so the question then is, if we're getting a model that HAS to run on a service, why not use Gemini 2.5 that's MUCH better and faster and cheaper than LLama? Why didn't Meta release those sizes? Was it due to an inability to beat Qwen/DeepSeek enough? 🤔 My TakeDespite the absolutely chaotic rollout, this is still a monumental effort from Meta. They spent millions on compute and salaries to give this to the community. Yes, no papers yet, the LM Arena thing was weird, and the inference wasn't ready. But Meta is standing up for Western open-source in a big way. We have to celebrate the core contribution while demanding better rollout practices next time. As Wolfram rightly said, the real test will be the fine-tunes and distillations the community builds on these base models. Releasing the base weights is crucial for that. Let's see if the community can tame this beast once the inference dust settles. Shout out to Ahmed Al-Dahle and the whole Llama team at Meta – incredible work, messy launch, but thank you for pushing open source forward. 🎉Together AI & Agentica (Berkley) finetuned DeepCoder-14B with reasoning (X, Blog)Amidst the Llama noise, we got another stellar open-source release! We were thrilled to have Michael Lou from Agentica/UC Berkeley join us to talk about DeepCoder-14B-Preview which beats DeepSeek R1 and even o3-mini on several coding benchmarks. Using distributed Reinforcement Learning (RL), it achieves 60.6% Pass@1 accuracy on LiveCodeBench, matching the performance of models like o3-mini-2025-01-31 (Low) despite its smaller size.The stated purpose of the project is to democratize RL and they have open sourced the model (HF), the dataset (HF), the Weights & Biases logs and even the eval logs! Shout out to Michael, Sijun and Alpay and the rest of the team who worked on this awesome model! NVIDIA Nemotron ULTRA is finally here, 253B pruned Llama 3-405B (HF)While Llama 4 was wrapped in mystery, NVIDIA dropped their pruned and distilled finetune of the previous Llama chonker 405B model, turning at just about half the parameters. And they were able to include the LLama-4 benchmarks in their release, showing that the older Llama, finetuned can absolutely beat the new ones at AIME, GPQA and more. As a reminder, we covered the previous 2 NEMOTRONS and they are a combined reasoning and non reasoning models, so the jump is not that surprising, and it does seem like a bit of eval cherry picking happened here. Nemotron Ultra supports 128K context and fits on a single 8xH100 node for inference. Built on open Llama models and trained on vetted + synthetic data, it's commercially viable. 
Shout out to NVIDIA for releasing this, and especially for pushing open reasoning datasets which Nisten rightly praised as having long-term value beyond the models themselves.More Open Source Goodness: Jina, DeepCogito, KimiThe open-source train didn't stop there:* Jina Reranker M0: Our friends at Jina released a state-of-the-art multimodal reranker model. If you're doing RAG with images and text, this looks super useful for improving retrieval quality across languages and modalities (Blog, HF)* DeepCogito: A new company emerged releasing a suite of Llama fine-tunes (3B up to 70B planned, with larger ones coming) trained using a technique they call Iterated Distillation and Amplification (IDA). They claim their 70B model beats DeepSeek V2 70B on some benchmarks . Definitely one to watch. (Blog, HF)* Kimi-VL & Kimi-VL-Thinking: MoonShot, who sometimes get lost in the noise, released incredibly impressive Kimi Vision Language Models (VLMs). These are MoE models with only ~3 Billion active parameters, yet they're showing results on par with or even beating models 10x larger (like Gemma 2 27B) on benchmarks like MathVision and ScreenSpot. They handle high-res images, support 128k context, and crucially, include a reasoning VLM variant. Plus, they're MIT licensed! Nisten's been following Kimi and thinks they're legit, just waiting for easier ways to run them locally. Definitely keep an eye on Kimi. (HF)This Week's Buzz from Weights & Biases - Observable Tools & A2A!This week was personally very exciting on the W&B front, as I spearheaded and launched initiatives directly related to the MCP and A2A news!W&B launches the observable.tools initiative!As MCP takes off, one challenge becomes clear: observability. When your agent calls an external MCP tool, that part of the execution chain becomes a black box. You lose the end-to-end visibility crucial for debugging and evaluation.That's why I'm thrilled that we launched Observable Tools (Website) – an initiative championing full-stack agent observability, specifically within the MCP ecosystem. Our vision is to enable developers using tools like W&B Weave to see inside those MCP tool calls, getting a complete trace of every step.The core of this is Proposal RFC 269 on the official MCP GitHub spec, which I authored! (My first RFC, quite the learning experience!). It details how to integrate OpenTelemetry tracing directly into the MCP protocol, allowing tools to securely report detailed execution spans back to the calling client (agent). We went deep on the spec, outlining transmission mechanisms, schemas, and rationale.My ask to you, the ThursdAI community: Please check out observable.tools, read the manifesto, watch the fun video we made, and most importantly, go to the RFC 269 proposal (shortcut: wandb.me/mcp-spec). Read it, comment, give feedback, and upvote if you agree! We need community support to make this impossible for the MCP maintainers to ignore. Let's make observability a first-class citizen in the MCP world! We also invite our friends from across the LLM observability landscape (LangSmith, Braintrust, Arize, Galileo, etc.) to join the discussion and collaborate.W&B is a Launch Partner for Google's A2AAs mentioned earlier, we're also excited to be a launch partner for Google's new Agent2Agent (A2A) protocol. We believe standardized communication between agents is the next critical step, and we'll be supporting A2A alongside MCP in our tools. Exciting times for agent infrastructure! 
I've invited Google folks to next week to discuss the protocol in depth! ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.Big Company LLMs + APIs: Google's Onslaught & OpenAI's Memory UpgradeWhile open source had a wild week, the big players weren't sleeping. Google especially came out swinging at Google Next.Google announces TONS of new things at Next 🙌 (Blog)Google I/O felt like a preview, Google Next felt like the delivery truck backing up and dumping everything. Here's the whirlwind tour:* Gemini 2.5 Flash API: The faster, cheaper Gemini 2.5 model is coming soon to Vertex AI. (Still waiting on that general API access!).* Veo 2 Editing: Their top-tier video model (competing with Sora, Kling) gets editing capabilities. Very cool.* Imagen 3 Updates: Their image model gets improvements, including inpainting.* Lyria: Text-to-music model moves into preview.* TPU v7 (Ironwood): New TPU generation coming soon. As Nisten noted, Google's infrastructure uptime is consistently amazing, which could be a winning factor regardless of model SOTA status.* Chirp 3 HD Voices + Voice Cloning: This one raised eyebrows. The notes mentioned HD voices and voice cloning. Cloning is a touchy subject the big players usually avoid publicly (copyright, deepfakes). Still digging for confirmation/details on this – if Google is really offering public voice cloning, that's huge. Let me know if you find a link!* Deep Research gets Gemini 2.5 Pro: The experimental deep research feature in Gemini (their answer to OpenAI's research agent) now uses the powerful 2.5 Pro model. Google released comparison stats showing users strongly prefer it (70%) over OpenAI's offering, citing better instruction following and comprehensiveness. I haven't fully tested the 2.5 version yet, but the free tier access is amazing. and just look at those differences in preference compared to OAI Deep Research! Firebase Studio (firebase.studio): Remember Project IDX? It's been rebranded and launched as Firebase Studio. This is Google's answer to the wave of "vibe coding" web builders like Lovable, Bolt and a few more. It's a full-stack, cloud-based GenAI environment for building, testing, and deploying apps, integrated with Firebase and likely Gemini. Looks promising!Google Embraces MCP & Launches A2A Protocol!Two massive protocol announcements from Google that signal the maturation of the AI agent ecosystem:* Official MCP Support! (X)This is huge. Following Microsoft and AWS, Google (via both Sundar Pichai and Demis Hassabis) announced official support for Anthropic's Model Context Protocol (MCP) in Gemini models and SDKs. MCP is rapidly becoming the standard for how agents discover and use tools securely and efficiently. With Google onboard, there's basically universal major vendor support. MCP is here to stay.* Agent2Agent (A2A) Protocol (Blog , Spec, W&B Blog)Google also launched a new open standard, A2A, designed for interoperability between different AI agents. Think of agents built by different vendors (Salesforce, ServiceNow, etc.) needing to talk to each other securely to coordinate complex workflows across enterprise systems. Built on web standards (HTTP, SSE, JSON-RPC), it handles discovery, task management (long-running!), and modality negotiation. Importantly, Google positions A2A as complementary to MCP, not competitive. MCP is how an agent uses a tool, A2A is how an agent talks to another agent. 
Weights & Biases is proud to be one of the 50+ launch partners working with Google on this! We'll do a deeper dive soon, but this + MCP feels like the foundation for a truly interconnected agent future.Cloudflare - new Agents SDK (agents.cloudflare.com)Speaking of agents, Cloudflare launched their new Agents SDK (npm i agents). Built on their serverless infrastructure (Workers, Durable Objects), it offers a platform for building stateful, autonomous AI agents with a compelling pricing model (pay for CPU time, not wall time). This ties into the GitMCP story later – Cloudflare is betting big on the edge agent ecosystem.Other Big Co News:* Anthropic MAX: A new $200/month tier for Claude, offering higher usage quotas but no new models. Meh.* Grok 3 API: Elon's xAI finally launched the API tier for Grok 3 (plus Fast and Mini variants). Now you can test its capabilities yourself. We're still waiting for the promised Open Source Grok-2🚨 BREAKING NEWS 🚨 OpenAI Upgrades MemoryRight on cue during the show, OpenAI dropped a feature update! Sam Altman hyped something coming, and while it wasn't the o3/o4-mini models (those are coming next), it's a significant enhancement to ChatGPT Memory.Previously, Memory tried to selectively save key facts. Now, when enabled, it can reference ALL of your past chats to personalize responses. Preferences, interests, past projects – it can potentially draw on everything. OpenAI states there's no storage limit for what it can reference.How? Likely some sophisticated RAG/vector search under the hood, not stuffing everything into context. LDJ mentioned he might have had this rolling out silently for weeks, and while the immediate difference wasn't huge, the potential is massive as models get better at utilizing this vast personal context.The immediate reaction? Excitement mixed with a bit of caution. As Wolfram pointed out, do I really want it remembering every single chat? Configurable memory (flagging chats for inclusion/exclusion) seems like a necessary follow-up. Thanks for the feature request, Wolfram! (And yes, Europe might not get this right away anyway...). This could finally stop ChatGPT from asking me basic questions it should know from our history!Prompt suggestion: Ask the new chatGPT with memory, a think that you asked it that you likely forgot.Just don't asked it what was the most boring thing you asked it, I got cooked I'm still feeling raw 😂 Vision & Video: Kimi Drops Tiny But Mighty VLMsThe most impressive long form AI video paper dropped, showing that it's possible to create 1 minute long video, with incredible character and scene consistencyThis paper introduces TTT layers (Test Time Training) to a pre-trained transformer, allowing it to one shot generate these incredibly consistent long scenes. Can't wait to see how the future of AI video evolves with this progress! AI Art & Diffusion & 3D: HiDream Takes the Open CrownHiDream-I1-Dev 17B MIT license new leading open weights image gen! (HF)Just when we thought the image gen space was settling, HiDream, a Chinese company, open-sourced their HiDream-I1 family under MIT license! This 17B parameter model comes in Dev, Full, and Fast variants.The exciting part? Based on early benchmarks (like Artificial Analysis Image Arena), HiDream-I1-Dev surpasses Flux 1.1 [Pro], Recraft V3, Reve and Imagen 3 while being open source! It boasts outstanding prompt following and text rendering capabilities.HiDream's API is coming soon too and I really hope it's finetunable! 
Tools: GitMCP - The Little Repo Tool That CouldGitMCP - turn any github repo into an MCP server (website)We wrapped up the show with a fantastic story from the community. We had Liad Yosef (Shopify) and Ido Salomon (Palo Alto Networks) join us to talk about GitMCP.It started with a simple problem: a 3MB LLM.txt file (a format proposed by Jeremy Howard for repo documentation) too large for context windows. Liad and Ido, working nights and weekends, built an MCP server that could ingest any GitHub repo (prioritizing LLM.txt if present, falling back to Readmes/code comments) and expose its documentation via MCP tools (semantic search, fetching).This means any MCP-compatible client (like Cursor, potentially future ChatGPT/Gemini) can instantly query the documentation of any public GitHub repo just by pointing to the GitMCP URL for that repo (e.g., https://gitmcp.io/user/repo). As Yam Peleg pointed out during the show, the genius here is dynamically generating a customized tool specifically for that repo, making it incredibly easy for the LLM to use.Then, the story got crazy. They launched, went viral, almost melted their initial Vercel serverless setup due to traffic and SSE connection costs (100$+ per hour!). DMs flew back and forth with Vercel's CEO, then Cloudflare's CTO swooped in offering to sponsor hosting on Cloudflare's unreleased Agents platform if they migrated immediately. A frantic weekend of coding ensued, culminating in a nail-biting domain switch and a temporary outage before getting everything stable on Cloudflare.The project has received massive praise (including from Jeremy Howard himself) and is solving a real pain point for developers wanting to easily ground LLMs in project documentation. Huge congrats to Liad and Ido for the amazing work and the wild ride! Check out gitmcp.io!Wrapping Up Episode 100!Whew! What a show. From the Llama 4 rollercoaster to Google's AI barrage, the rise of agent standards like MCP and A2A, groundbreaking open source models, and incredible community stories like GitMCP – this episode truly showed an exemplary week in AI and underlined the reason I do this every week. It's really hard to keep up, and so if I commit to you guys, I stay up to date myself! Hitting 100 episodes feels surreal. It's been an absolute privilege sharing this journey with Wolfram, LDJ, Nisten, Yam, all our guests, and all of you. Seeing the community grow, hitting milestones like 1000 YouTube subscribers today, fuels us to keep going 🎉 The pace isn't slowing down. If anything, it's accelerating. But we'll be right here, every Thursday, trying to make sense of it all, together.If you missed anything, don't worry! Subscribe to the ThursdAI News Substack for the full TL;DR and links below.Thanks again for making 100 episodes possible. Here's to the next 100! 🥂Keep tinkering, keep learning, and I'll see you next week.AlexTL;DR and Show Notes* Hosts and Guests* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)* Co Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed* Michael Luo @michaelzluo - CS PhD @ UC Berkeley; AI & Systems* Liad Yosef (@liadyosef), Ido Salomon (@idosal1) - GitMCP creators* Open Source LLMs * Meta drops LLama 4 (Scout 109B/17BA & Maverick 400B/17BA) - (Blog, HF, Try It)* Together AI and Agentica (UC Berkley) announce DeepCoder-14B (X, Blog)* NVIDIA Nemotron Ultra is here! 
253B pruned LLama 3-405B (X, HF)* Jina Reranker M0 - SOTA multimodal reranker model (Blog, HF)* DeepCogito - SOTA models 3-70B - beating DeepSeek 70B - (Blog, HF)* ByteDance new release - Seed-Thinking-v1.5* Big CO LLMs + APIs* Google announces TONS of new things 🙌 (Blog)* Google launches Firebase Studio (website)* Google is announcing official support for MCP (X)* Google announces A2A protocol - agent 2 agent communication (Blog, Spec, W&B Blog)* Cloudflare - new Agents SDK (Website)* Anthropic MAX - $200/mo with more quota* Grok 3 finally launches API tier (API)* OPenAI adds enhanced memory to ChatGPT - can remember all your chats (X)* This weeks Buzz - MCP and A2A* W&B launches the observable.tools initiative & invite people to comment on the MCP RFC* W&B is the launch partner for Google's A2A (Blog)* Vision & Video* Kimi-VL and Kimi-VL-Thinking - A3B vision models (X, HF)* One-Minute Video Generation with Test-Time Training (Blog, Paper)* Voice & Audio* Amazon - Nova Sonic - speech2speech foundational model (Blog)* AI Art & Diffusion & 3D* HiDream-I1-Dev 17B MIT license new leading open weights image gen 0 passes Flux1.1[pro] ! (HF)* Tools* GitMCP - turn any github repo into an MCP server (try it)ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:32:18

About ThursdAI - The top AI news from the past week

Every ThursdAI, Alex Volkov hosts a panel of experts, AI engineers, data scientists and prompt spellcasters on Twitter Spaces, as we discuss everything major and important that happened in the world of AI for the past week. Topics include LLMs, open source, new capabilities, OpenAI, competitors in the AI space, new LLM models, AI art and diffusion aspects and much more. sub.thursdai.news