Issue #386 - The ML Engineer 🤖
Netflix Democratizing MLOps, Stanford AI Index Report 2026, META on ProgramBench, OpenAI on Delivering Voice AI, DeepMind Accelerating Gemma 4 + more 🚀
Thank you for being part of the over 70,000 ML professionals and enthusiasts who receive weekly articles & tutorials on Machine Learning & MLOps 🤖 You can join the newsletter at https://bit.ly/state-of-ml-2025
If you like the content, please support the newsletter by sharing it with your friends via ✉️ Email, 🐦 Twitter, 💼 LinkedIn and 📘 Facebook!
This week in ML Engineering:
Netflix Democratizing MLOps
Stanford AI Index Report 2026
META on ProgramBench
OpenAI on Delivering Voice AI
DeepMind Accelerating Gemma 4
Open Source ML Frameworks
Awesome AI Guidelines to check out this week
+ more 🚀
Netflix shared an in-depth blog on how they democratized machine learning across their organisation by building the ML Model Lifecycle Graph, and there are some really practical learnings: Netflix describes how they built a Metadata Service and Model Lifecycle Graph to make ML assets discoverable, reusable, and debuggable across a fragmented production ML ecosystem spanning models, features, datasets, pipelines, experiments, and ownership systems. The core idea is that they ingest lightweight change events from source systems, hydrate them from the source of truth, normalize them into globally addressable entities, store relationship-heavy metadata in Datomic, index searchable fields in Elasticsearch, and asynchronously enrich cross-system links such as model-to-pipeline-to-A/B-test lineage. For production ML practitioners, it is an important reminder that teams can answer impact, lineage, ownership, and reuse questions from a unified metadata graph if it's well documented and accessible to all teams. Mature ML platforms need metadata infrastructure as much as training or serving infrastructure, and it is interesting to see how many learnings from the DataOps space can be applied to MLOps.
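To make the ingestion flow concrete, below is a minimal Python sketch of the change-event-to-entity step described above; all class, field, and client names are hypothetical illustrations, not Netflix's actual APIs.

```python
from dataclasses import dataclass, field

@dataclass
class MLEntity:
    """A globally addressable node in a model lifecycle graph."""
    urn: str                                   # e.g. "urn:ml:model:ranker:v12"
    kind: str                                  # "model" | "feature" | "pipeline" | ...
    attributes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (relation, target_urn) pairs

def handle_change_event(event: dict, source_client) -> MLEntity:
    """Ingest a lightweight change event, hydrate it from the source of
    truth, and normalize it into a globally addressable graph entity."""
    # The event carries only an identifier and a change type; the full
    # record is fetched from the system that owns the asset.
    record = source_client.fetch(event["entity_id"])
    entity = MLEntity(
        urn=f"urn:ml:{event['entity_type']}:{event['entity_id']}",
        kind=event["entity_type"],
        attributes={"owner": record["owner"], "updated_at": record["updated_at"]},
    )
    # Cross-system links (model -> pipeline -> experiment) are attached as
    # edges; in Netflix's design these are enriched asynchronously.
    for upstream_urn in record.get("upstream", []):
        entity.edges.append(("derived_from", upstream_urn))
    return entity
```

In a full system the resulting entities would land in a relationship-heavy store (Datomic in Netflix's case) with the searchable fields indexed separately (Elasticsearch).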
Stanford University has just dropped the 2026 Stanford AI Index Report, which flags critical risks around security, incident spikes, and infrastructure constraints - here are some key insights: model capability is still accelerating, with industry now producing over 90% of notable models and adoption reaching mainstream levels. Top frontier labs are converging so closely that production ML teams should prioritize guarding against vendor lock-in so they can switch providers quickly when necessary. There are also operating risks: benchmarks are saturating or proving unreliable, agents fail roughly one in three structured tasks, and incidents are rising alongside constraints around chips, data centers, energy, water, and supply chains. For practitioners, the report's main implication is that competitive advantage in 2026 is less about simply accessing the strongest model and more about building robust ML systems around it. The moat is now in reproducible evaluations, monitoring, incident response, data governance, cost-aware inference, human oversight, and clear measurement of productivity and safety outcomes.
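One practical way to act on the vendor lock-in point is to route all model calls through a single internal interface so providers can be swapped via config; the sketch below is a hypothetical illustration (class and method names are ours, and the SDK calls are left as stubs).

```python
from typing import Protocol

class ChatBackend(Protocol):
    """Minimal provider-agnostic interface for chat completions."""
    def complete(self, prompt: str) -> str: ...

class OpenAIBackend:
    def complete(self, prompt: str) -> str:
        raise NotImplementedError("wire up the OpenAI SDK here")

class AnthropicBackend:
    def complete(self, prompt: str) -> str:
        raise NotImplementedError("wire up the Anthropic SDK here")

# Switching frontier providers becomes a one-line config change rather
# than a rewrite of every call site.
BACKENDS = {"openai": OpenAIBackend, "anthropic": AnthropicBackend}

def get_backend(name: str) -> ChatBackend:
    return BACKENDS[name]()
```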
META is launching a new benchmark that aims to test LLMs on building large-scale, end-to-end applications like ffmpeg, sqlite, or interpreters from docs, instead of just snippets or pull requests: ProgramBench is META's new benchmark which evaluates whether coding agents can rebuild full software projects from scratch using only a compiled executable and documentation, rather than editing an existing repo or filling in a scaffold. Current agents are far from reliable autonomous software engineers: across 200 real open-source tasks, no model fully solved any task, and the best model only passed >=95% of tests on 3% of tasks. The benchmark is valuable because it tests the full lifecycle that production teams actually care about: discovering behavior by probing an executable, making architecture and language choices, implementing a buildable system, and matching black-box behavioral tests without being constrained to the original code structure. This does open some questions on how to mitigate pollution of training data, i.e. ensuring that the models don't simply have the solutions injected as part of their training, but that is an issue that already plagues existing benchmarks today...
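For intuition on what "matching black-box behavioral tests" looks like in practice, here is a small hypothetical Python harness in that spirit (not ProgramBench's actual evaluation code): it runs a reference executable and a rebuilt one on the same inputs and compares only observable behavior.

```python
import subprocess

def run(binary: str, args: list[str], stdin: bytes = b"") -> tuple[int, bytes]:
    """Run a binary and capture its observable behavior only."""
    proc = subprocess.run([binary, *args], input=stdin,
                          capture_output=True, timeout=30)
    return proc.returncode, proc.stdout

def behavioral_pass_rate(reference: str, rebuilt: str,
                         test_cases: list[dict]) -> float:
    """Fraction of test cases where the rebuilt binary matches the
    reference on exit code and stdout; internals are never inspected,
    so the agent is free to choose any architecture or language."""
    passed = 0
    for case in test_cases:
        ref = run(reference, case["args"], case.get("stdin", b""))
        new = run(rebuilt, case["args"], case.get("stdin", b""))
        passed += int(ref == new)
    return passed / len(test_cases)
```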
OpenAI has shared how they tackle real-time voice AI with millisecond latency to ensure seamless experiences; at scale, shaving latency and stabilizing media transport can be the difference between a magical conversational agent and a frustrating push-to-talk demo. OpenAI's write-up is a useful production-infra case study for ML teams building realtime voice agents, as the main constraint is not just model latency but the end-to-end path, including connection setup, NAT traversal, packet loss, jitter, first-hop routing, and stable ownership of WebRTC session state. From their post, their solution splits responsibilities between a lightweight UDP relay and a stateful WebRTC transceiver. The broader takeaway for production ML practitioners is that realtime AI quality depends on treating networking and protocol termination as first-class ML platform concerns; namely, keep client behavior standards-compliant, isolate hard session state, add complexity in a thin edge-routing layer, and let backend model services scale like normal services rather than WebRTC peers.
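To illustrate the split, below is a deliberately minimal Python sketch of the "lightweight UDP relay" half of that design; the addresses and single-client assumption are ours, and a real relay would also handle NAT keepalives, DTLS, and many concurrent sessions.

```python
import socket

CLIENT_FACING = ("0.0.0.0", 5004)   # where clients send media over UDP
TRANSCEIVER = ("10.0.0.7", 5004)    # hypothetical stateful WebRTC backend

def relay_forever() -> None:
    """Forward packets between one client and the stateful transceiver,
    holding no session state beyond the last-seen client address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(CLIENT_FACING)
    client_addr = None
    while True:
        data, addr = sock.recvfrom(2048)
        if addr == TRANSCEIVER:
            if client_addr is not None:
                sock.sendto(data, client_addr)   # backend -> client
        else:
            client_addr = addr                   # remember the first hop
            sock.sendto(data, TRANSCEIVER)       # client -> backend
```

The point of the sketch is the asymmetry: the relay stays stateless and cheap to scale at the edge, while the hard WebRTC session state lives in one stateful component.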
Inference speed is becoming one of the most important battlegrounds in production AI - DeepMind shares how they accelerated Gemma 4 inference: Google introduces "Multi-Token Prediction drafters" as a production-inference optimization, pairing each Gemma 4 target model with a lightweight multi-token drafter that speculatively proposes several future tokens while the main model verifies them in parallel. Google reports speedups of up to 3x without changing final output quality, which is quite impressive if it holds up. For ML practitioners, the relevant takeaway is that latency bottlenecks are increasingly being attacked through serving architecture such as KV-cache sharing, activation reuse, and runtime-specific support across serving frameworks (e.g. vLLM/MLX/Transformers) rather than only through smaller models or quantization. This is especially relevant for chat, coding assistants, agentic workflows, and on-device or workstation deployments where responsiveness and memory bandwidth dominate user experience, but teams should benchmark against their own prompts, batch sizes, sampling settings, and hardware.
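For readers new to speculative decoding, here is a schematic Python sketch of the draft-then-verify loop in the spirit of what is described above; `drafter` and `target` are hypothetical objects, not the actual Gemma serving API.

```python
def speculative_step(target, drafter, context: list[int], k: int = 4) -> list[int]:
    """One decode step: the cheap drafter proposes k tokens, the target
    verifies them in a single parallel forward pass, and we keep the
    longest agreeing prefix plus one target token, so the output matches
    decoding with the target alone (under greedy sampling)."""
    draft = drafter.propose(context, k)  # k cheap speculative tokens
    # Hypothetical call: the target scores all draft positions at once and
    # returns its own greedy token after each draft prefix (k + 1 tokens).
    verified = target.greedy_after_prefixes(context, draft)
    accepted: list[int] = []
    for d, v in zip(draft, verified):
        if d != v:
            break            # first disagreement: stop accepting draft tokens
        accepted.append(d)
    # Always make progress by appending the target's own next token.
    accepted.append(verified[len(accepted)])
    return context + accepted
```

The speedup comes from replacing several sequential target forward passes with one drafter pass plus one parallel verification pass.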
Upcoming MLOps Events
The MLOps ecosystem continues to grow at break-neck speed, making it ever harder for us as practitioners to stay up to date with relevant developments. A fantastic way to keep on top of relevant resources is through the great community and events that the MLOps and Production ML ecosystem offers. This is the reason why we have started curating a list of upcoming events in the space, which are outlined below.
Events we are speaking at this year:
eTail Europe - March @ Berlin
World Summit AI Europe - September @ Amsterdam
Other relevant events:
KubeCon Europe - March @ Amsterdam
PyData Berlin - April @ Frankfurt
Databricks Summit - June @ San Francisco
World Developer Congress - July @ Berlin
EuroPython 2026 - July @ Prague
EuroSciPy 2026 - July @ Krakow
AI Infra Summit 2026 - Sept @ California
Code.Talks 2026 - Nov @ Hamburg
MLOps World 2026 - Nov @ Austin
In case you missed our talks, check our recordings below:
The State of AI in 2025 - WeAreDevelopers 2025
Prod Generative AI in 2024 - KubeCon AI Day 2025
The State of AI in 2024 - WeAreDevelopers 2024
Responsible AI Workshop Keynote - NeurIPS 2021
Practical Guide to ML Explainability - PyCon London
ML Monitoring: Outliers, Drift, XAI - PyCon Keynote
Metadata for E2E MLOps - Kubecon NA 2022
ML Performance Evaluation at Scale - KubeCon Eur 2021
Industry Strength LLMs - PyData Global 2022
ML Security Workshop Keynote - NeurIPS 2022
Open Source MLOps Tools
Check out the fast-growing ecosystem of production ML tools & frameworks at the GitHub repository, which has reached over 20,000 ⭐ GitHub stars. We are currently looking for more libraries to add - if you know of any that are not listed, please let us know or feel free to add a PR. Here's a few featured open source libraries that we maintain:
KAOS - K8s Agent Orchestration Service for managing the KAOS in large-scale distributed agentic systems.
Kompute - Blazing fast, lightweight and mobile phone-enabled GPU compute framework optimized for advanced data processing use cases.
Production ML Tools - A curated list of tools to deploy, monitor and optimize machine learning systems at scale.
AI Policy List - A mature list that maps the ecosystem of artificial intelligence guidelines, principles, codes of ethics, standards, regulation and beyond.
Agentic Systems Tools - A new list that aims to map the emerging ecosystem of agentic systems with tools and frameworks for scaling this domain.
Please do support some of our open source projects by sharing, contributing or adding a star ⭐
About us
The Institute for Ethical AI & Machine Learning is a European research centre that carries out world-class research into responsible machine learning.
