Issue #387 - The ML Engineer 🤖

Toto 2.0 Time Series Foundation Model, TabPFN-3 Tabular Foundation Models, Stanford SWE Real-World Dataset, DeepMind AlphaEvolve Scaling Impact, Mozilla on Mythos Vulnerabilities + more 🚀

May 17, 2026

Thank you for being one of the 70,000+ ML professionals and enthusiasts who receive weekly articles & tutorials on Machine Learning & MLOps 🤖 You can join the newsletter at https://bit.ly/state-of-ml-2025 ⭐

If you like the content, please support the newsletter by sharing it with your friends via ✉️ Email, 🐦 Twitter, 💼 LinkedIn and 📕 Facebook!

This week in ML Engineering:

Toto 2.0 Time Series Foundation Model
TabPFN-3 Tabular Foundation Models
Stanford SWE Real-World Dataset
DeepMind AlphaEvolve Scaling Impact
Mozilla on Mythos Vulnerabilities
Open Source ML Frameworks
Awesome AI Guidelines to check out this week
+ more 🚀

Toto 2.0 Time Series Foundation Model

Datadog has just dropped a huge update to Toto, their Time Series Foundation Model for observability, and this is really exciting for production ML practitioners: Toto 2.0 is a new Apache 2.0 open-weights model family ranging from a small 4M parameters all the way to 2.5B parameters. At least on Datadog’s own observability-heavy benchmark, the results suggest real potential for real-world use, and they also offer an interesting insight: domain-specific time-series foundation models are a realistic path for practical use. It is also good to see Datadog be quite honest about the remaining gaps, as even the largest model still shows long-horizon drift and structural breakdown past the training context, so classical baselines are not going away. But overall this feels like a meaningful step towards forecasting foundation models becoming a real option, and potentially towards broader observability models that reason across metrics, traces, logs, topology, code changes, alerts and events for proactive incident detection.
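Since classical baselines are not going away, it can help to keep one next to any foundation-model forecast. Below is a minimal sketch of a seasonal-naive baseline with a mean absolute error check, in plain Python. The function names and the choice of season length are illustrative assumptions and are not part of Toto 2.0 or any Datadog API.

```python
from typing import List, Sequence


def seasonal_naive_forecast(
    history: Sequence[float], season: int, horizon: int
) -> List[float]:
    """Forecast by repeating the last full seasonal cycle of the history.

    `season` is the cycle length in steps (hypothetical example: 24 for
    hourly metrics with a daily pattern).
    """
    if season <= 0:
        raise ValueError("season must be a positive number of steps")
    if len(history) < season:
        raise ValueError("history must cover at least one full season")
    last_cycle = list(history[-season:])
    # Step i of the forecast reuses the value from the same phase of the cycle.
    return [last_cycle[i % season] for i in range(horizon)]


def mean_absolute_error(
    actual: Sequence[float], predicted: Sequence[float]
) -> float:
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
```

Scoring a foundation-model forecast and this baseline with the same error function over the same horizon gives a quick sanity check on whether the larger model earns its cost on a given metric.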

TabPFN-3 Tabular Foundation Models

Tabular ML is still where a huge amount of production ML actually happens, so it is great to see the latest Tabular Foundation Model release from Prior Labs: TabPFN-3 is live with support for up to 1M training rows, row-chunking, a reduced KV-cache, native missing-value handling, many-class classification up to 160 classes, GPU-side preprocessing, and much faster inference than TabPFN-2.5. The benchmarks are interesting: as you might expect, TabPFN-3 reports better performance than tuned and ensembled baselines on TabArena and beats 8-hour-tuned gradient-boosted-tree baselines on datasets up to 1M rows and 200 features, but it will also be interesting to see real-world and competitive benchmarks as further alternatives arise. For production ML practitioners, the interesting part is less the leaderboard and more the potential that we’re moving to a world where we can get faster baselines, less painful hyperparameter search, better calibrated predictive distributions, CPU-friendly distillation and faster SHAP-style interpretability workflows.

Stanford SWE Real-World Dataset

Stanford has just published a really interesting new dataset of real coding-agent sessions using public GitHub repos with ~6K sessions, 63K user prompts, 355K tool calls, git-linked diffs, and line-level attribution of whether code was written by humans or agents. It is interesting to see that coding-agent usage is already becoming extremely bimodal, with around 41% of sessions basically “vibe coding” where the agent writes almost all committed code, while 23% are still human-only. However, the important takeaway is that coding agents are still very inefficient and risky when used in the wild: only ~44% of agent-produced code survives into commits, users push back or interrupt in roughly 44% of turns, and vibe-coded commits introduce substantially more Semgrep-detected vulnerabilities than human-only or collaborative coding. For production ML practitioners building coding agents, evals, IDE copilots, or internal developer tooling, this is a strong reminder that the winning pattern is probably not full autonomy, but better scaffolding around agents.

DeepMind AlphaEvolve Scaling Impact

Google DeepMind showcases how they are bringing AlphaEvolve to life as an optimization engine for the expensive parts of ML, science and infrastructure: AlphaEvolve has now been applied across genomics, power grids, disaster prediction, quantum circuits, mathematics, TPU design, Spanner, compiler optimization, logistics, advertising, lithography and ML force fields, with some genuinely impressive reported numbers. In their report they outline a 30% reduction in DNA variant detection errors for DeepConsensus, AC Optimal Power Flow feasible-solution rates going from 14% to over 88%, 10x lower-error quantum circuits, 20% lower Spanner write amplification, and nearly 9% lower software storage footprint. It seems they also have external use-cases, with Klarna doubling training speed and Schrödinger seeing roughly 4x speedups for MLFF training and inference.

Mozilla on Mythos Vulnerabilities

Mozilla shared a great behind-the-scenes look at how they used Claude Mythos Preview and other models to harden Firefox, and what is really interesting is that this was not just “LLM finds bugs” but a proper end-to-end security pipeline: The Mozilla team built an agentic harness on top of existing fuzzing infrastructure, where models could inspect risky parts of the browser codebase, generate reproducible test cases, run them, and then feed validated findings into the normal lifecycle for deduplication, triage, patching and release. The numbers are quite impressive, with Firefox 150 shipping fixes for 271 bugs found with Claude Mythos Preview, including 180 sec-high issues, and Mozilla fixing 423 security bugs across April releases when combining this pipeline with other AI models and manual review. This feels like a very clear MLSecOps pattern that more teams will need to adopt, where frontier models become scalable security-reasoning workers inside controlled harnesses that can execute tests, verify claims and eventually scan patches continuously in CI/CD pipelines.
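To make the "validate first, then deduplicate" idea concrete, here is a minimal sketch in plain Python. It is not Mozilla’s harness: the Finding fields, the crash-signature key and the reproduces callback are all hypothetical names chosen for illustration, and a real pipeline would run each test case in a sandboxed build.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Finding:
    """A model-proposed bug report (all fields are illustrative)."""
    component: str        # area of the codebase the model inspected
    test_case: str        # reproducible input generated by the model
    crash_signature: str  # hypothetical dedup key, e.g. top stack frame


def validate_and_dedupe(
    candidates: Iterable[Finding],
    reproduces: Callable[[Finding], bool],
) -> List[Finding]:
    """Keep only findings whose test case reproduces, one per signature."""
    unique: Dict[str, Finding] = {}
    for finding in candidates:
        # Drop claims the harness cannot confirm by actually running the test.
        if not reproduces(finding):
            continue
        # The first validated finding for each signature wins; later ones
        # with the same signature are treated as duplicates.
        unique.setdefault(finding.crash_signature, finding)
    return list(unique.values())
```

The design point is that the model never gets the last word: only findings confirmed by execution reach triage, which keeps hallucinated reports out of the normal bug lifecycle.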

Upcoming MLOps Events

The MLOps ecosystem continues to grow at breakneck speed, making it ever harder for us as practitioners to stay up to date with relevant developments. A fantastic way to keep on top of relevant resources is through the great community and events that the MLOps and Production ML ecosystem offers. This is the reason why we have started curating a list of upcoming events in the space, which are outlined below.

Events we are speaking at this year:

eTail Europe - March @ Berlin
World Summit AI Europe - September @ Amsterdam

Other relevant events:

KubeCon Europe - March @ Amsterdam
PyData Berlin - April @ Frankfurt
Databricks Summit - June @ San Francisco
World Developer Congress - July @ Berlin
EuroPython 2026 - July @ Prague
EuroSciPy 2026 - July @ Krakow
AI Infra Summit 2026 - Sept @ California
Code.Talks 2026 - Nov @ Hamburg
MLOps World 2026 - Nov @ Austin

Open Source MLOps Tools

Check out the fast-growing ecosystem of production ML tools & frameworks at the GitHub repository, which has reached over 20,000 ⭐ GitHub stars. We are currently looking for more libraries to add - if you know of any that are not listed, please let us know or feel free to open a PR. Here are a few featured open source libraries that we maintain:

SARC - Provides wrappers for popular agentic frameworks to enable guardrails and constraints that are enforced through the flow.
KAOS - K8s Agent Orchestration Service for managing the KAOS in large-scale distributed agentic systems.
Kompute - Blazing fast, lightweight and mobile phone-enabled GPU compute framework optimized for advanced data processing use cases.
Production ML Tools - A curated list of tools to deploy, monitor and optimize machine learning systems at scale.
AI Policy List - A mature list that maps the ecosystem of artificial intelligence guidelines, principles, codes of ethics, standards, regulation and beyond.
Agentic Systems Tools - A new list that aims to map the emerging ecosystem of agentic systems with tools and frameworks for scaling this domain.

Please do support some of our open source projects by sharing, contributing or adding a star ⭐

About us

The Institute for Ethical AI & Machine Learning is a European research centre that carries out world-class research into responsible machine learning.

Check out our website

The Machine Learning Engineer
