Issue #387 - The ML Engineer 🤖
Toto 2.0 Time Series Foundation Model, TabPFN-3 Tabular Foundation Models, Stanford SWE Real-World Dataset, DeepMind AlphaEvolve Scaling Impact, Mozilla on Mythos Vulnerabilities + more 🚀
Thank you for being part of over 70,000 ML professionals and enthusiasts who receive weekly articles & tutorials on Machine Learning & MLOps 🤖 You can join the newsletter at https://bit.ly/state-of-ml-2025 ⭐
If you like the content please support the newsletter by sharing with your friends via ✉️ Email, 🐦 Twitter, 💼 LinkedIn and 📕 Facebook!
This week in ML Engineering:
Toto 2.0 Time Series Foundation Model
TabPFN-3 Tabular Foundation Models
Stanford SWE Real-World Dataset
DeepMind AlphaEvolve Scaling Impact
Mozilla on Mythos Vulnerabilities
Open Source ML Frameworks
Awesome AI Guidelines to check out this week
+ more 🚀
Toto 2.0 Time Series Foundation Model
Datadog has just released a major update to its time series foundation model for observability, Toto 2.0, and this is really exciting for production ML practitioners: Toto 2.0 is a new Apache 2.0 open-weights model family ranging from a small 4M parameters all the way to 2.5B parameters. At least on Datadog’s own observability-heavy benchmark there seems to be real potential for real-world use, and the results suggest that domain-specific time-series foundation models are a realistic path for practical use. It is also good to see Datadog be quite honest about the remaining gaps, as even the largest model still shows long-horizon drift and structural breakdown past training context, so classical baselines are not going away. But overall this feels like a meaningful step towards forecasting foundation models becoming a real option, and potentially towards broader observability models that reason across metrics, traces, logs, topology, code changes, alerts and events for proactive incident detection.
TabPFN-3 Tabular Foundation Models
Tabular ML is still where a huge amount of production ML actually happens, so it is great to see the latest Tabular Foundation Model release from Prior Labs: TabPFN-3 is live with support for up to 1M training rows, row-chunking, a reduced KV-cache, native missing-value handling, many-class classification up to 160 classes, GPU-side preprocessing, and much faster inference than TabPFN-2.5. The benchmarks are interesting: TabPFN-3 reports better performance than tuned and ensembled baselines on TabArena and beats 8-hour-tuned gradient-boosted-tree baselines on datasets up to 1M rows and 200 features, although it will also be interesting to see real-world and competitive benchmarks as further alternatives arise. For production ML practitioners, the interesting part is less the leaderboard and more the potential that we’re moving to a world where we can get faster baselines, less painful hyperparameter search, better calibrated predictive distributions, CPU-friendly distillation and faster SHAP-style interpretability workflows.
Stanford SWE Real-World Dataset
Stanford has just published a really interesting new dataset of real coding-agent sessions using public GitHub repos with ~6K sessions, 63K user prompts, 355K tool calls, git-linked diffs, and line-level attribution of whether code was written by humans or agents. It is interesting to see that coding-agent usage is already becoming extremely bimodal, with around 41% of sessions basically “vibe coding” where the agent writes almost all committed code, while 23% are still human-only. However, the important takeaway is that coding agents are still very inefficient and risky when used in the wild: only ~44% of agent-produced code survives into commits, users push back or interrupt in roughly 44% of turns, and vibe-coded commits introduce substantially more Semgrep-detected vulnerabilities than human-only or collaborative coding. For production ML practitioners building coding agents, evals, IDE copilots, or internal developer tooling, this is a strong reminder that the winning pattern is probably not full autonomy, but better scaffolding around agents.
DeepMind AlphaEvolve Scaling Impact
Google DeepMind showcases how they are bringing AlphaEvolve to life as an optimization engine for the expensive parts of ML, science and infrastructure: The AlphaEvolve system from Google has now been applied across genomics, power grids, disaster prediction, quantum circuits, mathematics, TPU design, Spanner, compiler optimization, logistics, advertising, lithography and ML force fields, with some genuinely impressive reported numbers. As part of their report they outline a 30% reduction in DNA variant detection errors for DeepConsensus, AC Optimal Power Flow feasible-solution rates going from 14% to over 88%, 10x lower-error quantum circuits, 20% lower Spanner write amplification, and nearly 9% lower software storage footprint. It seems they also have use-cases to showcase, with Klarna doubling training speed and Schrödinger seeing roughly 4x speedups for MLFF training and inference.
Mozilla on Mythos Vulnerabilities
Mozilla shared a great behind-the-scenes look at how they used Claude Mythos Preview and other models to harden Firefox, and what is really interesting is that this was not just “LLM finds bugs” but a proper end-to-end security pipeline: The Mozilla team built an agentic harness on top of existing fuzzing infrastructure, where models could inspect risky parts of the browser codebase, generate reproducible test cases, run them, and then feed validated findings into the normal lifecycle for deduplication, triage, patching and release. The numbers are quite impressive, with Firefox 150 shipping fixes for 271 bugs found with Claude Mythos Preview, including 180 sec-high issues, and Mozilla fixing 423 security bugs across April releases when combining this pipeline with other AI models + manual review. This feels like a very clear MLSecOps pattern that more teams will need to adopt, where frontier models become scalable security-reasoning workers inside controlled harnesses that can execute tests, verify claims and eventually scan patches continuously in the CI/CD pipelines.
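To make the "validate before triage" step of such a harness concrete, here is a minimal Python sketch. It is not Mozilla's implementation: the Finding structure, the reproduces callback and the signature field are hypothetical names chosen for illustration, assuming each model-proposed finding carries a test case that can be re-run and a signature that can be used for deduplication.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Finding:
    """A candidate bug proposed by a model (hypothetical structure)."""
    component: str   # part of the codebase the model inspected
    test_case: str   # reproducer generated by the model
    signature: str   # e.g. a crash signature, used for deduplication


def validate_and_dedupe(
    candidates: Iterable[Finding],
    reproduces: Callable[[Finding], bool],
) -> list[Finding]:
    """Keep findings whose test case reproduces, one per signature."""
    seen: set[str] = set()
    validated: list[Finding] = []
    for finding in candidates:
        # Discard claims the harness cannot reproduce by running the test.
        if not reproduces(finding):
            continue
        # Collapse duplicates so triage sees each distinct issue once.
        if finding.signature in seen:
            continue
        seen.add(finding.signature)
        validated.append(finding)
    return validated
```

In a real setup the reproduces callback would execute the generated test case in a sandboxed build, and the validated list would be handed to the usual triage, patching and release process.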
Upcoming MLOps Events
The MLOps ecosystem continues to grow at breakneck speed, making it ever harder for us as practitioners to stay up to date with relevant developments. A fantastic way to keep on top of relevant resources is through the great community and events that the MLOps and Production ML ecosystem offers. This is the reason why we have started curating a list of upcoming events in the space, which are outlined below.
Events we are speaking at this year:
eTail Europe - March @ Berlin
World Summit AI Europe - September @ Amsterdam
Other relevant events:
KubeCon Europe - March @ Amsterdam
PyData Berlin - April @ Frankfurt
Databricks Summit - June @ San Francisco
World Developer Congress - July @ Berlin
EuroPython 2026 - July @ Prague
EuroSciPy 2026 - July @ Krakow
AI Infra Summit 2026 - Sept @ California
Code.Talks 2026 - Nov @ Hamburg
MLOps World 2026 - Nov @ Austin
In case you missed our talks, check our recordings below:
The State of AI in 2025 - WeAreDevelopers 2025
Prod Generative AI in 2024 - KubeCon AI Day 2025
The State of AI in 2024 - WeAreDevelopers 2024
Responsible AI Workshop Keynote - NeurIPS 2021
Practical Guide to ML Explainability - PyCon London
ML Monitoring: Outliers, Drift, XAI - PyCon Keynote
Metadata for E2E MLOps - KubeCon NA 2022
ML Performance Evaluation at Scale - KubeCon Eur 2021
Industry Strength LLMs - PyData Global 2022
ML Security Workshop Keynote - NeurIPS 2022
Open Source MLOps Tools
Check out the fast-growing ecosystem of production ML tools & frameworks at the GitHub repository, which has reached over 20,000 ⭐ GitHub stars. We are currently looking for more libraries to add - if you know of any that are not listed, please let us know or feel free to add a PR. Here are a few featured open source libraries that we maintain:
SARC - Provides wrappers for popular agentic frameworks to enable guardrails and constraints that are enforced through the flow.
KAOS - K8s Agent Orchestration Service for managing the KAOS in large-scale distributed agentic systems.
Kompute - Blazing fast, lightweight and mobile phone-enabled GPU compute framework optimized for advanced data processing use cases.
Production ML Tools - A curated list of tools to deploy, monitor and optimize machine learning systems at scale.
AI Policy List - A mature list that maps the ecosystem of artificial intelligence guidelines, principles, codes of ethics, standards, regulation and beyond.
Agentic Systems Tools - A new list that aims to map the emerging ecosystem of agentic systems with tools and frameworks for scaling this domain.
Please do support some of our open source projects by sharing, contributing or adding a star ⭐
About us
The Institute for Ethical AI & Machine Learning is a European research centre that carries out world-class research into responsible machine learning.
