Issue #389 - The ML Engineer 🤖
Engineering Like It's 2007, Netflix LLM Finetuning Infra, OpenAI & Anthropic Finding Market Fit, Reviving PapersWithCode.co, Massive Open Text-To-Image Dataset + more 🚀
Thank you for being part of the 70,000+ ML professionals and enthusiasts who receive weekly articles & tutorials on Machine Learning & MLOps 🤖 You can join the newsletter at https://bit.ly/state-of-ml-2025 ⭐
If you like the content please support the newsletter by sharing with your friends via ✉️ Email, 🐦 Twitter, 💼 LinkedIn and 📕 Facebook!
This week in ML Engineering:
Engineering Like It’s 2007
Netflix LLM Finetuning Infra
OpenAI & Anthropic Finding Market Fit
Reviving PapersWithCode.co
Massive Open Text-To-Image Dataset
Open Source ML Frameworks
Awesome AI Guidelines to check out this week
+ more 🚀
Let’s take a trip back to 2007! This engineering talk from YouTube is a master class in scaling and learning, and the lessons are surprisingly as valuable today as they were back then. The talk shows a lean and mean team iterating through bottlenecks across web serving, video delivery, thumbnails, databases, hardware, OS tuning, caching and sharding. The most useful lesson for ML engineering teams is that scale is rarely solved by one abstraction: YouTube kept the system simple enough to rewrite under pressure, used caching at multiple layers, and reasoned from first principles at every layer. It is pretty cool to see that they followed the standard scalability playbook, with the usual suspects of moving hot traffic through CDNs, tuning commodity hardware (maybe less common now), and eventually replacing replication tricks with database partitioning. For ML practitioners, the parallel is model serving, feature stores, vector databases, observability pipelines and agent systems, which are now hitting us with analogous challenges that we have to solve at lightning speed.
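For readers who have not worked with database partitioning, here is a minimal Python sketch of hash-based shard routing. It is an illustration under our own assumptions and not how YouTube implemented it: the names shard_for, route and SHARDS are hypothetical, and the choice of SHA-256 with a modulo is just one simple option.

```python
import hashlib

# Hypothetical shard names; in practice each would be a separate database.
SHARDS = ["db-0", "db-1", "db-2", "db-3"]


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash.

    Python's built-in hash() is salted per process for strings, so a
    cryptographic digest is used to keep routing consistent across restarts.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def route(user_id: str) -> str:
    """Return the name of the shard that owns this user's rows."""
    return SHARDS[shard_for(user_id, len(SHARDS))]
```

The same key always lands on the same shard, so reads and writes for one user hit a single database. One trade-off worth knowing: with plain modulo routing, changing the number of shards moves most keys, which is why consistent hashing is a common alternative.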
Netflix is sharing their playbook for massive scale LLM fine-tuning infrastructure: Netflix moved away from a few fine-tuning scripts to a managed framework that supports SFT, DPO, RL, distillation, checkpointing, MFU tracking, Hugging Face-compatible model/tokenizer flows, and distributed orchestration. It is interesting that Netflix has also chosen the usual suspect technologies like Ray, PyTorch and vLLM, together with custom tooling (e.g. Netflix’s internal Mako platform). The interesting takeaway for production ML practitioners is that post-training is challenging across every layer, including Data, Model, Compute and Workflow. Netflix reports up to 4.7x effective token throughput from asynchronous on-the-fly sequence packing, and this is a great example of the direction GenAI infrastructure is taking.
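To make the sequence packing idea more concrete, here is a minimal Python sketch of why packing raises effective token throughput. It uses a simple greedy first-fit strategy, which is our assumption for illustration only: Netflix describes an asynchronous on-the-fly approach, and this is not their implementation. The names pack_sequences and utilisation, and the example lengths, are hypothetical.

```python
from typing import List


def pack_sequences(lengths: List[int], max_tokens: int) -> List[List[int]]:
    """Greedy first-fit: put each sequence in the first row that has room."""
    rows: List[List[int]] = []  # each row holds the lengths packed into it
    free: List[int] = []        # remaining token capacity per row
    for n in lengths:
        if n > max_tokens:
            raise ValueError(f"sequence of {n} tokens exceeds max_tokens={max_tokens}")
        for i, room in enumerate(free):
            if n <= room:
                rows[i].append(n)
                free[i] -= n
                break
        else:
            # No existing row had room, so open a new one.
            rows.append([n])
            free.append(max_tokens - n)
    return rows


def utilisation(rows: List[List[int]], max_tokens: int) -> float:
    """Fraction of token slots that hold real tokens rather than padding."""
    return sum(sum(r) for r in rows) / (len(rows) * max_tokens)


lengths = [900, 100, 500, 300, 200, 24]
padded = [[n] for n in lengths]         # one sequence per row, padded to max_tokens
packed = pack_sequences(lengths, 1024)  # several sequences share a row
```

In this toy example the six padded rows collapse into two packed rows, so far fewer token slots are wasted on padding. A real training pipeline would also need attention masking so that sequences sharing a row do not attend to each other, which this sketch leaves out.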
OpenAI & Anthropic Finding Market Fit
OpenAI & Anthropic have found market fit: this is an interesting opinion piece from Simon Willison, which makes a good case that coding agents may be the first real product-market fit moment for frontier AI labs, because enterprises are now being charged close to raw API-token economics for daily developer workflows. OpenAI Codex and Anthropic Claude Code have shifted from heavily subsidised subscriptions to usage-based pricing (unfortunately for us, the subsidies are indeed ending). It is now clear that organisations will need to establish much tighter cost observability, usage governance, ROI measurement, procurement discipline, and platform controls around coding agents, just as they already do for serving and inference workloads. This starts making it clear why tools like LiteLLM or OpenRouter are becoming so popular, even if at the beginning it wasn’t clear why an extra abstraction layer was needed on top of a simple vendor API (e.g. surprisingly, many vendors still do not offer spend caps).
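As a rough illustration of what a spend cap in such an abstraction layer might look like, here is a minimal Python sketch. It does not use the API of LiteLLM, OpenRouter or any vendor: SpendGuard, BudgetExceeded, the per-team grouping and the price per 1k tokens are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


class BudgetExceeded(RuntimeError):
    """Raised when a request would push a team over its spend cap."""


@dataclass
class SpendGuard:
    cap_usd: float            # hypothetical per-team cap
    usd_per_1k_tokens: float  # made-up flat price, not a real vendor rate
    spent_by_team: Dict[str, float] = field(default_factory=dict)

    def record(self, team: str, tokens: int) -> float:
        """Add the cost of a request, refusing it if the cap would be exceeded."""
        cost = tokens / 1000 * self.usd_per_1k_tokens
        total = self.spent_by_team.get(team, 0.0) + cost
        if total > self.cap_usd:
            raise BudgetExceeded(
                f"{team} would reach ${total:.2f}, above the ${self.cap_usd:.2f} cap"
            )
        self.spent_by_team[team] = total
        return total


guard = SpendGuard(cap_usd=10.0, usd_per_1k_tokens=0.01)
running_total = guard.record("platform", 500_000)  # allowed, stays under the cap
```

A gateway would call something like record with an estimated token count before forwarding a request, and the same per-team totals double as a basic cost observability feed.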
I remember when Papers With Code originally came out in 2018 it was a major breakthrough; after the Meta acquisition the project slowed and then stopped, but it seems there is an attempt to revive it! Hugging Face has started reviving it as paperswithcode.co with support from some AI agents to help with the parsing of papers, auto-linking GitHub repos, project pages and artifacts, categorizing, and generating leaderboards. The new site already brings back the familiar discovery workflow around trending papers, SOTA browsing, methods and domains, while adding support for star-velocity trends, citation counts, external non-arXiv papers, multiple repos per paper, benchmark harness reports, and Hugging Face-native storage/login integration. As research moves faster (whether AI slop or otherwise), a well-maintained discovery layer is still a basic need, so it’s great to see projects like this. Hopefully it will continue growing.
Massive Open Text-To-Image Dataset
An interesting release of a new massive open text-to-image dataset: the MONET dataset. It is great to see this, as image generation depends not only on quality data but also on other key resources like benchmarks, which can only help teams accelerate in this field. Better curation, filtering, captions, and provenance can really make quite a difference. This dataset consists of 104.9M curated image–text pairs distilled from 2.9B raw pairs, with safety filtering, domain filtering, exact/near duplicate removal, multi-VLM re-captioning, embeddings, object/face annotations, hashes, NSFW/watermark scores, and pre-encoded SANA-VAE latents for faster latent-diffusion training.
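As a small illustration of one of the curation steps listed above, here is a minimal Python sketch of exact duplicate removal by hashing image bytes. It is a simplified assumption of how such a step can work and not the MONET pipeline: the function name drop_exact_duplicates and the (image bytes, caption) pair layout are hypothetical, and near-duplicate removal would need something like perceptual hashing or embedding similarity, which is not shown.

```python
import hashlib
from typing import Iterable, Iterator, Set, Tuple

Pair = Tuple[bytes, str]  # (raw image bytes, caption)


def drop_exact_duplicates(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Yield only the first pair seen for each distinct image content."""
    seen: Set[str] = set()
    for image, caption in pairs:
        digest = hashlib.sha256(image).hexdigest()
        if digest in seen:
            continue  # byte-identical image already kept
        seen.add(digest)
        yield image, caption


raw = [(b"img-a", "a red bike"), (b"img-a", "bicycle, red"), (b"img-b", "a cat")]
unique = list(drop_exact_duplicates(raw))
```

Because the function streams pairs and only stores digests, memory grows with the number of distinct images rather than with their size. At billions of pairs a real pipeline would likely need a distributed or on-disk set instead of an in-memory one.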
Upcoming MLOps Events
The MLOps ecosystem continues to grow at breakneck speed, making it ever harder for us as practitioners to stay up to date with relevant developments. A fantastic way to keep on top of relevant resources is through the great community and events that the MLOps and Production ML ecosystem offers. This is the reason why we have started curating a list of upcoming events in the space, which are outlined below.
Events we are speaking at this year:
eTail Europe - March @ Berlin
World Summit AI Europe - September @ Amsterdam
Other relevant events:
KubeCon Europe - March @ Amsterdam
PyData Berlin - April @ Frankfurt
Databricks Summit - June @ San Francisco
World Developer Congress - July @ Berlin
EuroPython 2026 - July @ Prague
EuroSciPy 2026 - July @ Krakow
AI Infra Summit 2026 - Sept @ California
Code.Talks 2026 - Nov @ Hamburg
MLOps World 2026 - Nov @ Austin
In case you missed our talks, check our recordings below:
The State of AI in 2025 - WeAreDevelopers 2025
Prod Generative AI in 2024 - KubeCon AI Day 2025
The State of AI in 2024 - WeAreDevelopers 2024
Responsible AI Workshop Keynote - NeurIPS 2021
Practical Guide to ML Explainability - PyCon London
ML Monitoring: Outliers, Drift, XAI - PyCon Keynote
Metadata for E2E MLOps - KubeCon NA 2022
ML Performance Evaluation at Scale - KubeCon EU 2021
Industry Strength LLMs - PyData Global 2022
ML Security Workshop Keynote - NeurIPS 2022
Open Source MLOps Tools
Check out the fast-growing ecosystem of production ML tools & frameworks at the GitHub repository, which has reached over 20,000 ⭐ GitHub stars. We are currently looking for more libraries to add - if you know of any that are not listed, please let us know or feel free to open a PR. Here are a few featured open source libraries that we maintain:
SARC - Provides wrappers for popular agentic frameworks to enable guardrails and constraints that are enforced through the flow.
KAOS - K8s Agent Orchestration Service for managing the KAOS in large-scale distributed agentic systems.
Kompute - Blazing fast, lightweight and mobile phone-enabled GPU compute framework optimized for advanced data processing use cases.
Production ML Tools - A curated list of tools to deploy, monitor and optimize machine learning systems at scale.
AI Policy List - A mature list that maps the ecosystem of artificial intelligence guidelines, principles, codes of ethics, standards, regulation and beyond.
Agentic Systems Tools - A new list that aims to map the emerging ecosystem of agentic systems with tools and frameworks for scaling this domain.
Please do support some of our open source projects by sharing, contributing or adding a star ⭐
About us
The Institute for Ethical AI & Machine Learning is a European research centre that carries out world-class research into responsible machine learning.
