Issue #390 - The ML Engineer 🤖

C++ The Documentary, Stanford Language Modeling from Scratch, Tokenomics Where Tokens are Used, LLMs Need New Routing Paradigm, NVIDIA RTX Spark Launch + more 🚀

Jun 07, 2026

Thank you for being part of the 70,000+ ML professionals and enthusiasts who receive weekly articles & tutorials on Machine Learning & MLOps 🤖 You can join the newsletter at https://bit.ly/state-of-ml-2025 ⭐

If you like the content, please support the newsletter by sharing with your friends via ✉️ Email, 🐦 Twitter, 💼 LinkedIn and 📕 Facebook!

This week in ML Engineering:

C++: The Documentary
Stanford Language Modeling from Scratch
Tokenomics Where Tokens are Used
LLMs Need New Routing Paradigm
NVIDIA RTX Spark Launch
Open Source ML Frameworks
Awesome AI Guidelines to check out this week
+ more 🚀

C++: The Documentary

I loved the C++ Documentary! People may not know but C++ is (shamelessly) my favourite language; I maintain an active (2k stars) OSS C++ codebase for fun, and did a lot of prod C++ dev back in the day, so this documentary was thoroughly enjoyable to watch. Despite all the hate, C++ is arguably one of the most important languages of our times; a large share of the modern AI foundation is built upon C++. It may not be as visible to Python/JS devs, and not as loved by Rust fans, but it’s still a fast-growing language that supports the performance and scale of many important systems today. This is an absolutely recommended watch for any ML practitioner, as it covers how C++ began when Bjarne wanted to combine C’s low-level interoperability with advanced classes and modern language design. It’s interesting to see the evolution across Bell Labs, Cfront, standardization, STL, C++98, the Java/C# era, and the C++11 renaissance; funnily enough, my main drivers tend to be C++11, with a tiny bit of C++14 / C++17, however I still haven’t found a strong reason to move forward. Maybe once modules mature! Or an integrated package manager?! We all make do with CMake, but it’s time to move on! Also I loved Bjarne’s quote: “It should’ve been called ++C for semantic reasons” - LOL. Also hilarious to hear that C++ v2, released as 2.00, was known as 2.“uh oh...”. Plus Alexander Stepanov (STL designer) on standards: “Standards are not necessarily good or correct, but they are standards [...] traffic laws [for example] are very often idiotic. But we have to obey them otherwise we will die”. I loved all of these small nuggets of humorous archeological history on the language. Check it out!

Stanford Language Modeling from Scratch

Stanford has published their CS336 Undergraduate Module on Language Modeling from Scratch. Quality SotA learning material from Stanford? For free? What a time to be alive! This is one of the most practical public resources for ML engineers who want to move beyond API-level LLM usage, and instead actually understand the full foundation-model stack end-to-end. The course walks through building language models from first principles, including tokenization, Transformer architecture, optimizers, GPU/resource accounting, Triton kernels, FlashAttention-style optimization, distributed training, scaling laws, inference, evaluation, etc. The assignments are deliberately low-scaffolding and require substantial Python/PyTorch engineering - they are really chunky, so prepare to spend a few weeks actually plowing through these. The public GitHub repo also includes lecture materials and assignment repos, with the lecture repo structured around executable lecture files and PDFs. I am impressed by the depth and quality of this course, which is unsurprising coming from Stanford, but it is still shocking to realise how much high-quality content is freely available on one of the most important topics today.

Tokenomics Where Tokens are Used

Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering. Super insightful research piece from Concordia University that analyses telemetry from agentic engineering clients across various development tasks. Some interesting findings presented: it seems that code review still dominates token usage, consuming 59.4% of tokens on average, while initial coding is only 8.6%. As part of the cost analysis, input tokens make up 53.9% of usage, suggesting that multi-agent systems are paying a large “communication tax” from repeatedly passing context between agents. For production ML practitioners building coding agents, a key insight is that cost control should focus on review/refinement loops, context management, and human-in-the-loop checkpoints rather than only optimizing model calls for generation. The main caveat is that the study is quite small, so it will be interesting to see how the findings hold up as it is scaled further, as new methodologies will likely emerge as well.
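To get a feel for where a “communication tax” can come from, here is a toy sketch, not the study's methodology. It assumes a hypothetical chain of agents where each one re-reads the full accumulated context before writing its own output. The function name and all token counts are made up for illustration and are not taken from the paper.

```python
def token_usage(task_prompt_tokens, outputs_per_step):
    """Toy model of a sequential agent handoff (hypothetical, for illustration).

    Every agent receives the whole context so far as input tokens,
    then appends its own output to that context for the next agent.
    Returns (total_input_tokens, total_output_tokens).
    """
    context = task_prompt_tokens
    input_total = 0
    output_total = 0
    for output_tokens in outputs_per_step:
        input_total += context          # the agent re-reads everything so far
        output_total += output_tokens   # then writes its own contribution
        context += output_tokens        # which grows the context for the next agent
    return input_total, output_total


# Made-up numbers: a 1,000-token task and four agents writing 500 tokens each.
input_tokens, output_tokens = token_usage(1000, [500, 500, 500, 500])
input_share = input_tokens / (input_tokens + output_tokens)
print(f"input={input_tokens} output={output_tokens} input share={input_share:.1%}")
```

Even in this simplified setup the input side grows with every handoff while output stays flat per agent, which is why trimming or summarising the context passed between agents tends to be the first lever worth trying.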

LLMs Need New Routing Paradigm

LLM inference has a routing problem: once every request can depend on cache locality, GPU specialization, and multi-step execution, the router directly affects latency, cost, and user experience. Modular put together a really interesting series on LLM Inference Routing, which is probably one of the most comprehensive breakdowns for scaling LLM serving beyond simple load balancing. Modular explains why traditional HTTP routers break down when GPU inference pods are stateful, heterogeneous, cache-sensitive, and sometimes split across prefill/decode execution paths. Part 1 frames the core problem around KV-cache residency, hardware specialization, conversation continuity, and multi-step request execution; Part 2 shows why the router needs a hot-path data layer that can query cached token blocks across pods in microseconds; and Part 3 turns that state into a composable routing pipeline covering preparation, filtering, scoring, picking, and execution. Lately I have seen quite a few posts discussing the challenges and potential solutions in LLM routing, so it is clear not only that this is a huge problem, but also that understanding the foundations will be a useful skill as these tools become more ubiquitous.
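To make the pipeline idea a bit more concrete, here is a minimal sketch of cache-aware routing with preparation, filtering, scoring and picking stages. This is not Modular's implementation or API: the `Pod` class, `BLOCK_SIZE`, the chained block hashing and the scoring formula are all hypothetical choices made for illustration, and the execution stage (actually dispatching the request) is left out.

```python
from dataclasses import dataclass, field
from hashlib import sha256

BLOCK_SIZE = 4  # tokens per cache block (hypothetical; kept tiny for the example)


@dataclass
class Pod:
    name: str
    role: str          # "prefill", "decode" or "any"
    queue_depth: int   # in-flight requests, used as a simple load signal
    cached_blocks: set[str] = field(default_factory=set)


def block_hashes(tokens):
    """Preparation: hash each full block, chained with the previous hash,
    so two prompts share a hash only if they share the whole prefix."""
    hashes, prev = [], ""
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        chunk = ",".join(str(t) for t in tokens[i:i + BLOCK_SIZE])
        prev = sha256((prev + "|" + chunk).encode()).hexdigest()
        hashes.append(prev)
    return hashes


def cached_prefix_len(pod, hashes):
    """Number of leading blocks of the prompt already resident on the pod."""
    count = 0
    for h in hashes:
        if h not in pod.cached_blocks:
            break
        count += 1
    return count


def route(tokens, pods, role="prefill", max_queue=8, load_penalty=0.5):
    hashes = block_hashes(tokens)                       # preparation
    candidates = [p for p in pods                       # filtering
                  if p.role in (role, "any") and p.queue_depth < max_queue]
    if not candidates:
        raise RuntimeError("no eligible pod for this request")

    def score(pod):                                     # scoring
        return cached_prefix_len(pod, hashes) - load_penalty * pod.queue_depth

    # Picking: highest score wins, ties broken by name for determinism.
    return min(candidates, key=lambda p: (-score(p), p.name))


prompt = list(range(12))                                # three full blocks
warm = Pod("pod-a", "prefill", queue_depth=2,
           cached_blocks=set(block_hashes(prompt)[:2]))
cold = Pod("pod-b", "prefill", queue_depth=0)
print(route(prompt, [warm, cold]).name)
```

With these made-up weights the warm pod is preferred as long as its cache hit outweighs its queue; push its queue depth high enough and the router falls back to the idle pod, which is the kind of trade-off a plain round-robin HTTP balancer cannot express.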

NVIDIA RTX Spark Launch

Last week NVIDIA and Microsoft announced the release of the NVIDIA RTX Spark, and it’s really exciting to see that unified memory is finally coming outside of Apple’s M-series Mac chips, and it seems that NVIDIA is indeed going all in. I found it actually surprising how small the chip is, and I struggle to process the claim that this chip has the same/similar performance to a 5070, when we are talking about a fraction of the size of the video card. It was also impressive to see how small and thin the machines showcased with the chips were. If this indeed matches the hype and expectations, it is clear that this will be another huge blow to the Apple ecosystem, with PCs finally being bullish on ARM processors. This also gives me hope for a proper local-only rig that can run SotA models on consumer hardware, with the cost and extensibility of a PC. Let’s see how this actually develops!

Upcoming MLOps Events

The MLOps ecosystem continues to grow at breakneck speed, making it ever harder for us as practitioners to stay up to date with relevant developments. A fantastic way to keep on top of relevant resources is through the great community and events that the MLOps and Production ML ecosystem offers. This is the reason why we have started curating a list of upcoming events in the space, which are outlined below.

Events we are speaking at this year:

Signals Conference - September @ Berlin
World Summit AI Europe - September @ Amsterdam

Other relevant events:

KubeCon Europe - March @ Amsterdam
PyData Berlin - April @ Frankfurt
Databricks Summit - June @ San Francisco
World Developer Congress - July @ Berlin
EuroPython 2026 - July @ Prague
EuroSciPy 2026 - July @ Krakow
AI Infra Summit 2026 - September @ California
Code.Talks 2026 - November @ Hamburg
MLOps World 2026 - November @ Austin

Open Source MLOps Tools

Check out the fast-growing ecosystem of production ML tools & frameworks at the GitHub repository, which has reached over 20,000 ⭐ GitHub stars. We are currently looking for more libraries to add - if you know of any that are not listed, please let us know or feel free to open a PR. Here are a few featured open source libraries that we maintain:

SARC - Provides wrappers for popular agentic frameworks to enable guardrails and constraints that are enforced through the flow.
KAOS - K8s Agent Orchestration Service for managing the KAOS in large-scale distributed agentic systems.
Kompute - Blazing fast, lightweight and mobile phone-enabled GPU compute framework optimized for advanced data processing use cases.
Production ML Tools - A curated list of tools to deploy, monitor and optimize machine learning systems at scale.
AI Policy List - A mature list that maps the ecosystem of artificial intelligence guidelines, principles, codes of ethics, standards, regulation and beyond.
Agentic Systems Tools - A new list that aims to map the emerging ecosystem of agentic systems with tools and frameworks for scaling this domain.

Please do support some of our open source projects by sharing, contributing or adding a star ⭐

About us

The Institute for Ethical AI & Machine Learning is a European research centre that carries out world-class research into responsible machine learning.

Check out our website
