Issue #392 - The ML Engineer
Datadog's State of AI Engineering, 10 Years of ClickHouse, Cornell: Advanced Compilers, DuckDB Internals and Speed, Bringing GPU Kernels to Rust + more
Thank you for being part of a community of 70,000+ ML professionals and enthusiasts who receive weekly articles & tutorials on Machine Learning & MLOps. You can join the newsletter at https://bit.ly/state-of-ml-2025
If you like the content please support the newsletter by sharing with your friends via Email, Twitter, LinkedIn and Facebook!
This week in ML Engineering:
Datadog's State of AI Engineering
10 Years of ClickHouse
Cornell: Advanced Compilers
DuckDB Internals and Speed
Bringing GPU Kernels to Rust
Open Source ML Frameworks
Awesome AI Guidelines to check out this week
+ more
Datadog's State of AI Engineering
Datadog has analysed LLM telemetry data from more than a thousand Datadog customers and published a snapshot on the "State of AI Engineering" - here are some highlights: Telemetry data shows that more than 70% of organizations use three or more models, agent framework adoption has nearly doubled, and 69% of input tokens are system prompts. On the cost and reliability side, prompt caching is still only present in 28% of eligible calls, and rate limits are one of the biggest reliability problems in production LLM calls. The trends show that teams are now running multi-model stacks, heavier agent frameworks, long system prompts, and tool-heavy workflows that behave much more like distributed systems than simple API integrations. It is clear that AI engineering now needs the same discipline as platform engineering (+ always has!), including model gateways, continuous evals, prompt layout hygiene, caching, context engineering, budgets, backoff, queues, and fallback capacity. The report also shows that most agents are still fairly simple from a service-topology perspective, with 59% making only one service call, so the next set of engineering challenges will come from tracing, debugging, and governing these systems as they become more distributed.
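To make the "backoff and fallback capacity" point a bit more concrete, here is a minimal sketch of retrying rate-limited calls with jittered exponential backoff before falling back to a second model. It is not taken from the Datadog report: `RateLimitError`, `call_with_backoff`, `fake_call` and the model names are all hypothetical, and a real client library will have its own error types.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error raised by a model client when it is rate limited."""


def call_with_backoff(call, models, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Try each model in order, retrying rate-limited calls with jittered backoff.

    `call` is any function that takes a model name and returns a response.
    """
    last_error = None
    for model in models:
        for attempt in range(max_retries):
            try:
                return model, call(model)
            except RateLimitError as err:
                last_error = err
                # Exponential backoff with full jitter: 0..base_delay * 2^attempt seconds.
                sleep(random.uniform(0, base_delay * 2 ** attempt))
        # Retries exhausted for this model: fall through to the next one.
    raise RuntimeError("all models exhausted") from last_error


# Usage with a fake client where the primary model is always rate limited.
def fake_call(model):
    if model == "primary-model":
        raise RateLimitError("429")
    return f"response from {model}"


used_model, response = call_with_backoff(
    fake_call, ["primary-model", "fallback-model"], sleep=lambda seconds: None
)
print(used_model, response)
```

The jitter spreads retries out so that many workers hitting the same limit do not all retry at the same moment, and the ordered model list is the simplest possible form of fallback capacity.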
ClickHouse just turned 10 years old as an open-source project, and their CTO Alexey Milovidov published a fantastic engineering history of what it takes to build a production database from actual pain points: ClickHouse evolved from OLAPServer and Metrage into a from-scratch columnar DBMS. Along the way the team added key differentiating features like in-memory columns, aggregate functions, table engines, compression, SQL parsing, MergeTree for background sorting, and ReplicatedMergeTree for multi-DC production use. These grew out of wanting to address the pains of growing data volumes, real-time logs, slow MySQL paths, custom C++ data structures, and users waiting for analytics to load. For us ML Engineering / MLOps practitioners it's a good reminder that AI systems now need exactly the same properties ClickHouse was built around, including fast analytical queries over high-volume event data, long-retention observability, cheap aggregation, and infrastructure that can handle messy real-time workloads. It is also a good reminder that serious open source is not just throwing code on GitHub; ClickHouse is a good example of a project that has led the way here as well.
Compilers are one of the most fun and challenging domains in computer science, and they are becoming a production ML topic again - this course from Cornell is a fantastic deep dive into modern / advanced compiler concepts: Given the importance of efficiency in ML these days, we can no longer expect to simply throw more and larger GPUs at the problem, especially as teams run into graph breaks, custom kernels, dynamic shapes, hardware-specific inference paths, and performance issues that cannot be solved by just asking for bigger hardware. Cornell's CS 6120 Advanced Compilers course is now available as a self-guided online course, and it looks like a great resource for ML practitioners who want to understand the systems layer underneath torch.compile, XLA, MLIR, Triton, LLVM, and modern inference stacks. The course is PhD-level but still very hands-on, covering intermediate representations, data flow, SSA, local/global optimizations, loop optimization, LLVM passes, alias analysis, garbage collection, JIT/dynamic compilation, parallelism, and fast compilers through classic papers and open-source implementation tasks using LLVM and Bril. I haven't picked up the dragon book since back in university, and that's one of the books / courses I've enjoyed the most, so I am definitely adding this to my todo list; hopefully I get some time (as the list is only growing larger!!).
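For a flavour of what a "local optimization" looks like, here is a toy constant-folding pass over a single straight-line block. The instruction dictionaries are only loosely modelled on Bril's JSON form and are an assumption for illustration; `fold_constants` and `FOLDABLE` are hypothetical names, not part of the course material.

```python
import operator

# Arithmetic ops this toy pass knows how to fold.
FOLDABLE = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def fold_constants(instrs):
    """Local constant folding over one straight-line block (no labels or jumps)."""
    known = {}  # variable name -> constant value currently known for it
    out = []
    for instr in instrs:
        op, dest = instr["op"], instr.get("dest")
        args = instr.get("args", [])
        if op == "const":
            known[dest] = instr["value"]
            out.append(instr)
        elif op in FOLDABLE and all(a in known for a in args):
            # All operands are known constants: compute the result at compile time.
            value = FOLDABLE[op](*(known[a] for a in args))
            known[dest] = value
            out.append({"op": "const", "dest": dest, "value": value})
        else:
            # Result is not known statically: forget any stale constant for dest.
            if dest is not None:
                known.pop(dest, None)
            out.append(instr)
    return out


program = [
    {"op": "const", "dest": "a", "value": 4},
    {"op": "const", "dest": "b", "value": 2},
    {"op": "add", "dest": "sum", "args": ["a", "b"]},
    {"op": "mul", "dest": "prod", "args": ["sum", "b"]},
    {"op": "print", "args": ["prod"]},
]
folded = fold_constants(program)
for instr in folded:
    print(instr)
```

After the pass, the `add` and `mul` instructions have become `const` instructions, and a follow-up dead code elimination pass could then remove the definitions that are no longer used.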
DuckDB is becoming one of the default local engines for ML/data workflows, but have you wondered what's going on in its internals to make it work so well? A lot of the speed comes from the fact that it runs in-process, so Python/R applications avoid the server round trip and a lot of the row-by-row serialization overhead that still shows up with ODBC/JDBC-style paths. Understanding the rest means looking at the query setup path, including parsing, binding, and optimizer passes such as filter pushdown, subquery unnesting, join ordering, and runtime join-filter pushdown. This is followed by the physical plan, which is split into pipelines separated by sinks like GROUP BY, ORDER BY, and hash-join build phases. DuckDB's native format and Parquet both give it columnar reads, row-group statistics / zone maps, and byte-range reads on remote files, so many feature analysis, eval, debugging, and batch analytics workloads can stay in simple Parquet + SQL without immediately reaching for a warehouse job or distributed cluster. This multi-part series is a great way to get started with the DuckDB internals, so definitely recommended as a deep dive resource.
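The row-group statistics / zone map idea is easy to sketch in plain Python. The snippet below is a conceptual illustration only and is not DuckDB's implementation: `RowGroup`, `build_row_groups` and `scan_greater_than` are hypothetical names, and the group size of 10 is chosen just to keep the example small.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """A chunk of one column plus the min/max statistics stored alongside it."""
    values: list
    min_value: int
    max_value: int


def build_row_groups(values, group_size):
    """Split a column into fixed-size row groups and record min/max per group."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def scan_greater_than(groups, threshold):
    """Return matching values and how many row groups were actually read."""
    matches, groups_read = [], 0
    for group in groups:
        if group.max_value <= threshold:
            continue  # statistics prove no row can match, so skip without reading
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


# Roughly sorted data (for example timestamps) makes min/max pruning effective.
groups = build_row_groups(list(range(100)), group_size=10)
matches, groups_read = scan_greater_than(groups, threshold=84)
print(len(matches), "matches from", groups_read, "of", len(groups), "row groups")
```

The same principle is what lets a pushed-down filter skip whole row groups of a Parquet file, which matters even more when each skipped group would otherwise be a byte-range read against remote storage.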
NVIDIA has published a new framework + paper to bring Rust's memory safety into GPU kernels! Unfortunately it is still CUDA-specific, but this does look like quite an exciting leap! We all know that GPU kernel work is becoming a much bigger part of production ML engineering, especially as teams push harder on custom approaches to unlock every bit of performance. This paper introduces cuTile Rust, a tile-based GPU kernel system that brings Rust's ownership model across the CPU/GPU boundary, including mutable tensors that are split into disjoint partitions, immutable tensors that are shared safely, and kernel launches that preserve ownership while GPU work is still running. It's impressive to see that the safety features don't seem to add a major performance bottleneck - on B200 they report 7 TB/s for elementwise operations and around 2 PFLOP/s for GEMM, reaching 96% of cuBLAS. For us ML Engineering / MLOps practitioners this could be the start of an exciting trend: although right now it is limited to NVIDIA/CUDA, this could most likely come to other frameworks like Vulkan very soon, and unlock cross-vendor GPU compute.
Upcoming MLOps Events
The MLOps ecosystem continues to grow at break-neck speed, making it ever harder for us as practitioners to stay up to date with relevant developments. A fantastic way to keep on top of relevant resources is through the great community and events that the MLOps and Production ML ecosystem offers. This is the reason why we have started curating a list of upcoming events in the space, which are outlined below.
Events we are speaking at this year:
Signals Conference - September @ Berlin
World Summit AI Europe - September @ Amsterdam
Other relevant events:
KubeCon Europe - March @ Amsterdam
PyData Berlin - April @ Frankfurt
Databricks Summit - June @ San Francisco
World Developer Congress - July @ Berlin
EuroPython 2026 - July @ Prague
EuroSciPy 2026 - July @ Krakow
AI Infra Summit 2026 - Sept @ California
Code.Talks 2026 - Nov @ Hamburg
MLOps World 2026 - Nov @ Austin
In case you missed our talks, check our recordings below:
The State of AI in 2025 - WeAreDevelopers 2025
Prod Generative AI in 2024 - KubeCon AI Day 2025
The State of AI in 2024 - WeAreDevelopers 2024
Responsible AI Workshop Keynote - NeurIPS 2021
Practical Guide to ML Explainability - PyCon London
ML Monitoring: Outliers, Drift, XAI - PyCon Keynote
Metadata for E2E MLOps - KubeCon NA 2022
ML Performance Evaluation at Scale - KubeCon Eur 2021
Industry Strength LLMs - PyData Global 2022
ML Security Workshop Keynote - NeurIPS 2022
Open Source MLOps Tools
Check out the fast-growing ecosystem of production ML tools & frameworks at the GitHub repository, which has reached over 20,000 GitHub stars. We are currently looking for more libraries to add - if you know of any that are not listed, please let us know or feel free to add a PR. Here are a few featured open source libraries that we maintain:
SARC - Provides wrappers for popular agentic frameworks to enable guardrails and constraints that are enforced through the flow.
KAOS - K8s Agent Orchestration Service for managing the KAOS in large-scale distributed agentic systems.
Kompute - Blazing fast, lightweight and mobile phone-enabled GPU compute framework optimized for advanced data processing use cases.
Production ML Tools - A curated list of tools to deploy, monitor and optimize machine learning systems at scale.
AI Policy List - A mature list that maps the ecosystem of artificial intelligence guidelines, principles, codes of ethics, standards, regulation and beyond.
Agentic Systems Tools - A new list that aims to map the emerging ecosystem of agentic systems with tools and frameworks for scaling this domain.
Please do support some of our open source projects by sharing, contributing or adding a star.
About us
The Institute for Ethical AI & Machine Learning is a European research centre that carries out world-class research into responsible machine learning.
