Issue #381 - The ML Engineer 🤖
Alibaba Releasing QWEN-3.6, Google DeepMind on TurboQuant, MIT FlowMatch & Diffusion Models, ngrok on Quantization from Scratch, Stanford Transformer Course + more 🚀
Thank you for being one of over 70,000 ML professionals and enthusiasts who receive weekly articles & tutorials on Machine Learning & MLOps 🤖 You can join the newsletter https://ethical.institute/mle.html ⭐
If you like the content please support the newsletter by sharing with your friends via ✉️ Email, 🐦 Twitter, 💼 Linkedin and 📕 Facebook!
This week in ML Engineering:
Alibaba Releasing QWEN-3.6
Google DeepMind on TurboQuant
ngrok on Quantization from Scratch
Stanford Transformer Course
Open Source ML Frameworks
Awesome AI Guidelines to check out this week
+ more 🚀
China’s Alibaba continues to challenge the AI status quo, now with the release of Qwen3.6-Plus, which brings some really interesting performance innovations: Qwen3.6-Plus is Alibaba’s latest hosted frontier model for real-world agent workflows, with the biggest gains in agentic coding, tool use and multimodal reasoning. For production ML practitioners, the practical takeaway is that Qwen is optimizing for end-to-end task completion across repository-level coding, terminal operations, web/UI generation, document and video understanding, and multimodal agent loops. The model exposes a 1M-token context window, is served through OpenAI- and Anthropic-compatible APIs via Model Studio, and adds a preserve_thinking option which improves multi-step agent consistency while reducing redundant reasoning. From the benchmarks we can see that (allegedly) it is especially strong on coding-agent, planning, multilingual, OCR/document, and visual grounding tasks; so far performance remains mixed versus top competitors on some general reasoning and long-context evaluations. Overall it is fascinating to see such aggressive and close competition in a space we considered unbeatable only a year ago.
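Since the model is exposed through OpenAI-compatible APIs, calling it should look like any other chat completions request with one extra flag. The sketch below builds such a payload without sending it; the model id and the exact name/placement of the preserve_thinking field are assumptions for illustration based on the announcement, not confirmed API details, so check the Model Studio docs before using them.

```python
import json

def build_chat_request(prompt: str, preserve_thinking: bool = True) -> dict:
    """Build an OpenAI-style chat completions payload (not sent anywhere)."""
    return {
        "model": "qwen3.6-plus",  # assumed model id, verify in Model Studio
        "messages": [{"role": "user", "content": prompt}],
        # assumed flag name and top-level placement, not a confirmed field
        "preserve_thinking": preserve_thinking,
        "max_tokens": 512,
    }

payload = build_chat_request("Refactor the failing test in this repo.")
print(json.dumps(payload, indent=2))
```

In a real agent loop you would POST this payload to the Model Studio endpoint with your API key; keeping preserve_thinking on is what the release notes claim improves multi-step consistency.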
Google DeepMind redefines AI efficiency with extreme compression via TurboQuant: TurboQuant removes much of the usual metadata overhead in vector quantization, letting KV caches and vector indices run at much lower bitwidths without the normal quality penalty. The method combines compression with a 1-bit quantized residual correction step, which preserves attention accuracy and keeps memory overhead near zero (which is great). In Google’s reported experiments on long-context benchmarks and vector search, TurboQuant compressed KV cache representations down to 3 bits without training or fine-tuning, delivered at least 6x KV memory reduction on needle-in-a-haystack tasks, and showed up to 8x faster attention-logit computation at 4-bit versus 32-bit keys on H100s. For production ML practitioners, the takeaway is that this type of optimization offers clear wins: lower memory bandwidth pressure, cheaper long-context serving, and faster high-dimensional retrieval with minimal accuracy tradeoff.
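To build intuition for why a 1-bit residual helps, here is a minimal NumPy sketch: quantize a vector to 3 bits, then store only the sign of the residual plus one shared magnitude. This is an illustrative toy in the spirit of the idea, not DeepMind's actual TurboQuant algorithm; the quantizer and residual scheme are our own simplified assumptions.

```python
import numpy as np

def quantize(x, bits):
    """Symmetric per-vector quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q, scale

def dequantize(q, scale):
    return q * scale

rng = np.random.default_rng(0)
key = rng.standard_normal(128).astype(np.float32)  # one KV-cache key vector

q, scale = quantize(key, bits=3)
approx = dequantize(q, scale)

# 1-bit residual correction: keep only the sign of the residual per element
# plus a single shared magnitude (the mean absolute residual).
residual = key - approx
magnitude = np.abs(residual).mean()
corrected = approx + np.sign(residual) * magnitude

err_plain = np.linalg.norm(key - approx)
err_corrected = np.linalg.norm(key - corrected)
print(f"3-bit error {err_plain:.3f} -> with 1-bit residual {err_corrected:.3f}")
```

The correction is guaranteed to reduce squared error whenever the residual is nonzero (choosing the mean absolute residual as the magnitude minimizes the corrected error), which is why it can buy back accuracy at a cost of roughly one extra bit per element.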
MIT FlowMatch & Diffusion Models
Super excited to see a brand new 2026 course from MIT on the same class of models that power OpenAI, Anthropic and other LLM giants: MIT is providing its 2026 course 6.S184 on diffusion and flow matching models for FREE, aimed at practitioners who want hands-on experience. The course is a comprehensive introduction that covers the math behind modern generative models, spanning ODEs, SDEs, the Fokker–Planck equation, score matching, classifier-free guidance, latent diffusion, and discrete diffusion. It also includes comprehensive hands-on labs that walk learners through building key components and ultimately a latent diffusion model from scratch. For production ML practitioners this brings a lot of value, not just for shipping a model, but for gaining the conceptual and hands-on foundation needed to understand how today’s image and video generators work.
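As a taste of the material, the core of flow matching fits in a few lines: interpolate between a noise sample and a data point along a linear path, and regress a velocity field onto the path's velocity. The sketch below shows one such training target under the standard linear-path assumption; the zero-predicting "model" is a placeholder for a neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear (optimal-transport) probability path: interpolate noise -> data.
x0 = rng.standard_normal(2)        # sample from the noise distribution
x1 = np.array([3.0, -1.0])         # a "data" point
t = 0.3                            # a time drawn from [0, 1]

x_t = (1.0 - t) * x0 + t * x1      # point on the path at time t
u_target = x1 - x0                 # target velocity along this path

def v_theta(x, t):
    # stand-in for a learned velocity field; predicts zeros here
    return np.zeros_like(x)

# Conditional flow matching loss for this one (x0, x1, t) triple
loss = float(np.mean((v_theta(x_t, t) - u_target) ** 2))
print(f"x_t={x_t}, loss={loss:.3f}")
```

Training averages this loss over many (x0, x1, t) triples; sampling then integrates the learned velocity field as an ODE from noise to data, which is exactly the machinery the course builds up to.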
ngrok on Quantization from Scratch
Quantization is now one of the most important methods for driving efficiency in real-world AI at scale, and this is a great deep dive from scratch: For production ML practitioners quantization matters because model size is dominated by weights, many of which are near zero and carry little information; with quantization, LLMs turn out to be surprisingly tolerant of storing parameters in lower-precision formats or compact integer representations instead of full-precision floats, resulting in major savings. The key takeaway on Qwen3.5 9B is that 8-bit quantization preserves quality almost entirely, 4-bit quantization shows modest degradation, and 2-bit quantization starts to collapse. This is a great post from ngrok, do check it out for the deep dive into ML quantization.
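The core mechanics behind 8-bit savings can be sketched in a few lines of NumPy. This is a generic asymmetric per-tensor uint8 quantizer for illustration, not the specific scheme in ngrok's post: map the weight range onto 0-255 with a scale and zero point, then reconstruct and measure the error and the 4x memory reduction versus float32.

```python
import numpy as np

def quantize_uint8(w):
    """Asymmetric per-tensor quantization of float32 weights to uint8."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0
    zero_point = round(-lo / scale)
    q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(42)
w = rng.standard_normal((256, 256)).astype(np.float32)  # toy weight matrix

q, scale, zp = quantize_uint8(w)
w_hat = dequantize(q, scale, zp)

rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
saving = w.nbytes / q.nbytes  # 4 bytes per float32 vs 1 byte per uint8
print(f"{saving:.0f}x smaller, relative reconstruction error {rel_err:.4f}")
```

The relative error here lands around 1%, which matches the observation that 8 bits is usually lossless in practice; pushing to 4 or 2 bits shrinks the integer grid and is where the degradation reported on Qwen3.5 9B kicks in.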
Stanford is releasing its new 2026 course on Transformer models for FREE! CS25 Transformers United V6 has fantastic, updated content for practitioners to dive into the field: the course offers broad coverage of the evolving frontier beyond vanilla transformer models, into the architectures powering the field today. Rather than teaching one deployment recipe, it curates talks from leading researchers and practitioners across core model architectures and adjacent paradigms, covering transformers, JEPA, state space models, and real-world perspectives from companies like Hugging Face, Anthropic, DeepMind, and Modal. This may end up being one of the most relevant courses on the topic once all the lectures are updated, so make sure to keep an eye out as the course material becomes available!
Upcoming MLOps Events
The MLOps ecosystem continues to grow at break-neck speed, making it ever harder for us as practitioners to stay up to date with relevant developments. A fantastic way to stay on top of relevant resources is through the great community and events that the MLOps and Production ML ecosystem offers. This is the reason why we have started curating a list of upcoming events in the space, outlined below.
Events we are speaking at this year:
Signals Conference - Sept @ Berlin
World Summit AI Europe - Sept @ Amsterdam
Other relevant events:
KubeCon Europe - March @ Amsterdam
PyData Berlin - April @ Frankfurt
Databricks Summit - June @ San Francisco
World Developer Congress - July @ Berlin
EuroPython 2026 - July @ Prague
EuroSciPy 2026 - July @ Krakow
Code.Talks 2026 - Nov @ Hamburg
MLOps World 2026 - Nov @ Austin
In case you missed our talks, check our recordings below:
The State of AI in 2025 - WeAreDevelopers 2025
Prod Generative AI in 2024 - KubeCon AI Day 2025
The State of AI in 2024 - WeAreDevelopers 2024
Responsible AI Workshop Keynote - NeurIPS 2021
Practical Guide to ML Explainability - PyCon London
ML Monitoring: Outliers, Drift, XAI - PyCon Keynote
Metadata for E2E MLOps - KubeCon NA 2022
ML Performance Evaluation at Scale - KubeCon Eur 2021
Industry Strength LLMs - PyData Global 2022
ML Security Workshop Keynote - NeurIPS 2022
Open Source MLOps Tools
Check out the fast-growing ecosystem of production ML tools & frameworks in the GitHub repository, which has reached over 20,000 ⭐ GitHub stars. We are currently looking for more libraries to add - if you know of any that are not listed, please let us know or feel free to add a PR. Here are a few featured open source libraries that we maintain:
KAOS - K8s Agent Orchestration Service for managing the KAOS in large-scale distributed agentic systems.
Kompute - Blazing fast, lightweight and mobile phone-enabled GPU compute framework optimized for advanced data processing use cases.
Production ML Tools - A curated list of tools to deploy, monitor and optimize machine learning systems at scale.
AI Policy List - A mature list that maps the ecosystem of artificial intelligence guidelines, principles, codes of ethics, standards, regulation and beyond.
Agentic Systems Tools - A new list that aims to map the emerging ecosystem of agentic systems with tools and frameworks for scaling this domain.
Please do support some of our open source projects by sharing, contributing or adding a star ⭐
About us
The Institute for Ethical AI & Machine Learning is a European research centre that carries out world-class research into responsible machine learning.
