Letting chatbots write Community Notes: bold move or recursive chaos?

PODCAST
ML Engineers Who Ignore LLMs Are Voluntarily Retiring Early

This one was special for me, as I played Co-founder Cupid introducing Yoni and Kostas.

It was a great chat too, about how inference is becoming a core data transformation step, while most tooling - like Spark - was built for structured, deterministic workloads. AI pipelines introduce non-determinism, GPU bottlenecks, and scaling issues that legacy infra can’t handle.

They explored what’s missing from today’s AI infrastructure:

  • Lineage needs to go deeper - Teams need row-level tracing to debug how specific inputs affect outputs.

  • Evals lack full context - Single-call evals miss issues that emerge across multi-step chains.

  • Reliability is its own job now - Expect to see roles focused solely on production-grade AI stability.
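To make the row-level lineage point concrete, here's a minimal, self-contained sketch of the idea - tag each output with a hash of the exact input row that produced it, so you can trace a bad output back through the pipeline. The `RowLineageTracer` class and its method names are invented for illustration, not from any real tool:

```python
import hashlib
import json

class RowLineageTracer:
    """Toy row-level lineage: record which input row produced each output."""

    def __init__(self):
        self.records = []

    def trace(self, step_name, input_row, output_row):
        # Hash the input so a specific output can be traced back to the
        # exact row that produced it, even several pipeline steps later.
        input_hash = hashlib.sha256(
            json.dumps(input_row, sort_keys=True).encode()
        ).hexdigest()[:12]
        self.records.append(
            {"step": step_name, "input_hash": input_hash, "output": output_row}
        )
        return output_row

    def find_inputs_for(self, predicate):
        # Return the input hashes of every record whose output looks wrong.
        return [r["input_hash"] for r in self.records if predicate(r["output"])]


tracer = RowLineageTracer()
row = {"id": 1, "text": "hello"}
tracer.trace("summarize", row, {"id": 1, "summary": "hi"})

# Debug: which inputs produced suspiciously short summaries?
bad = tracer.find_inputs_for(lambda out: len(out["summary"]) < 3)
```

Real systems would persist this across distributed workers, but the core contract - input hash travels with the output - is the same.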

I’ll play Cupid again - click below to meet your next favorite episode.

HIDDEN GEMS

Give It Away // Gem // Song
A Google blog detailing the handover of its A2A framework to the Linux Foundation, outlining its architecture for interoperable, component-based AI systems.

Unified // Gem // Song
A Netflix post introducing UDA, a unified data architecture designed to standardize access patterns, improve data quality, and streamline integration across internal platforms.

By Design // Gem // Song
An Anthropic post describing the design of a multi-agent research system for running and evaluating collaborative AI agents in controlled, reproducible environments.

The Notes Between the Notes // Gem // Song
OpenAI’s model release notes outlining updates to GPT-4o, GPT-4, and GPT-3.5, including performance changes, tool availability, and system behavior improvements.

PODCAST
The Missing Data Stack for Physical AI

Training, weights, pushing the limits... this can be a pretty physical job.

But this time we were talking physical AI - systems where machine learning runs in the real world. Robots, spatial computing, sensor-driven setups. Messy, ambiguous environments where things shift constantly and nothing waits for your model to catch up.

We talked about how teams are handling that, including:

  • Visualizing time-based data across modalities – camera, motion, sensors – all synced up

  • Debugging across online and offline systems – and spotting data bugs before they show up in prod
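The cross-modal syncing in the first bullet usually comes down to a nearest-timestamp join. Here's a toy sketch - pairing each camera frame with the closest IMU sample within a tolerance window; `align_streams`, the sample rates, and the 50 ms tolerance are all illustrative assumptions:

```python
import bisect

def align_streams(camera_ts, imu_ts, tolerance=0.05):
    """Pair each camera timestamp with the nearest IMU timestamp
    within `tolerance` seconds. Both inputs must be sorted."""
    pairs = []
    for t in camera_ts:
        i = bisect.bisect_left(imu_ts, t)
        # The nearest sample is one of the two neighbors of the insertion point.
        candidates = [imu_ts[j] for j in (i - 1, i) if 0 <= j < len(imu_ts)]
        if candidates:
            nearest = min(candidates, key=lambda s: abs(s - t))
            if abs(nearest - t) <= tolerance:
                pairs.append((t, nearest))
    return pairs

camera = [0.00, 0.10, 0.20]                  # 10 Hz camera frames
imu = [0.00, 0.01, 0.02, 0.09, 0.11, 0.19]   # faster IMU, with gaps
pairs = align_streams(camera, imu)
```

The hard part in production is that clocks drift and sensors drop samples, which is why the tolerance check matters - an unmatched frame is itself a data bug worth surfacing.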

Physical work more rewarding than your gym? Click below to listen.

WORLD TOUR
Practitioner-driven talks and panels, coming to you

ML stood for Miami Live last week, with talks on building trustworthy GenAI systems, designing scalable agent infra, and real-world agent UX + dev tooling.

Coming to a city near you:

RELEASE RADAR
MLflow 3: LoggedModel, GenAI workflows, and prompt evaluation

If you’ve ever had to wade through a sea of baseline_experiment_final_v2_reallyfinal_final, MLflow 3 might help clean things up.

The old run-centric structure is out. Now there’s a new LoggedModel object that ties together metadata from code, configs, traces, and evals - across both traditional ML and GenAI setups.

They’ve also rebuilt evaluation and monitoring with GenAI in mind. No more duct-taped dashboards - you can now track accuracy, latency, and cost out of the box.

Prompt engineering gets first-class treatment too:

  • Prompt Registry: Store, version, and document prompts properly.

  • Auto-tuning: Use eval feedback and labeled data to improve prompts.

  • Integrated evals: Built-in tools for measuring prompt performance.
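To show what "store, version, and document prompts properly" buys you, here's a minimal stand-in for the registry concept. This is deliberately not MLflow's actual API - check their docs for the real interface - just a self-contained sketch of versioned prompt storage:

```python
class PromptRegistry:
    """Toy versioned prompt store - illustrates the concept,
    not MLflow's actual Prompt Registry API."""

    def __init__(self):
        self._store = {}  # name -> list of version entries

    def register(self, name, template, notes=""):
        versions = self._store.setdefault(name, [])
        version = len(versions) + 1
        versions.append({"version": version, "template": template, "notes": notes})
        return version

    def get(self, name, version=None):
        # No version pin means "latest" - exactly what you want in dev,
        # and exactly what you pin away from in prod.
        versions = self._store[name]
        entry = versions[-1] if version is None else versions[version - 1]
        return entry["template"]


reg = PromptRegistry()
reg.register("summarize", "Summarize: {text}", notes="baseline")
reg.register("summarize", "Summarize in one sentence: {text}", notes="tighter")

latest = reg.get("summarize")             # newest version
pinned = reg.get("summarize", version=1)  # reproducible older version
```

The payoff is the same as model versioning: you can diff prompt versions, pin production to a known-good one, and attach eval results to a specific version instead of to `prompt_final_v2_reallyfinal`.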

And if you’re working with humans in the loop, you can now log annotations next to predictions - useful when you’re getting feedback from domain experts or tracking changes over time.

MEME OF THE WEEK

BLOG
The Great Data Divergence: Why Generative AI Demands a New Approach Beyond the Data Lake

Last week's Hot Take said the current approach to RAG and agents is fundamentally broken - this blog lays out the case in more detail.

It explains that GenAI systems like RAG need fast, contextual access to live data - something traditional data lakes, with their batch pipelines and delayed curation, just can’t deliver. The post argues for an API-first model where operational systems stay as sources of truth and are accessed directly.

Rather than copying everything into a lake just to make it usable, APIs offer a cleaner setup:

  • Agents query systems like Salesforce or Jira in real time

  • APIs handle access, monitoring, and governance

  • The data lake sticks around, but only for historical and analytical workloads
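A small sketch of what that API-first shape looks like from the agent's side - a thin tool wrapper that queries the live system of record instead of a lake copy. The `fetch_ticket` function, the Jira-style payload, and the fake client are all hypothetical:

```python
def fetch_ticket(ticket_id, client):
    """Query the operational system directly - it stays the source of truth.
    The wrapper is where access control, monitoring, and governance
    hooks live, not in a copy-everything pipeline."""
    record = client.get(f"/tickets/{ticket_id}")
    # Return only the fields the agent needs, not the whole record.
    return {"id": record["id"], "status": record["status"]}


class FakeJiraClient:
    """Stand-in for a real HTTP client so this sketch runs offline."""

    def get(self, path):
        ticket_id = path.rsplit("/", 1)[-1]
        return {"id": ticket_id, "status": "In Progress", "assignee": "dana"}


ticket = fetch_ticket("ENG-42", FakeJiraClient())
```

The contrast with the lake approach: the answer reflects the ticket's state right now, and the API boundary is a single place to enforce who can read what.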

It’s not really a hot take to say you should read this one. 

ML CONFESSIONS

Messed up a feature flag during a model rollout and ended up sending the same push notification to around 40,000 users repeatedly over 15 minutes. The client thought they were being spammed and revoked the API key before anyone on our side noticed. Wasn't the model's fault, just a bad connection between the scoring service and the messaging queue. We fixed it in under an hour, but still had to sit through a very awkward call with their CTO.

Roll out your confession here. 

HOT TAKE

Most LLM infra is built by people who’ve never debugged a broken embedding, let alone tracked it through a production pipeline.

Feeling seen after chasing down a broken vector? Or quietly offended because you’ve actually done the hard stuff? Let me know.

HOW WE CAN HELP

Working on something tricky or planning ahead? Here’s how we can help - just hit reply:

  • Custom workshops tailored to your company’s needs

  • Hiring? I know some quality folks looking for a new adventure

  • Want to connect with someone tackling similar problems? I can introduce you

Thanks for reading, catch you next time!
