Melchi

I’m a technologist based in Melbourne, working across cloud, AI infrastructure, GenAI, and accelerated compute.

I spend most of my time helping teams make sense of complex technology decisions: how to run AI workloads efficiently, how to design cloud platforms that scale, how to think about cost and performance, and how to turn fast-moving ideas into systems that are useful in the real world.

This blog is my place to slow those ideas down and explain them properly. I write about LLM inference, GPUs, AWS, Kubernetes, distributed systems, cloud cost, software engineering, and the trade-offs that sit underneath modern AI and platform architecture.

My goal is simple: make difficult technical topics easier to understand, while still going deep enough to be useful.

The views here are my own.

Understanding KV Cache: The Hidden Memory Cost of Serving LLMs

How attention architectures evolved to keep KV cache from eating your GPU, and what that means if you self-host.

Already comfortable with KV cache and attention? Skip the theory and jump straight to the interactive KV Cache Calculator to size VRAM for your model, batch size, and target GPU.

If you’re planning to self-host a large language model, you’ve probably sized VRAM based on parameters alone. A 70B model in BF16 needs roughly 140 GB just for weights. That’s the easy part: 70 billion parameters × 2 bytes.
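The weights-only arithmetic can be sketched in a few lines of Go. This is a rough illustration, not a sizing tool: WeightGB is a hypothetical helper, decimal gigabytes (1e9 bytes) are assumed, and the FP8 and 4-bit rows are added for comparison only. It deliberately ignores KV cache, activations, and runtime overhead.

```go
package main

import "fmt"

// WeightGB returns the memory needed for model weights alone,
// in decimal gigabytes (1 GB = 1e9 bytes). Hypothetical helper.
func WeightGB(params, bytesPerParam float64) float64 {
	return params * bytesPerParam / 1e9
}

func main() {
	// Bytes per parameter for a few common precisions (illustrative).
	precisions := []struct {
		name  string
		bytes float64
	}{
		{"BF16", 2},
		{"FP8", 1},
		{"4-bit", 0.5},
	}
	for _, p := range precisions {
		fmt.Printf("70B in %s: %.0f GB\n", p.name, WeightGB(70e9, p.bytes))
	}
}
```

The BF16 row reproduces the 140 GB figure above; everything the model needs at serving time comes on top of that.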

Rate limiting in Golang HTTP client

I’ve been doing some interesting work with the team at MFlow writing HTTP clients that consume financial data, and it’s been eye-opening to see how different API platforms choose to protect their resources. Best practices for client-side rate limiting are scarce compared with server-side guidance, so here are my thoughts on the subject and some code samples.

TL;DR — wrap *http.Client and call limiter.Wait(ctx) before every request, where limiter is a *rate.Limiter from golang.org/x/time/rate. The token bucket honours bursts, blocks cleanly when you’re out of tokens, and respects context cancellation.
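A minimal sketch of that wrapper, under stated assumptions. LimitedClient, Waiter, tickLimiter, okTransport and fetchN are hypothetical names chosen for illustration. The wrapper depends only on a one-method interface, which *rate.Limiter satisfies through its Wait(ctx) method. The ticker-based limiter below is a stdlib-only stand-in so the example runs without external modules; unlike a token bucket, it allows no bursts. The canned transport avoids any real network call.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// Waiter is the single method the wrapper needs.
// *rate.Limiter from golang.org/x/time/rate satisfies it.
type Waiter interface {
	Wait(ctx context.Context) error
}

// LimitedClient (hypothetical name) wraps *http.Client and waits
// on the limiter before every request.
type LimitedClient struct {
	Client  *http.Client
	Limiter Waiter
}

// Do blocks until the limiter allows the request or the request's
// context is cancelled, then delegates to the wrapped client.
func (c *LimitedClient) Do(req *http.Request) (*http.Response, error) {
	if err := c.Limiter.Wait(req.Context()); err != nil {
		return nil, err
	}
	return c.Client.Do(req)
}

// tickLimiter is a stand-in for illustration only: one request per tick.
type tickLimiter struct{ t *time.Ticker }

func (l tickLimiter) Wait(ctx context.Context) error {
	select {
	case <-l.t.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// okTransport returns a canned 200 response without touching the network.
type okTransport struct{}

func (okTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody, Request: req}, nil
}

// fetchN sends n rate-limited requests and counts the 200 responses.
func fetchN(n int) (int, error) {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	c := &LimitedClient{
		Client:  &http.Client{Transport: okTransport{}},
		Limiter: tickLimiter{t: ticker},
	}
	ok := 0
	for i := 0; i < n; i++ {
		req, err := http.NewRequestWithContext(context.Background(),
			http.MethodGet, "http://example.invalid/", nil)
		if err != nil {
			return ok, err
		}
		resp, err := c.Do(req)
		if err != nil {
			return ok, err
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			ok++
		}
	}
	return ok, nil
}

func main() {
	n, err := fetchN(3)
	fmt.Println(n, err)
}
```

To get real token-bucket behaviour, the stand-in could be swapped for something like rate.NewLimiter(rate.Limit(5), 10) from golang.org/x/time/rate, with the rate and burst values tuned to the API's published limits.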

Cloud cost management

TL;DR — Compute is usually the largest line item on your cloud bill. Bills tell you what you spent, not what you used. Measure utilisation with percentiles (P95/P99), not averages. Prefer always-on elastic infrastructure over scheduled shutdowns, and let Kubernetes bin-pack workloads to squeeze more value out of every node.
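To see why percentiles matter more than averages, here is a small sketch using made-up CPU utilisation samples (the numbers are invented for illustration, not measurements). Percentile and Mean are hypothetical helpers, and the nearest-rank method is one of several common percentile definitions.

```go
package main

import (
	"fmt"
	"sort"
)

// Mean returns the arithmetic mean of the samples (0 for an empty slice).
func Mean(samples []float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sum := 0.0
	for _, s := range samples {
		sum += s
	}
	return sum / float64(len(samples))
}

// Percentile returns the p-th percentile (1 to 100) using the
// nearest-rank method. It sorts a copy, so the input is left unchanged.
func Percentile(samples []float64, p int) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	// Integer ceiling of p*n/100 avoids floating-point rounding surprises.
	rank := (p*len(sorted) + 99) / 100
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	// Invented data: a mostly idle node with two short spikes (CPU %).
	samples := make([]float64, 0, 20)
	for i := 0; i < 18; i++ {
		samples = append(samples, 10)
	}
	samples = append(samples, 90, 95)

	fmt.Printf("mean=%.2f p95=%.0f p99=%.0f\n",
		Mean(samples), Percentile(samples, 95), Percentile(samples, 99))
}
```

With these samples the mean sits near 18% while P95 and P99 sit at 90% and above, so sizing on the average would leave the node short during spikes.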

Back in 2015, public cloud services were not well understood. Large enterprises debated whether migrating to the cloud would meet their security requirements, paralysed by fear of the unknown. We have come a long way since — digital transformation is now synonymous with cloud migration. The benefits of on-demand infrastructure and elasticity have made engineers more productive and businesses happier with the promise of improved time-to-market.

Securing your CaaS using Google's gVisor

TL;DR — A standard Linux container is an isolation boundary, not a security boundary. Every container on a host shares one kernel, so a single kernel exploit can compromise the whole node. gVisor inserts a user-space kernel (the Sentry, launched by the runsc runtime) between your container and the host, dramatically shrinking the attack surface. It’s now production-grade — Google runs Cloud Run, App Engine and Cloud Functions on it — and integrates cleanly with containerd and Kubernetes via RuntimeClass.