Mar 12

We thought retry + DLQ was enough

Nothing crashed—things just got worse.

After I sent the email “We skipped system design patterns, and paid the price”, one subscriber replied with a lesson from the field.

(If you missed that email, you can read the story here)

Subscriber reply

Something we learned the hard way: sometimes the patterns matter less than the failure modes they create.

We had systems that “used the right patterns” on paper, but still failed quietly because we hadn’t thought through backpressure, retries, or blast-radius boundaries.

Nothing crashed — things just got worse.

Choosing the pattern was only half the design.

“Nothing crashed — things just got worse.” That line caught my attention.

Take the simple event pipeline below.

An upstream service receives orders from clients through an API and publishes each one as a JSON message to a Kafka topic called payment-requests. A billing service consumes that message, converts the JSON into XML, and sends the request to an external payment gateway.

Retry + DLQ

Now imagine the external payment gateway slows down or becomes unavailable. The upstream service continues publishing messages, but the billing service cannot complete the request because the external system is not responding.

This is why most teams introduce retry logic and a Dead Letter Queue (DLQ).

Retries allow the system to recover from transient failures such as temporary network issues, short outages, or brief latency spikes from the external system. If the message still cannot be processed after several attempts, it is moved to a DLQ so it can be inspected later instead of blocking the pipeline.
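As a rough illustration of the retry-then-DLQ flow described above, here is a minimal sketch. It is not the billing service's actual code: `TransientError`, `process_with_retry`, the in-memory `dead_letters` list, and the retry limits are all hypothetical names and values chosen for the example. A real consumer would publish failed messages to a separate topic rather than append to a list.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def process_with_retry(message, handler, dead_letters, max_attempts=3, base_delay=0.5):
    """Try handler(message) up to max_attempts times, then dead-letter it.

    Returns True if the message was processed, False if it went to the DLQ.
    dead_letters is a plain list standing in for a real dead letter queue.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True
        except TransientError:
            if attempt < max_attempts:
                # Exponential backoff between attempts: base, 2x base, 4x base...
                time.sleep(base_delay * 2 ** (attempt - 1))
    # All attempts failed: park the message so the pipeline keeps moving.
    dead_letters.append(message)
    return False
```

Note what this handles and what it does not: it reacts only when the handler raises. A call that eventually succeeds, however slowly, never reaches the retry or DLQ path.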

Nothing crashed

Now back to the subscriber’s reply. He was not talking about transient failures. The downstream service was slowing down.

Imagine the external payment gateway taking 10–20 seconds to respond to each request. No error is returned; the response just takes longer than usual.

Meanwhile the upstream service continues taking orders. Messages keep getting published to the topic. The billing service keeps consuming them, but because it depends on the external system, each request takes much longer to complete. As a result, the billing service cannot process messages at the same rate they are being produced.

The queue begins to grow. Nothing crashes, but the system slowly falls behind.
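The arithmetic behind the growing queue can be sketched in a few lines. The numbers here are assumptions for illustration, not figures from the subscriber's system: 50 orders per second arriving, 10 concurrent billing workers, and a gateway latency of either 0.1 or 10 seconds. The function `backlog_over_time` is a hypothetical helper, and the model is deliberately crude (steady rates, no retries).

```python
def backlog_over_time(arrival_rate, latency_s, workers, duration_s):
    """Return the backlog size at the end of each second.

    arrival_rate: messages published per second
    latency_s:    seconds the downstream takes per request
    workers:      requests the consumer can have in flight at once
    """
    # Each worker finishes 1 / latency_s requests per second.
    service_rate = workers / latency_s
    backlog = 0.0
    history = []
    for _ in range(duration_s):
        # Backlog grows by the gap between arrivals and completions,
        # and cannot drop below zero.
        backlog = max(0.0, backlog + arrival_rate - service_rate)
        history.append(backlog)
    return history


# Healthy gateway: 10 workers / 0.1 s = 100 msg/s capacity, above 50 msg/s arrivals.
healthy = backlog_over_time(arrival_rate=50, latency_s=0.1, workers=10, duration_s=60)

# Slow gateway: 10 workers / 10 s = 1 msg/s capacity, far below 50 msg/s arrivals.
slow = backlog_over_time(arrival_rate=50, latency_s=10, workers=10, duration_s=60)

print(healthy[-1])  # 0.0
print(slow[-1])     # 2940.0
```

Under these assumed numbers, one minute of slow responses leaves about 2,940 messages waiting, and not a single request has failed.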

The analogy

You can think of it like a restaurant kitchen. The waiters keep taking orders from customers and sending them to the kitchen, but the chef is slowing down. Maybe the stove is not heating well, or each dish takes longer to prepare.

Orders start piling up above the chef. Nothing is broken, but the kitchen slowly falls behind.

The lesson

Retry and DLQ help when something fails. They do not help when work keeps arriving faster than the downstream can complete it.

The danger is quiet failure, a side of event-driven architecture that is rarely discussed.
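Because a slowdown like this produces no errors, an error-rate alert stays silent. One way to make it visible is to watch the trend of the backlog itself. The sketch below is an assumption about how such a check might look, not something from the subscriber's setup: `lag_is_growing` is a hypothetical helper, and the lag samples would come from whatever backlog or consumer-lag metric the system already exposes.

```python
def lag_is_growing(lag_samples, min_samples=3):
    """Return True if the last min_samples readings each exceed the one before.

    lag_samples: backlog sizes taken at regular intervals, oldest first.
    """
    if len(lag_samples) < min_samples:
        return False  # not enough data to call it a trend
    recent = lag_samples[-min_samples:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


print(lag_is_growing([10, 50, 120]))  # True: falling steadily behind
print(lag_is_growing([10, 50, 40]))   # False: the consumer is catching up
```

A check like this fires on the shape of the problem described above: nothing has crashed, but the system is steadily falling behind.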
