- Jul 17, 2025
Scaling Pay Bills with Async Messaging
At the end of each month, thousands of users log into our digital banking app at once to pay their bills. The surge in traffic was overwhelming our backend. We needed to handle 1,000 transactions per second (TPS) consistently, but our original system struggled to keep up.
The culprit was a government payment gateway that we relied on to process bill payments. On paper the gateway promised 99.99% uptime, but in practice it was often slow or unresponsive at peak times.
Our threads would get stuck waiting for the gateway’s response, clogging up the server. The UX team had designed the app to show an on-screen success message immediately after a bill payment, but with the gateway lagging or failing, that approach was blocking threads and dragging down the whole system.
We frequently saw lost transactions during peak hours because the synchronous calls to the external gateway couldn’t keep pace.
The use case diagram below gives some context. It illustrates the retail customer's most-used use cases, including the bill payment scenario.
The Solution: Async Messaging
To break free of the payment gateway bottleneck, we decided to decouple the bill payment process using asynchronous messaging.
In the new design, the user’s request doesn’t directly depend on an immediate response from the external payment gateway. Instead, as soon as a user hits “Pay Bill,” our system queues the request via a message broker and instantly returns a confirmation to the user that “your payment request was received.”
This non-blocking acknowledgment freed up the web server threads to handle new requests instead of waiting around.
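The accept-and-acknowledge step can be sketched as follows. This is a minimal illustration, not the production implementation: the in-memory `queue.Queue` stands in for the message broker, and `PaymentRequest`, `handle_pay_bill`, and the field names are hypothetical.

```python
import queue
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class PaymentRequest:
    request_id: str
    user_id: str
    bill_id: str
    amount_cents: int


# In-memory stand-in for the message broker (an assumption for this sketch).
payment_queue = queue.Queue()


def handle_pay_bill(user_id: str, bill_id: str, amount_cents: int) -> dict:
    """Queue the payment and acknowledge right away, without calling the gateway."""
    request = PaymentRequest(str(uuid.uuid4()), user_id, bill_id, amount_cents)
    payment_queue.put(request)  # fast local hand-off; no external call here
    return {
        "status": "RECEIVED",
        "request_id": request.request_id,
        "message": "Your payment request was received.",
    }


response = handle_pay_bill("user-1", "bill-42", 12_500)
```

The handler never touches the gateway, so its latency depends only on the hand-off to the broker. The returned `request_id` is what a later status notification would refer to.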
The Architecture View
As the architecture diagram below shows, the front-end (mobile or web app) calls the retail service through the corresponding gateway, and the retail service publishes a payment event to the message broker instead of calling the government API synchronously.
The message broker routes the payment request to a dedicated worker service component running in the background. This worker component pulls messages off the queue and interacts with the government payment gateway on its own time.
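A rough sketch of that background worker, using Python threads and an in-memory queue as stand-ins for the worker service and the broker. `call_gateway`, `NUM_WORKERS`, and the simulated latency are assumptions for illustration.

```python
import queue
import threading
import time

NUM_WORKERS = 4   # assumed value; raise it to add throughput
STOP = object()   # sentinel telling a worker to shut down

payment_queue = queue.Queue()
results = []
results_lock = threading.Lock()


def call_gateway(payment_id: int) -> str:
    """Hypothetical stand-in for the slow external gateway call."""
    time.sleep(0.01)  # simulate gateway latency
    return "SUCCESS"


def worker() -> None:
    """Pull payments off the queue until the STOP sentinel arrives."""
    while True:
        item = payment_queue.get()
        if item is STOP:
            payment_queue.task_done()
            return
        outcome = call_gateway(item)
        with results_lock:
            results.append((item, outcome))
        payment_queue.task_done()


threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

for payment_id in range(20):
    payment_queue.put(payment_id)
for _ in threads:
    payment_queue.put(STOP)  # one sentinel per worker

payment_queue.join()
for t in threads:
    t.join()
```

Only the workers wait on the gateway. The code that enqueues payments returns immediately, which is the property the user-facing servers rely on.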
Because the work is queued, we can scale out multiple workers to collectively process 1,000+ TPS without burdening the user-facing app servers.
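For a rough sense of how many workers that takes, Little's law (in-flight requests = throughput x latency) gives a back-of-the-envelope estimate. The latency and per-worker concurrency figures below are assumptions for illustration, not measurements from our system.

```python
import math


def workers_needed(target_tps: int, gateway_latency_ms: int, calls_per_worker: int) -> int:
    """Estimate worker count from Little's law: in-flight = throughput x latency."""
    in_flight = math.ceil(target_tps * gateway_latency_ms / 1000)
    return math.ceil(in_flight / calls_per_worker)


# Assumed figures; measure the real gateway latency before sizing anything.
estimate = workers_needed(target_tps=1000, gateway_latency_ms=200, calls_per_worker=50)
slow_gateway = workers_needed(target_tps=1000, gateway_latency_ms=800, calls_per_worker=50)
```

With these assumed numbers the estimate is 4 workers at 200 ms gateway latency and 16 at 800 ms, which shows why worker count has to track gateway latency as well as request volume.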
Worker Model: Pay Bills in the Background
With this asynchronous setup, the bill payment workflow became resilient and scalable. The moment a payment request arrives on the “payment-received” topic, our billing component validates it, communicates with the external payment gateway, and produces a message onto the “payment-completed” topic if the payment succeeds.
Other components subscribe to events as well – for example, the SMS Notifier and Push Notifier components listen for payment outcome events. When the billing component gets a success or failure from the government API, it publishes a result message to the corresponding outcome topic.
The SMS and Push components consume that and send the user an update via Twilio (SMS) and Firebase Cloud Messaging (push notification). Each step is logged for our analytics and monitoring dashboards, so we have end-to-end visibility.
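The event flow above can be sketched with a tiny in-memory publish/subscribe layer. This is only an illustration: the `payment-failed` topic name, `fake_gateway_pay`, and the message fields are assumptions, and the real notifiers would call Twilio and Firebase Cloud Messaging where this sketch appends to a list.

```python
from collections import defaultdict
from typing import Callable

# Tiny in-memory pub/sub; a stand-in for the real broker.
subscribers = defaultdict(list)
sent_notifications = []


def subscribe(topic: str, handler: Callable[[dict], None]) -> None:
    subscribers[topic].append(handler)


def publish(topic: str, message: dict) -> None:
    for handler in list(subscribers[topic]):
        handler(message)


def fake_gateway_pay(message: dict) -> bool:
    """Hypothetical gateway call: in this sketch, positive amounts succeed."""
    return message["amount_cents"] > 0


def billing_component(message: dict) -> None:
    """Consume a payment request, call the gateway, publish the outcome."""
    ok = fake_gateway_pay(message)
    # "payment-failed" is an assumed topic name for this sketch.
    publish("payment-completed" if ok else "payment-failed", message)


def notifier(channel: str, outcome: str) -> Callable[[dict], None]:
    def handle(message: dict) -> None:
        # The real components would send an SMS or push notification here.
        sent_notifications.append((channel, message["user_id"], outcome))
    return handle


subscribe("payment-received", billing_component)
for topic, outcome in (("payment-completed", "paid"), ("payment-failed", "failed")):
    subscribe(topic, notifier("sms", outcome))
    subscribe(topic, notifier("push", outcome))

publish("payment-received", {"user_id": "user-1", "amount_cents": 12_500})
publish("payment-received", {"user_id": "user-2", "amount_cents": 0})
```

The billing component knows nothing about the notifiers. Adding another consumer, such as an analytics logger, is one more `subscribe` call and needs no change to the payment path.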
Publishing to the broker is fast, and once the message is accepted, we immediately respond to the user in the mobile app confirming that their payment request is in progress.
If the payment gateway is slow or temporarily down, the request just stays in the queue until the worker can process it. Users aren’t stuck staring at a loading spinner anymore.
We also implemented a retry mechanism with exponential backoff in the billing component. A failed message moves through a series of Kafka retry topics before it lands in the dead letter queue, which frees the billing component to keep processing other messages in the meantime. That is a bigger topic and deserves a separate article, but the idea should be clear for now.
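As a minimal sketch of that routing decision, assuming three retry topics and a delay that doubles on each attempt. The topic names, the base delay, and the cap are all hypothetical; the actual values are not given in this article.

```python
from typing import Optional, Tuple

# Assumed names and values for this sketch.
RETRY_TOPICS = ["payment-retry-1", "payment-retry-2", "payment-retry-3"]
DEAD_LETTER_TOPIC = "payment-dlq"
BASE_DELAY_SECONDS = 5.0
MAX_DELAY_SECONDS = 300.0


def backoff_delay(attempt: int) -> float:
    """Delay before retry number `attempt` (1-based): doubles each time, capped."""
    return min(BASE_DELAY_SECONDS * 2 ** (attempt - 1), MAX_DELAY_SECONDS)


def next_destination(failed_attempts: int) -> Tuple[str, Optional[float]]:
    """Pick where a message goes after it has failed `failed_attempts` times.

    Returns (topic, delay_seconds); the delay is None for the dead letter topic.
    """
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    if failed_attempts <= len(RETRY_TOPICS):
        return RETRY_TOPICS[failed_attempts - 1], backoff_delay(failed_attempts)
    return DEAD_LETTER_TOPIC, None


route = [next_destination(n) for n in range(1, 5)]
```

Under these assumptions a message waits 5, 10, and then 20 seconds across the three retry topics, and a fourth failure sends it to the dead letter topic. Parking the message in a retry topic, instead of sleeping inside the consumer, is what keeps the main topic moving.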
Getting Internal Buy-In
Switching to an async model wasn’t just a technical challenge; it was also a people challenge. Our UX and marketing teams were initially wary about not showing an immediate “Bill Paid!” confirmation on the screen.
They worried users might be confused or think the app failed if we didn’t show instant success. To get everyone on board, we organized a couple of review sessions. We walked the UX team through real data showing how often the old approach led to timeouts and crashes during the month-end rush.
Seeing the raw failure rates and customer support tickets helped drive the point home. We explained that an immediate “payment received” message, followed by a confirmed status update (via notification), would actually improve user trust because the app would never hang or error out even if the external payment system was delayed. In other words, a slightly changed UX flow was far better than a broken one.
The marketing team was also concerned about how this might impact user perception. We addressed this by highlighting that many large-scale systems (like e-commerce orders or ticket bookings) use a similar approach: a quick confirmation followed by an email or SMS.
It’s a proven pattern and far more reliable for our use case. This reliability and resilience in the user experience convinced them. In the end, everyone agreed that a better success rate and stable performance at 1,000 TPS were well worth the switch to an event-driven user flow.
Reliability and Scalability Achieved
Implementing asynchronous messaging for bill payments paid off. We met the 1,000 TPS throughput requirement handily. In fact, our system now scales beyond that just by spinning up more Docker pods.
More importantly, the app stays responsive even during gateway slowness or downtime. No more thread pile-ups or overnight outages at month’s end.
Users quickly adapted to the new flow. They appreciate getting instant confirmation that their request is received, and they trust that the final confirmation will come shortly via notification.
Internally, we’re happier too. Our centralized log storage gives us rich data for analytics and monitoring, and the ops team can track the payment pipeline in real time on our dashboards.
By decoupling the user interaction from the external service call, we turned a fragile bottleneck into a robust process. In hindsight, switching to an async architecture wasn’t just about hitting a TPS number – it fundamentally improved the reliability and user trust in our digital banking platform.
Related Materials
See how an architect tailors architecture diagrams to each target audience. Business, DevOps, leads, and developers each have distinct concerns. Communicate Like an Architect: Digital Bank Architecture Diagrams