>_ Analyst Engineering

How a Technical BA Investigates a Failed Payment


A failed payment investigation is one question asked at every hop: where is this transaction now, and what state is it in? You follow one payment from the complaint to the cause, and the point where reality diverges from the expected flow is your answer.

When someone tells me a payment failed, I do not start by theorizing about what went wrong. I start by finding the transaction and following it through every system it touched, asking the same question at each hop: where is it now, and what state is it in? The API response it got, the event it should have published, the database row with its status and reason code, the logs across services, and the status report it produced are each a place to ask that question, and the point where the answer stops matching the expected flow is the cause. That is the whole method. It is the same discipline I use to learn any system by testing it, pointed at a single broken transaction instead of a test case.

This is the work that made me a technical business analyst rather than someone who forwards questions to developers. I can do this investigation myself, in minutes, because I can query the database, read the logs, and trace the flow, and that independence is the whole job. The skills behind it are the spine of The Technical Skills Guide for BAs. Here is how a real investigation goes.

It starts with a vague complaint

Almost every investigation begins with a report that is missing the one thing I actually need. “A customer says their payment failed.” When? Which payment? What did they see? The first job is not technical at all; it is turning a vague complaint into a specific transaction I can track, because I cannot follow a payment I cannot identify.

So I ask for the identifier. The best case is the UETR, the unique end-to-end transaction reference that is carried through the whole flow precisely so a single payment can be tracked everywhere. Failing that, I ask for the end-to-end id, or for enough detail (the amount, the approximate time, the beneficiary) to find the transaction in the database with a query. This is where having an identifier that flows through every system pays off. A system that loses or regenerates that id makes my life much harder, which is itself a finding I have raised more than once. The identifier is the thread; without it, I am stitching the story together from fragments.

Once I have a candidate transaction, I confirm it is the right one before I go further, matching the amount, time, and parties against the complaint, because investigating the wrong payment wastes everyone’s time. This sounds obvious, but under the pressure of an upset stakeholder it is easy to grab the first transaction that looks close and chase a phantom. The discipline of pinning down exactly which transaction, by a reliable id, before investigating is the same precision that makes good requirements: be specific about the thing you are talking about. With the right transaction pinned down, the real work begins.
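When there is no UETR to start from, the fallback search can be sketched as a query on the details from the complaint. This is a minimal illustration, not a real schema: the table, the column names, the sample rows, and the `find_candidates` helper are all assumptions, and it uses an in-memory SQLite database so it runs on its own.

```python
import sqlite3

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """create table payments (
           uetr text primary key,
           amount_minor integer,
           currency text,
           beneficiary_name text,
           created_at text
       )"""
)
conn.executemany(
    "insert into payments values (?, ?, ?, ?, ?)",
    [
        ("uetr-001", 125000, "EUR", "Acme GmbH", "2024-03-04T09:58:12Z"),
        ("uetr-002", 125000, "EUR", "Acme GmbH", "2024-03-04T14:31:40Z"),
        ("uetr-003", 125000, "EUR", "Other Ltd", "2024-03-04T10:02:05Z"),
    ],
)


def find_candidates(conn, amount_minor, currency, beneficiary, start, end):
    """Return UETRs of payments matching the complaint within a time window."""
    rows = conn.execute(
        """select uetr from payments
           where amount_minor = ? and currency = ?
             and beneficiary_name = ?
             and created_at between ? and ?
           order by created_at""",
        (amount_minor, currency, beneficiary, start, end),
    ).fetchall()
    return [row[0] for row in rows]


# Complaint: 1,250.00 EUR to Acme GmbH, "around ten in the morning".
candidates = find_candidates(
    conn, 125000, "EUR", "Acme GmbH",
    "2024-03-04T09:30:00Z", "2024-03-04T10:30:00Z",
)
print(candidates)
```

Exactly one match means the transaction is pinned down. Zero or several matches means going back to the stakeholder for more detail before investigating anything.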

I follow the transaction, hop by hop

With the identifier, I follow the payment through the system the way it actually moved, checking the state at each hop against what should have happened. This is the core of the investigation, and it mirrors the end-to-end flow a payment takes: ingestion, event, processing, database, status, callback.

First, the database. I query the payment’s row and read its real status and reason code, which immediately narrows the problem. A status of REJECTED with reason code AC04 tells me the payment was rejected for a closed account, a clear answer. A status stuck at a pending or in-progress value tells me the payment got stuck somewhere, a very different problem. The database is where I usually start, because the stored state is the fastest single indicator of what happened, and querying it with SQL takes seconds.

-- Read the stored state of one payment by its UETR (placeholder value).
select status, reason_code, created_at, updated_at
from payments
where uetr = 'abc-123';
-- status REJECTED, reason_code AC04  -> rejected, closed account
-- status RCVD, updated_at hours ago  -> stuck after receipt
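The two readings in the query comments can be turned into a small triage rule. This is a sketch under assumptions: the status names other than REJECTED and RCVD, the 30-minute threshold, and the `triage` helper are hypothetical, and a real system will define its own.

```python
from datetime import datetime, timedelta, timezone

# Assumed values; a real system will have its own statuses and thresholds.
COMPLETED_STATUSES = {"SETTLED"}
STUCK_AFTER = timedelta(minutes=30)


def triage(status, reason_code, updated_at, now):
    """Classify a payment row as rejected, completed, stuck, or in progress."""
    if status == "REJECTED":
        return f"rejected ({reason_code})"
    if status in COMPLETED_STATUSES:
        return "completed"
    # A non-final status that has not moved for a long time is the
    # "stuck somewhere" case, which points to the logs as the next hop.
    if now - updated_at > STUCK_AFTER:
        return f"stuck at {status}"
    return "in progress"


now = datetime(2024, 3, 4, 15, 0, tzinfo=timezone.utc)
print(triage("REJECTED", "AC04", now - timedelta(minutes=2), now))
print(triage("RCVD", None, now - timedelta(hours=3), now))
```

The first call reports a rejection with its reason code, and the second reports a payment stuck at RCVD. Those are the same two cases the query comments distinguish.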

Then the logs. I search the logs by the UETR across all services and order them by time, reconstructing the transaction’s journey. The log trail shows me which services it touched, the timing, and, crucially, any errors or warnings. If the payment is stuck, the logs usually show where it stopped: the service that errored, the downstream call that timed out, or the event that never got consumed. If the correlation id breaks at some service, the trail goes cold there, which tells me both where to look and that there is an observability gap worth raising.

Finally, the status and events. I check what the status endpoint reports and, when relevant, whether the expected event fired on the queue, because a payment that did not publish its event failed upstream of wherever everyone assumed.
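The log search described above amounts to filtering by the identifier, ordering by time, and noting where the trail stops. As a rough sketch, assuming the logs are already parsed into records: the field names, service names, and sample messages here are invented for illustration, and in practice this is usually a query in a log search tool.

```python
# Hypothetical, already-parsed log entries from several services.
LOGS = [
    {"ts": "2024-03-04T09:58:13.120Z", "service": "processor", "level": "INFO",
     "uetr": "uetr-001", "msg": "validation passed"},
    {"ts": "2024-03-04T09:58:12.040Z", "service": "ingestion", "level": "INFO",
     "uetr": "uetr-001", "msg": "payment received"},
    {"ts": "2024-03-04T09:58:43.500Z", "service": "processor", "level": "ERROR",
     "uetr": "uetr-001", "msg": "timeout calling ledger"},
    {"ts": "2024-03-04T09:58:12.900Z", "service": "ingestion", "level": "INFO",
     "uetr": "uetr-002", "msg": "payment received"},
]

# Assumed order of services a healthy payment passes through.
EXPECTED_SERVICES = ["ingestion", "processor", "ledger"]


def trail(logs, uetr):
    """Return one payment's log entries across all services, oldest first."""
    mine = [entry for entry in logs if entry["uetr"] == uetr]
    # ISO 8601 timestamps in one format and time zone sort correctly as text.
    return sorted(mine, key=lambda entry: entry["ts"])


def summarize(entries, expected_services):
    """Report where the trail ends, what went wrong, and which hops are absent."""
    seen = {entry["service"] for entry in entries}
    return {
        "last_seen": entries[-1]["service"] if entries else None,
        "problems": [
            f'{entry["service"]}: {entry["msg"]}'
            for entry in entries
            if entry["level"] in ("WARN", "ERROR")
        ],
        "missing": [s for s in expected_services if s not in seen],
    }


timeline = trail(LOGS, "uetr-001")
summary = summarize(timeline, EXPECTED_SERVICES)
for entry in timeline:
    print(entry["ts"], entry["service"], entry["level"], entry["msg"])
print(summary)
```

A service that appears in `missing` either never received the payment or logged it without the identifier. The second case is the observability gap worth raising.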

The point of divergence is the answer

The cause of a failed payment is the hop where reality diverges from the expected flow. Once I have followed the transaction, I compare what happened at each step to what should have happened, and the first place they disagree is where the failure lives. Everything before that point worked; the divergence is the problem.

This framing is what makes the investigation reliable rather than a guessing game. I am not hunting for a culprit at random; I am walking a known-good path and looking for the first deviation. The payment should have gone from received to accepted to settled; if the database shows it stuck at received with a downstream timeout in the logs, the divergence is at the processing hop, and the cause is the timeout. The payment should have produced a pacs.002 with ACCP; if it produced RJCT with AC04, the divergence is at validation, and the cause is the closed account. The expected flow is the reference, the actual flow is the evidence, and the gap between them is the finding. This is state-machine thinking: knowing the legal path and spotting where the transaction left it.
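The comparison itself is simple enough to sketch. This is a minimal version, assuming the expected path and the observed states are both available as ordered lists; the state names and the `first_divergence` helper are illustrative, not taken from any particular system.

```python
from itertools import zip_longest

# Assumed happy path; real systems name and order their states differently.
EXPECTED = ["RECEIVED", "ACCEPTED", "SETTLED"]


def first_divergence(expected, actual):
    """Return (step, expected_state, actual_state) at the first mismatch.

    A missing state on either side shows up as None. Returns None when the
    two paths are identical.
    """
    for step, (want, got) in enumerate(zip_longest(expected, actual)):
        if want != got:
            return step, want, got
    return None


# Stuck after receipt: nothing happened where ACCEPTED was expected.
print(first_divergence(EXPECTED, ["RECEIVED"]))              # (1, 'ACCEPTED', None)
# Rejected at validation instead of being accepted.
print(first_divergence(EXPECTED, ["RECEIVED", "REJECTED"]))  # (1, 'ACCEPTED', 'REJECTED')
# Healthy payment: no divergence.
print(first_divergence(EXPECTED, EXPECTED))                  # None
```

In both failing cases the divergence is at step 1, but the observed value differs: `None` points to a stall and the logs, while `REJECTED` points to the reason code.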

The causes themselves vary (validation failures, insufficient funds, downstream timeouts that strand a payment, duplicate-handling problems, data that breaks a scheme’s rules), but the method does not. I follow the transaction and find the divergence, regardless of the underlying cause. That is why I can investigate a payment failure in a system I do not know well: I do not need to know in advance what went wrong; I need to follow one transaction and watch for the point where it stops behaving as it should. This is the same skill as testing a payment, where you deliberately break a payment and trace where it diverges; investigation is testing run backward, from the symptom to the cause.

The investigation is where requirements are born

The most valuable part of a payment investigation is not fixing the one payment; it is what the investigation reveals about the system that becomes a requirement. Every investigation that turns up a gap, such as a broken correlation id, a stuck-payment state with no recovery, or a reason code that produced a misleading customer message, is a requirement waiting to be written.

This is where the business analyst and the QA analyst in me take over from the investigator. When I find that a payment got stuck because a downstream timeout had no recovery path, that is not just this payment’s problem; it is a missing requirement that will strand every payment that hits that timeout. When I find that the customer saw “payment failed” for what was actually a closed account, that is a reason code mapping gap that misleads every customer in that situation. When I find that the UETR was dropped at one service, that is an observability requirement: every service must log the end-to-end id. The single investigation surfaces a systemic fix, and writing that fix as a precise requirement is where the lasting value is.
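The reason code mapping gap is the easiest of these to make concrete. Here is a sketch of what such a requirement might look like once written down, assuming a simple lookup table. The customer-facing wording and both helper functions are hypothetical; the codes are ISO 20022 reason codes (AC01 incorrect account number, AC04 closed account number, AM04 insufficient funds).

```python
# Hypothetical customer-facing wording for a few ISO 20022 reason codes.
CUSTOMER_MESSAGES = {
    "AC01": "The account number appears to be incorrect.",
    "AC04": "The beneficiary account is closed.",
    "AM04": "There were insufficient funds for this payment.",
}

# What the customer sees when no specific mapping exists.
GENERIC_MESSAGE = "Payment failed."


def customer_message(reason_code):
    """Return the specific message for a reason code, or the generic fallback."""
    return CUSTOMER_MESSAGES.get(reason_code, GENERIC_MESSAGE)


def unmapped_codes(observed_codes):
    """List reason codes seen in production that still get the generic message."""
    return sorted(set(observed_codes) - set(CUSTOMER_MESSAGES))


print(customer_message("AC04"))
print(customer_message("MS03"))
print(unmapped_codes(["AC04", "MS03", "AM04", "MS03"]))
```

Running `unmapped_codes` over the reason codes actually stored in the database turns one customer's misleading message into a list of every mapping still missing.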

So I treat every investigation as analysis, not just support. I note what I had to do to track the payment down, the gaps that made it hard, and the behaviors that surprised me, and I turn them into requirements and improvements. Over time this makes the system more diagnosable and the failures less frequent, because the investigations feed back into the design. That loop (investigate, learn, specify, improve) is the difference between an analyst who closes tickets and one who makes the system better, and it is the heart of why production support is such a powerful teacher for analysts. The full toolkit for working this way is in The Technical Skills Guide for BAs.

The takeaway

Investigating a failed payment is one question asked at every hop: where is this transaction, and what state is it in? You turn a vague complaint into a specific transaction by its UETR, follow it through the database, the logs, the events, and the status, and find the point where reality diverges from the expected flow; that divergence is the cause. The method is the same regardless of what actually went wrong, which is why a technical BA can investigate a payment in a system they barely know.

And the best investigations do not end with the one payment; they surface the gaps that become requirements, feeding back to make the system more diagnosable and reliable. Start with The Technical Skills Guide for BAs and Break Into Banking, or browse everything at The Tech BA Toolkit.

Ahmed is a Senior Technical Business Analyst with 10+ years in banking and payments. He builds practical guides and tools for analysts at The Tech BA Toolkit.

Tags: Business Analysis, Payments, Production Support, Banking, Career Growth
