What I Learned Reading Logs During a Major Incident
During a major incident, the logs are the difference between knowing something is broken and knowing what is broken. I learned more about observability, and about staying calm, from one bad night reading logs than from any document, and the gaps I hit that night became requirements.
The thing nobody tells you about a major incident is that the metrics tell you something is wrong and the logs tell you what. The dashboard goes red, throughput drops, errors spike, and that alarm is useful for about thirty seconds, until everyone is staring at it asking the only question that matters: what is actually broken. The answer is in the logs, and finding it means tracing individual transactions through the system under pressure, which is a skill, not a panic. I learned that skill the hard way, on a night when payments were failing and a room full of people was waiting for someone to say where. This is the kind of developer analyst and systems analyst work that reading production logs prepares you for, and the broader skills are in The Technical Skills Guide for BAs.
What that night taught me was not really about logs. It was about how to think when a system is on fire, and about all the small observability gaps that are invisible on a good day and agonizing on a bad one. Here is what I took from it.
Metrics raise the alarm, logs tell the story
The first lesson is the division of labor between metrics and logs: metrics tell you that something is wrong and roughly where, and logs tell you what and why. In the incident, the metrics showed a spike in payment failures and a drop in completed transactions. That told us we had a problem and gave us a rough area, but it told us nothing about the cause. You cannot fix a spike; you fix the specific thing causing it, and the specific thing is in the logs.
So the metrics did their job, sounding the alarm and pointing us at the payment flow, and then we put them aside and went to the logs to find out what was actually happening to payments. This is the right sequence, and getting it wrong is a common mistake under pressure: people stare at the dashboards hoping the cause will appear, but a dashboard is an aggregate, and the cause is a specific failure in a specific transaction. The aggregate can tell you a thousand payments failed; only the logs can tell you they all failed at the same downstream call with the same error, which is the actual finding.
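To make the difference between the aggregate and the finding concrete, here is a minimal sketch that counts error entries by service and error message, so a shared failure point stands out. It assumes the logs have already been parsed into dictionaries; the field names, service names, and error strings are invented for illustration and are not from the incident.

```python
from collections import Counter

# Hypothetical, already-parsed log entries. Field names, service names,
# and error strings are invented for illustration.
sample_entries = [
    {"uetr": "u-001", "service": "payment-api", "level": "INFO", "error": None},
    {"uetr": "u-001", "service": "screening-client", "level": "ERROR",
     "error": "timeout calling screening service"},
    {"uetr": "u-002", "service": "screening-client", "level": "ERROR",
     "error": "timeout calling screening service"},
    {"uetr": "u-003", "service": "screening-client", "level": "ERROR",
     "error": "timeout calling screening service"},
    {"uetr": "u-004", "service": "ledger", "level": "ERROR",
     "error": "duplicate posting"},
]


def failure_signatures(entries):
    """Count ERROR entries by (service, error message)."""
    return Counter(
        (e["service"], e["error"]) for e in entries if e["level"] == "ERROR"
    )


signatures = failure_signatures(sample_entries)
for (service, error), count in signatures.most_common():
    print(f"{count:>4}  {service}: {error}")
```

A dashboard would report four failures here. The grouping shows that three of them share one service and one error, which is the kind of statement you can act on.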
Learning to switch quickly from “the metrics say it is bad” to “let me read the logs and find out why” is what separates a useful incident responder from a worried bystander. The metrics are the smoke; the logs are where you find the fire. This is the same relationship as in a single payment investigation, scaled up to many transactions at once and under far more time pressure, and the calm comes from knowing the logs hold the answer if you read them methodically.
The correlation id is everything, until it breaks
The second lesson, learned painfully, is that the correlation id is the single most important thing in the logs, and the moment it breaks, the investigation gets much harder. In the incident, I picked one failing payment and traced its UETR (the Unique End-to-end Transaction Reference, which served as our correlation id) across the services. For most of the flow it worked beautifully: I could see the payment move from service to service, with the timing and the steps laid out. Then, at one service, the trail went cold.
That service did not log the UETR. It logged its own internal id instead, so when I searched for the payment’s UETR, its entries simply were not there, and the story stopped exactly where I needed it most, because that service turned out to be near the failure. I had to reconstruct that part by hand, correlating timestamps and amounts, which under incident pressure cost us real minutes that felt like hours. The payment’s trail, so clean everywhere else, broke precisely at the worst moment, and the reason was a logging gap that had been invisible until this exact situation made it agonizing.
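The manual reconstruction looked roughly like the sketch below: take the amount and the last timestamp seen upstream, then look for entries in the silent service that match both within a small window. The entry shape, the internal ids, the timestamps, and the five-second window are all made up for illustration.

```python
from datetime import datetime, timedelta
from decimal import Decimal


def match_by_time_and_amount(entries, amount, seen_at, window_seconds=5):
    """Return entries with the same amount logged close to a known timestamp."""
    window = timedelta(seconds=window_seconds)
    return [
        e for e in entries
        if e["amount"] == amount and abs(e["ts"] - seen_at) <= window
    ]


# Hypothetical entries from the service that logs only its internal id.
silent_service_entries = [
    {"internal_id": "INT-7781", "ts": datetime(2024, 1, 1, 2, 14, 3),
     "amount": Decimal("1250.00")},
    {"internal_id": "INT-7782", "ts": datetime(2024, 1, 1, 2, 14, 4),
     "amount": Decimal("89.90")},
    {"internal_id": "INT-7790", "ts": datetime(2024, 1, 1, 2, 31, 0),
     "amount": Decimal("1250.00")},
]

# Last time the payment was seen upstream, where the UETR was still logged.
last_seen_upstream = datetime(2024, 1, 1, 2, 14, 1)

matches = match_by_time_and_amount(
    silent_service_entries, Decimal("1250.00"), last_seen_upstream
)
print([m["internal_id"] for m in matches])
```

This works only when the match is unique. Two payments of the same amount in the same few seconds leave you guessing, which is why it is a poor substitute for a shared id.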
That is when the lesson landed: a correlation id that is not carried through every service is a correlation id that will fail you during the incident when you need it most. On a normal day, nobody notices that one service logs a different id, because nobody is tracing a transaction across the whole flow under time pressure. During an incident, that gap is the difference between a five-minute diagnosis and a thirty-minute scramble. I had read about correlation ids; that night I felt why they matter, and it became one of the clearest requirements I have ever written: every service logs the end-to-end UETR on entry and exit, no exceptions.
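One way that requirement could look in code is sketched below with Python's standard logging module. It is not the implementation we shipped; the helper names make_service_logger and traced, the service name, and the log format are hypothetical. The point is only that the UETR is attached once and then appears on every line, including entry and exit.

```python
import io
import logging
import uuid
from contextlib import contextmanager


def make_service_logger(service, stream):
    """Build a logger whose format always includes the service and the UETR."""
    logger = logging.getLogger(f"demo.{service}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s service=%(service)s uetr=%(uetr)s %(message)s"
    ))
    logger.addHandler(handler)
    return logger


@contextmanager
def traced(logger, service, uetr):
    """Log entry and exit for one payment, tagging every line with its UETR."""
    log = logging.LoggerAdapter(logger, {"service": service, "uetr": uetr})
    log.info("entry")
    try:
        yield log
    except Exception as exc:
        log.error("exit status=failed error=%r", exc)
        raise
    else:
        log.info("exit status=ok")


stream = io.StringIO()
uetr = str(uuid.uuid4())  # a UETR has the shape of a version 4 UUID
logger = make_service_logger("fx-conversion", stream)

with traced(logger, "fx-conversion", uetr) as log:
    log.info("rate applied")

lines = stream.getvalue().splitlines()
print("\n".join(lines))
```

Because the adapter carries the id, a developer inside the block cannot forget to include it, and a search for the UETR returns this service's lines along with everyone else's.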
Staying methodical is the actual skill
The third lesson is that the technical skill of reading logs is necessary but not sufficient; the harder skill is staying methodical when everyone wants an answer now. The pressure in a major incident is real: the business is losing money, stakeholders are asking for updates every few minutes, and the temptation is to jump to conclusions and chase theories. The discipline is to keep following the method even when it feels too slow.
The method that held up under pressure was the same one that works calmly: start broad to scope the problem, then narrow to one transaction and trace it. I resisted the urge to guess at causes and instead picked one failing payment and followed it, filtering to error and warning levels to find the failure fast, then widening to read the full context around it. That discipline kept me from the two failure modes of incident response: thrashing between theories without evidence, and tunneling on the first plausible cause without confirming it. Following one real transaction to the actual point of divergence gave us a grounded answer rather than a guess, and a grounded answer is what actually ends an incident.
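The narrow-then-widen step can be sketched as a small function: filter to one transaction, find its first error or warning, then return the lines around it. This assumes parsed entries in time order; the field names, services, and messages are hypothetical.

```python
def first_failure_with_context(entries, uetr, before=2, after=1):
    """Find one payment's first ERROR or WARNING and return the lines around it."""
    # Narrow: keep only the entries for this transaction, in logged order.
    trail = [e for e in entries if e["uetr"] == uetr]
    for i, entry in enumerate(trail):
        if entry["level"] in {"ERROR", "WARNING"}:
            # Widen: include the steps just before and after the failure.
            return trail[max(0, i - before): i + after + 1]
    return []


# Hypothetical parsed entries, already sorted by time.
log_entries = [
    {"uetr": "u-001", "level": "INFO", "service": "payment-api", "msg": "received"},
    {"uetr": "u-002", "level": "INFO", "service": "payment-api", "msg": "received"},
    {"uetr": "u-001", "level": "INFO", "service": "validation", "msg": "validated"},
    {"uetr": "u-001", "level": "INFO", "service": "routing", "msg": "calling downstream"},
    {"uetr": "u-001", "level": "ERROR", "service": "routing", "msg": "downstream call failed"},
    {"uetr": "u-001", "level": "INFO", "service": "payment-api", "msg": "marked failed"},
]

context = first_failure_with_context(log_entries, "u-001")
for e in context:
    print(f'{e["level"]:<7} {e["service"]}: {e["msg"]}')
```

The context lines matter as much as the error itself, because the last successful step before the failure tells you where the payment diverged.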
What surprised me was how much the calm itself mattered. When I could say “I am tracing a specific failing payment, here is where it is breaking, here is the error,” it steadied the whole room, because it replaced anxiety with evidence. The methodical approach is not just more accurate; it is more reassuring to everyone waiting, which reduces the pressure that causes mistakes. That is a genuine production support skill, and it is mostly temperament: trust the method, follow the transaction, report what you actually see, not what you fear. The technical ability to read the logs is what makes the calm possible, because you are calm when you know you can find the answer.
Every incident is feedback on observability
The last lesson is that a major incident is the most honest feedback you will ever get on your system’s observability, and the gaps it exposes should become requirements while the pain is fresh. Everything that slowed the investigation (the broken correlation id, the service that logged nothing useful, the swallowed error) is a concrete observability gap, and the incident just showed you exactly how much it costs.
So after the incident, I do not just write the post-mortem of what broke; I write the observability requirements the incident revealed. Log the UETR at every service, the gap that cost us minutes. Log errors with enough context to diagnose them, because one service had logged a failure with no detail about which payment or why. Log key state transitions, so a stuck payment’s last known state is visible. Use log levels correctly, so errors are findable. Make sure everything is searchable in the central tool. Each of these is a requirement born directly from the incident, and each one makes the next incident faster to diagnose. The pain is the curriculum, and the requirements are what you learned.
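Requirements like these can also be checked against a sample of real log output before the next incident. The sketch below audits parsed entries against two of the rules above: every entry carries a UETR, and every error carries detail. The field names, rule names, and sample data are assumptions for illustration.

```python
def audit_entries(entries):
    """Return {service: sorted list of logging rules that service violated}."""
    violations = {}
    for e in entries:
        broken = set()
        if not e.get("uetr"):
            broken.add("missing_uetr")
        if e.get("level") == "ERROR" and not e.get("error"):
            broken.add("error_without_detail")
        if broken:
            violations.setdefault(e["service"], set()).update(broken)
    return {service: sorted(rules) for service, rules in violations.items()}


# Hypothetical sample of parsed entries from three services.
audit_sample = [
    {"service": "payment-api", "level": "INFO", "uetr": "u-001", "error": None},
    {"service": "routing", "level": "ERROR", "uetr": "u-001", "error": None},
    {"service": "fx-conversion", "level": "INFO", "uetr": None, "error": None},
    {"service": "fx-conversion", "level": "ERROR", "uetr": None, "error": "rate unavailable"},
]

report = audit_entries(audit_sample)
for service, rules in sorted(report.items()):
    print(f"{service}: {', '.join(rules)}")
```

Run against a day of production logs, a report like this turns the post-incident requirements into a list of services that still need work.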
This is the redemptive part of a bad night: it makes the system better if you let it. An incident that only produces a fix for the immediate cause has wasted most of its lessons; an incident that also produces observability improvements pays it forward, so the next one is shorter and calmer. Treating incidents as feedback on diagnosability, and capturing that feedback as requirements, is how a technical business analyst turns the worst part of the job into lasting value, and it is exactly the analyze-code-test-support loop the role is built on. The skills to do it are in The Technical Skills Guide for BAs.
The takeaway
In a major incident, metrics raise the alarm and logs tell the story, so you switch quickly from the dashboard to tracing individual transactions by their correlation id. That id is everything, and the moment it breaks at a service boundary, the investigation gets much harder, which is a lesson you feel rather than read. Staying methodical under pressure, following one real transaction to the point of divergence, is the actual skill, and it steadies the whole room. And every incident is honest feedback on observability whose gaps should become requirements while the pain is fresh.
One bad night reading logs taught me more about building diagnosable systems than any document, and the requirements it produced made every later incident shorter. Start with The Technical Skills Guide for BAs, or browse everything at The Tech BA Toolkit.
Ahmed is a Senior Technical Business Analyst with 10+ years in banking and payments. He builds practical guides and tools for analysts at The Tech BA Toolkit.
Tags: Observability, Production Support, Technical Skills, Incident Management, Career Growth