Data Lineage: Trace Every Number Back to Its Source
Written by Ahmed at Analyst Engineering, a Senior Technical Business Analyst with 10+ years in banking and payments delivery.
Key takeaways
- Data lineage is the map of how data flows from source systems through every transformation to the reports that display it, ideally down to the individual column.
- Lineage answers the two questions that matter most in data work: where did this number come from (trust and audit), and what breaks if I change this field (impact analysis).
- Table-level lineage tells you which datasets feed which; column-level lineage tells you that reported_total was computed from fx_amount which was computed from amount, which is what an auditor actually asks for.
- Lineage is requirements traceability applied to data: the same forward and backward walk, with the same payoff, proving completeness and making change safe.
Data lineage is the map of how every number travels from its source system, through each transformation, to the report that displays it. It answers backward, where did this figure come from, and forward, what breaks if I change this field, and in banking those answers are regulatory obligations, not conveniences.
Data lineage is the documented path data takes from source systems through every transformation, join, and aggregation to the tables and reports that finally display it, ideally traced at the level of individual columns. Complete lineage lets you pick any figure on any dashboard and walk it backward hop by hop to the fields it came from, and pick any source field and walk it forward to everything it feeds. Those two walks are the point: the backward walk is how you defend a number to an auditor, and the forward walk is how you change a field without silently breaking nine reports you have never heard of. If you have built a requirements traceability matrix, you already know this structure; lineage is the same discipline aimed at data instead of requirements.
What does lineage actually look like?
Here is column-level lineage for one number, a settlement total on a regulatory report, traced back to its source:
Read backward from the report: reported_total is the sum of fx_amount by value date; fx_amount is amount times the daily rate; amount was typed and validated from the core system’s field. Every hop names its transformation, so when the total is challenged, the explanation is a walk, not an investigation. Read forward from the source: if the core system changes the semantics of amount, the lineage tells you, before the change ships, that a regulatory figure is downstream.
What questions does lineage answer, and for whom?
Backward: where did this number come from? This is the trust question, and it arrives with stakes attached: an executive challenging a dashboard, an auditor sampling a filing, a regulator applying BCBS 239, the Basel principles that require banks to trace risk figures to their sources. With column-level lineage the answer is minutes; without it, someone spends days reading SQL, which is the data equivalent of tracing a payment without a correlation id. The lineage is the data’s UETR.
Forward: what breaks if this changes? This is impact analysis, and it is where lineage saves the most money. A source team renames a field, tightens a type, or changes a code’s meaning, and every downstream consumer is at risk, mostly without knowing they are consumers. Forward lineage turns the late-change scramble into the same fast, reliable query that a traceability matrix gives you when a scheme rule changes: find the field, follow it forward, list the affected models and reports, done. It is also the data world’s version of the problem contract testing solves for APIs: the provider who cannot see their consumers will eventually break them.
Sideways: is this the same number? When two reports disagree, lineage shows where their paths diverged, which transformation one applied and the other did not. That turns the recurring “why do these two dashboards disagree” dispute into a diffable pair of paths, the analytical sibling of a reconciliation break investigation.
How is lineage captured without decaying?
The failure mode of lineage is the same as every manually maintained document: it is accurate the week it is written and fiction within a quarter. The defense is to capture lineage from the code that actually runs, so it cannot drift. Transformation frameworks do this natively, dbt builds its DAG from the ref() calls in your models, so the dependency graph is always exactly what the code says. Lineage tools parse SQL to extract column-level derivations, and pipeline standards like OpenLineage let jobs emit lineage metadata as they run. What automation cannot see, the spreadsheet step, the manual adjustment, the vendor black box, must be documented by hand, which is precisely why those hops should be eliminated where possible: every manual hop is a lineage gap, and a lineage gap is where the auditor’s question stops having an answer.
For an analyst, lineage work leans on skills already in the kit. Reading the transformations is SQL. Pinning down what each field means at each hop is the data dictionary. Verifying the lineage is honest, that the documented path matches what actually runs, is the same test-it-yourself discipline you apply to any system: pick one number on one report and walk it back yourself. If the walk and the documentation disagree, you have found either a lineage gap or a defect, and both are findings.
The takeaway
Data lineage maps how every field flows from source through each transformation to the reports that display it, at its best down to individual columns with each hop’s transformation named. Backward it answers where a number came from, which in banking is a regulatory obligation; forward it answers what a change will break, which is impact analysis; sideways it explains why two reports disagree. Capture it from code rather than documents so it cannot decay, and verify it the way you verify everything else: pick a number and walk the path yourself.
A report figure without lineage is a claim. With lineage, it is a conclusion.
About the author
Analyst Engineering is written by Ahmed, a Senior Technical Business Analyst with 10+ years of banking and payments delivery experience: ISO 20022 and SWIFT messaging, payments API integration, Kafka event validation, and production support. Every article comes from real delivery work, and each one is reviewed and updated as tools and standards change.
Related articles
- The Requirements Traceability Matrix: From Requirement to Test, Proven What a requirements traceability matrix is, how to build one, and why it proves every requirement is designed, built, and tested. With a payments example.
- Reading a dbt DAG: The Map of How Your Data Is Built A dbt DAG is the dependency graph of your transformations, built from ref() calls. How to read one, the staging to marts convention, and why tests live on nodes.
- The Data Dictionary: Every Field, Defined Once What a data dictionary is, how to build one, and why a single authoritative definition of every field prevents the ambiguity that breaks integrations and reports.
- Reconciliation Design: Proving Two Systems Agree How to design reconciliation between systems: matching keys, break detection, tolerance, timing, and exception handling. The control that proves the money is right.
Newsletter
Subscribe
Practical, no-fluff playbooks for technical analysts who analyze, code, test, and support. New articles straight to your inbox.
No spam. Unsubscribe anytime.