What is data lineage?

Data lineage is the documented path data takes from its source systems through every transformation, join, and aggregation to the tables, dashboards, and reports that finally display it. Complete lineage works at column level: it can show that a reported total was computed from a converted amount, which was computed from a source amount field in a specific system. It is the map that makes data explainable and makes changes safe.

Why is data lineage important?

Lineage answers the two highest-stakes questions in data work. Backward: where did this number come from, which is what an auditor, a regulator, or a skeptical executive asks when a figure is challenged. Forward: what is affected if this field changes, which is the impact analysis that prevents a source system change from silently breaking downstream reports. Without lineage both questions are answered by archaeology and guesswork.

What is the difference between table-level and column-level lineage?

Table-level lineage shows which datasets feed which: the payments staging table feeds the payments mart, which feeds the settlement dashboard. Column-level lineage goes further and shows how individual fields derive from one another, including the transformations applied: reported_total is the sum of fx_amount, which is amount multiplied by the day's rate. Table level is enough for orientation; column level is what audits and precise impact analysis require.

How does data lineage relate to regulatory requirements like BCBS 239?

BCBS 239, the Basel Committee's principles for risk data aggregation and reporting, requires banks to demonstrate that risk figures are accurate, complete, and traceable to their sources. In practice that means being able to walk a reported number back through every transformation to the originating systems, which is precisely column-level lineage. In banking, lineage is not a nice-to-have; it is how you answer the regulator.

How do you build or capture data lineage?

Lineage is captured three ways, usually combined: parsed automatically from transformation code, which is how dbt builds its DAG from ref() calls and how lineage tools parse SQL; emitted by pipelines as metadata using standards like OpenLineage; and documented manually for the hops automation cannot see, such as a spreadsheet step or a manual adjustment. Automated capture stays current; manual capture decays, so push lineage into code wherever possible.

Data Lineage: Trace Every Number Back to Its Source

Written by Ahmed, a Senior Technical Business Analyst at Analyst Engineering with 10+ years in banking and payments delivery.

Data lineage is the map of how every number travels from its source system, through each transformation, to the report that displays it. It answers two questions: backward, where did this figure come from, and forward, what breaks if I change this field. In banking those answers are regulatory obligations, not conveniences.

Data lineage is the documented path data takes from source systems through every transformation, join, and aggregation to the tables and reports that finally display it, ideally traced at the level of individual columns. Complete lineage lets you pick any figure on any dashboard and walk it backward hop by hop to the fields it came from, and pick any source field and walk it forward to everything it feeds. Those two walks are the point: the backward walk is how you defend a number to an auditor, and the forward walk is how you change a field without silently breaking nine reports you have never heard of. If you have built a requirements traceability matrix, you already know this structure; lineage is the same discipline aimed at data instead of requirements.

What does lineage actually look like?

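Here is column-level lineage for one number, a settlement total on a regulatory report, traced back to its source:

reported_total (regulatory report) = sum of fx_amount by value date
fx_amount = amount multiplied by the daily rate
amount = typed and validated from the core system’s amount field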

Read backward from the report: reported_total is the sum of fx_amount by value date; fx_amount is amount times the daily rate; amount was typed and validated from the core system’s field. Every hop names its transformation, so when the total is challenged, the explanation is a walk, not an investigation. Read forward from the source: if the core system changes the semantics of amount, the lineage tells you, before the change ships, that a regulatory figure is downstream.
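The backward walk described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular lineage tool stores its graph: the table prefixes (report, mart, staging, core, rates) and the rates.daily_rate column are hypothetical names added for the example; only reported_total, fx_amount, and amount come from the text.

```python
# Column-level lineage as a mapping: each derived column points to the columns
# it is computed from, plus a description of the transformation at that hop.
# Table prefixes and rates.daily_rate are hypothetical, for illustration only.
LINEAGE = {
    "report.reported_total": (["mart.fx_amount"], "sum of fx_amount by value date"),
    "mart.fx_amount": (["staging.amount", "rates.daily_rate"], "amount * daily rate"),
    "staging.amount": (["core.amount"], "typed and validated"),
}


def walk_back(column, lineage, depth=0):
    """Return indented lines describing every hop upstream of `column`."""
    if column not in lineage:
        # No recorded parents: treat the column as a source field.
        return ["  " * depth + f"{column} (source)"]
    parents, transformation = lineage[column]
    lines = ["  " * depth + f"{column} <- {transformation}"]
    for parent in parents:
        lines.extend(walk_back(parent, lineage, depth + 1))
    return lines


trail = walk_back("report.reported_total", LINEAGE)
print("\n".join(trail))
```

Each printed line is one hop with its transformation named, which is the form of answer an auditor’s question needs.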

What questions does lineage answer, and for whom?

Backward: where did this number come from? This is the trust question, and it arrives with stakes attached: an executive challenging a dashboard, an auditor sampling a filing, a regulator applying BCBS 239, the Basel principles that require banks to trace risk figures to their sources. With column-level lineage the answer takes minutes; without it, someone spends days reading SQL, which is the data equivalent of tracing a payment without a correlation ID. Lineage is the data’s UETR, the unique end-to-end transaction reference that lets a payment be followed across every hop.

Forward: what breaks if this changes? This is impact analysis, and it is where lineage saves the most money. A source team renames a field, tightens a type, or changes a code’s meaning, and every downstream consumer is at risk, mostly without knowing they are consumers. Forward lineage turns the late-change scramble into the same fast, reliable query that a traceability matrix gives you when a scheme rule changes: find the field, follow it forward, list the affected models and reports, done. It is also the data world’s version of the problem contract testing solves for APIs: the provider who cannot see their consumers will eventually break them.
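The "find the field, follow it forward, list what is affected" query can be sketched as a breadth-first walk over downstream edges. This is a sketch under assumptions: every column name below is hypothetical, and a real lineage store would be queried rather than held in a dictionary.

```python
from collections import deque

# Hypothetical edges: upstream column -> columns derived directly from it.
DOWNSTREAM = {
    "core.amount": ["staging.amount"],
    "staging.amount": ["mart.fx_amount", "mart.gross_amount"],
    "mart.fx_amount": ["report.reported_total"],
    "mart.gross_amount": ["dashboard.daily_volume"],
}


def impacted(column, downstream):
    """Breadth-first walk returning every column reachable from `column`."""
    seen = []
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


affected = impacted("core.amount", DOWNSTREAM)
print(affected)
```

Run before a source change ships, the resulting list is the set of models and reports whose owners need to hear about it.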

Sideways: is this the same number? When two reports disagree, lineage shows where their paths diverged, which transformation one applied and the other did not. That turns the recurring “why do these two dashboards disagree” dispute into a diffable pair of paths, the analytical sibling of a reconciliation break investigation.
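One way to picture the "diffable pair of paths" is to list the hops each report applies to the same source field and find the first place they differ. The two paths below are hypothetical, invented for the example.

```python
from itertools import zip_longest

# Hypothetical paths: the ordered hops each report applies to the same source.
report_a = ["core.amount", "typed and validated", "amount * daily rate", "sum by value date"]
report_b = ["core.amount", "typed and validated", "sum by value date"]


def first_divergence(path_a, path_b):
    """Return (index, step_a, step_b) for the first differing hop, or None."""
    for index, (step_a, step_b) in enumerate(zip_longest(path_a, path_b)):
        if step_a != step_b:
            return index, step_a, step_b
    return None


divergence = first_divergence(report_a, report_b)
print(divergence)
```

Here the paths part ways at the third hop: one report converts the amount before summing and the other does not, which would explain the disagreement.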

How is lineage captured without decaying?

The failure mode of lineage is the same as every manually maintained document: it is accurate the week it is written and fiction within a quarter. The defense is to capture lineage from the code that actually runs, so it cannot drift. Transformation frameworks do this natively: dbt builds its DAG from the ref() calls in your models, so the dependency graph is always exactly what the code says. Lineage tools parse SQL to extract column-level derivations, and pipeline standards like OpenLineage let jobs emit lineage metadata as they run. What automation cannot see, the spreadsheet step, the manual adjustment, the vendor black box, must be documented by hand, which is precisely why those hops should be eliminated where possible: every manual hop is a lineage gap, and a lineage gap is where the auditor’s question stops having an answer.
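To make "lineage from code" concrete, here is a rough sketch that derives a table-level dependency graph from ref() calls in model SQL. It is an illustration of the idea only: dbt itself resolves ref() while rendering the model’s Jinja, not with a regular expression, and the model names and SQL below are hypothetical.

```python
import re

# Matches {{ ref('model_name') }} with single or double quotes. Covers the
# one-argument form only; two-argument ref('package', 'model') calls are ignored.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")

# Hypothetical models: name -> SQL text as it would appear in a model file.
MODELS = {
    "stg_payments": "select cast(amount as numeric) as amount from raw.payments",
    "fct_payments_fx": """
        select p.amount * r.rate as fx_amount
        from {{ ref('stg_payments') }} p
        join {{ ref("stg_fx_rates") }} r on p.value_date = r.rate_date
    """,
    "rpt_settlement": "select sum(fx_amount) as reported_total from {{ ref('fct_payments_fx') }}",
}


def build_dag(models):
    """Map each model to the sorted list of models it references."""
    return {name: sorted(set(REF_PATTERN.findall(sql))) for name, sql in models.items()}


dag = build_dag(MODELS)
for model, parents in dag.items():
    print(model, "<-", parents)
```

Because the graph is rebuilt from the SQL every time, it cannot drift from what runs. Note that this yields table-level edges only; column-level derivations need a real SQL parser.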

For an analyst, lineage work leans on skills already in the kit. Reading the transformations is SQL. Pinning down what each field means at each hop is the data dictionary. Verifying the lineage is honest, that the documented path matches what actually runs, is the same test-it-yourself discipline you apply to any system: pick one number on one report and walk it back yourself. If the walk and the documentation disagree, you have found either a lineage gap or a defect, and both are findings.
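The "pick one number and walk it back" check can be sketched as recomputing the reported figure from source rows using the documented transformations, then comparing. All values here are made-up sample data, and the function name is hypothetical.

```python
from decimal import Decimal

# Made-up sample data standing in for source rows, the rate table, and the
# figure shown on the report.
source_rows = [
    {"value_date": "2024-03-01", "amount": "100.00"},
    {"value_date": "2024-03-01", "amount": "250.50"},
    {"value_date": "2024-03-02", "amount": "75.00"},
]
daily_rate = {"2024-03-01": Decimal("1.10"), "2024-03-02": Decimal("1.12")}
reported_total = {"2024-03-01": Decimal("385.55")}


def recompute_total(rows, rates, value_date):
    """Apply the documented hops: amount * daily rate, summed by value date."""
    total = Decimal("0")
    for row in rows:
        if row["value_date"] == value_date:
            total += Decimal(row["amount"]) * rates[value_date]
    return total


recomputed = recompute_total(source_rows, daily_rate, "2024-03-01")
matches = recomputed == reported_total["2024-03-01"]
print(recomputed, matches)
```

If matches comes back False, the documented path and the running pipeline disagree, and that disagreement is the finding.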

The takeaway

Data lineage maps how every field flows from source through each transformation to the reports that display it, at its best down to individual columns with each hop’s transformation named. Backward it answers where a number came from, which in banking is a regulatory obligation; forward it answers what a change will break, which is impact analysis; sideways it explains why two reports disagree. Capture it from code rather than documents so it cannot decay, and verify it the way you verify everything else: pick a number and walk the path yourself.

A report figure without lineage is a claim. With lineage, it is a conclusion.
