"Where did this number come from?"
It's the question every analyst dreads. Not because you don't know—you probably do, somewhere in your memory—but because explaining it is surprisingly hard. The honest answer often involves a SQL query you wrote three weeks ago, a notebook you've modified several times since then, some manual adjustments you made in Excel, and a filter you may or may not have applied consistently.
Data lineage—the chain of custody from raw data to final insight—is the hardest thing to maintain in analytical work. When it breaks, trust breaks with it. Stakeholders don't stop using your numbers, but they start adding mental asterisks. They ask for double-checks more often. They bring their own calculations to meetings. The erosion is gradual but real.
The fundamental problem is that analysis is iterative. You don't write code once, run it, and ship the output. You explore. You iterate. You backtrack when something doesn't look right. You refine when you learn something new. By the time you're done, the notebook has been through dozens of versions, and you've probably run cells out of order more times than you can count.
Which version produced the chart that ended up in the report? That's not a question most analytical tools help you answer. The notebook knows its current state, but it doesn't remember the state it was in when you copied that specific output. The report knows it has a chart, but it doesn't know anything about where that chart came from.
This gets worse when you consider that reports are fundamentally disconnected from notebooks. They live in different systems—Jupyter in one place, Google Slides or Notion in another. The moment you screenshot a chart and paste it into a slide, you've severed the connection. That chart is now just pixels. No metadata. No link to the source. No way to trace back.
Version control helps with code, but it doesn't solve the lineage problem. Git tracks every change to your notebook, which is great. But Git doesn't track which version of the notebook produced which version of the report. And it certainly doesn't help stakeholders who don't use Git and wouldn't know how to find a specific commit even if they did.
Then there's the data itself. The dataset you analyzed last week might not be the same as the dataset today. If someone adds rows, fixes errors, or backfills missing values, your notebook will produce different results the next time it runs. The numbers in the report are now orphans—they came from a version of reality that no longer exists in the system.
Good data lineage answers three questions. First, what data was used? Not just "the customer table" but the specific dataset, with whatever filters were applied. Second, what code produced the output? The exact notebook and cell, in the state they were in when the output was generated. Third, when did this happen? A timestamp that lets you understand the context and check what's changed since.
The key is that this chain should be unbroken: from a claim in a stakeholder-facing report, all the way back to the specific dataset that underlies it. Anyone should be able to trace that path without asking you.
With those three pieces of information, anyone can audit a finding. They can pull the same data, run the same code, and verify the result. They can understand the methodology without having to ask you. They can trace a conclusion back to its assumptions.
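To make that concrete, here's what a lineage record could look like. This is just a sketch in Python; the field names and example values are made up for illustration, not any particular tool's format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class LineageRecord:
    """The three facts needed to trace an output back to its source."""
    dataset: str            # what data: a specific table, file, or snapshot
    filters: str            # what was included or excluded, spelled out
    code_ref: str           # what code: notebook and cell, at a known revision
    generated_at: datetime  # when: timestamp of the run that produced the output

# A hypothetical record behind one chart in a report
chart_lineage = LineageRecord(
    dataset="warehouse.customers_snapshot_2024_11_15",
    filters="active accounts only, signup_date >= 2023-01-01",
    code_ref="notebooks/churn_analysis.ipynb, cell 12, commit 4f2a9c1",
    generated_at=datetime(2024, 11, 15, 14, 30, tzinfo=timezone.utc),
)
```

Whether this lives in a tool, a sidecar file, or a footnote matters less than the fact that all three fields exist for every number a stakeholder will see.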
This isn't about perfect reproducibility in the scientific sense. Data changes. Business contexts evolve. The goal isn't to freeze everything in amber. The goal is traceability—knowing enough to investigate when questions arise, and having a clear path from any claim back to its source.
Without proper tooling, teams develop workarounds. They're not elegant, but they sort of work until they don't.
The most common approach is documentation—adding notes to charts, keeping a changelog, writing methodology sections that explain where numbers came from. This works as long as the documentation stays current, which it often doesn't. Documentation is the first thing that slips when you're rushing to meet a deadline.
Some teams use naming conventions. The analysis file is called customer_churn_v3_final_FINAL.ipynb and the report references that name. This creates a loose connection, but it breaks as soon as someone renames a file or creates a v4 without updating all the references.
Email chains serve as an informal audit trail. "See my analysis from November 15th" at least gives someone a starting point for the search. But email is a terrible system of record. Messages get deleted, search fails, and good luck finding that one attachment from eight months ago.
The worst but most common approach is tribal knowledge. "Ask Sarah, she did that analysis." This works until Sarah goes on vacation, changes jobs, or simply forgets. The lineage is locked in someone's head instead of embedded in the artifacts.
Instead of documenting lineage manually, you can build it into the artifacts themselves. When you embed a chart in a Margin brief, the system records which notebook cell created it, which dataset was used, and when. The connection stays intact. Anyone with access to the brief can trace any claim back through the code to the underlying data—without asking you.
This changes the dynamic entirely. When a stakeholder asks "where did this number come from?" the answer isn't a scavenger hunt. It's a click. They can see the code, see the data reference, and understand the methodology without scheduling a meeting to ask you about it.
Updates become cleaner too. When you change the analysis and re-run the notebook, the brief can show the updated outputs. There's no manual copy-paste step where the connection breaks. The lineage survives the update because it was never severed in the first place.
Tools help, but they're not sufficient on their own. Good data lineage also requires some discipline around how work gets done.
Link instead of copying whenever possible. Every time you screenshot a chart and paste it somewhere, you're creating a lineage gap. If your tools support embedding or linking, use that instead. The extra few seconds are worth it.
Document your assumptions explicitly, especially the ones that aren't obvious from the code. What time range did you use? What did you exclude and why? What would make you question these results? These notes are cheap insurance against future confusion.
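One cheap way to do this is an explicit assumptions block that lives at the top of the notebook, next to the code, rather than in a separate document. A minimal sketch, with illustrative values:

```python
# Assumptions behind this analysis, kept with the code so they travel with it.
# All values below are illustrative placeholders.
ASSUMPTIONS = {
    "time_range": "2024-01-01 through 2024-10-31",
    "excluded": "internal test accounts and orders under $1",
    "why_excluded": "test data skews averages; sub-$1 orders are payment retries",
    "would_question_results": "a backfill of the orders table or a refund policy change",
}
```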
Version your datasets when you can. If you're analyzing a snapshot, name it with a date. If you're pulling from a live system, record the query timestamp. Immutable references are better than "the latest data" when you need to understand what you were looking at.
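If you work in pandas, this habit can be a small helper you reuse. A sketch, assuming a local snapshots folder; the function name and paths are illustrative:

```python
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd

def save_snapshot(df: pd.DataFrame, name: str, out_dir: str = "snapshots") -> Path:
    """Write a date-stamped copy of the data this analysis used and return its path."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%SZ")
    path = Path(out_dir) / f"{name}_{stamp}.csv"
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path

# Pulling from a live system instead? At least record when the query ran:
# query_ran_at = datetime.now(timezone.utc).isoformat()
```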
Make your code accessible to the people who might need to verify it. This doesn't mean everyone needs to read Python. It means the code should be findable, runnable, and documented enough that someone who does know Python can follow along.
AI is making it easier to generate analysis. You can ask an LLM to create a chart, summarize a dataset, or answer questions about your data. This is genuinely useful, but it makes provenance more important than ever.
When a human writes analysis, there's at least an implicit chain of reasoning. You know someone thought about the data, made choices, and produced an output. When AI generates analysis, that chain can be invisible. The output appears fully formed with no obvious trail back to its source.
Blind trust in AI-generated numbers is dangerous. So is blind trust in human-generated numbers, for that matter. Data lineage is the antidote to blind trust. It's the evidence that real work happened with real data, and that the path from any claim in a report can be traced all the way back to the source dataset.
Margin briefs maintain data lineage automatically. Every chart traces back to the code and the dataset that made it. Try it free.