I was recently consulting with a founder building a lead enrichment pipeline. VC-funded, small team, new technology. The kind of company where everyone wears three hats and ships fast. That’s the value of startups: moving fast when bigger companies can’t.
The pipeline pulled GitHub profiles and matched them to LinkedIn accounts, cross-referencing company pages along the way. The model returned 0.78 confidence on a match.
The model gave us the wrong person. Same company. Same first name. Different person.
We needed to understand why.
The pushback surprised me. Not because they disputed the wrong match — they couldn’t — but because of what fixing it implied.
“Let’s rebuild the system. We will do better.”
“We need to hire ML researchers.”
“We just need to keep iterating.”
But none of that would help, because we didn’t know what guided the model to those predictions. Were we feeding it the wrong context? We couldn’t tell. And meanwhile, we still needed the pipeline to be reliable and fast.
But here’s what I kept coming back to:
Could anyone explain why this match was supposedly correct?
Not “the model said 0.78.” Why 0.78? What did it weigh? What did it consider and reject? What evidence supported the decision?
Nobody could answer that, because the system wasn’t built to answer it. It was built to hand us a number, and that made it easy to do the wrong thing with confidence.
What was really being resisted, I think, wasn’t the idea of slowing down. It was the idea of confronting a harder question:
Can we measure and explain why our pipeline is predicting these results?
That question is really uncomfortable.
It implies the current system might be confidently wrong in ways we can’t see.
It suggests that shipping faster won’t fix a visibility problem.
It means admitting that “0.78 confidence” is a number, not an explanation.
This isn’t a new problem.
Finance Solved This Decades Ago
Remember Enron? In 2001, a $60 billion company collapsed because accounting firms policed themselves. They earned more by selling advice to clients than by auditing them. The conflicts of interest allowed enormous deceptions to go unnoticed. Over 50,000 people lost their jobs.
The response wasn’t “we need better accountants.” It was Sarbanes-Oxley — a framework that said: every financial decision must leave a trail. It passed 423-3 in the House and 99-0 in the Senate.
Every transaction has a paper trail. Even when I Venmo my buddy Brett $23.44 for pizza in Brooklyn, there’s a record: timestamp, amount, who paid whom. Corporate finance just does this at scale, with every journal entry traceable to source documents. If the numbers don’t match, you pull the thread until you find what broke.
As Sherron Watkins and Cynthia Cooper, the whistleblowers who exposed Enron and WorldCom, wrote in the New York Times earlier this year, “systemic risks metastasize in regulatory gaps.”
“The silent, immeasurable value of well-designed safeguards lies in the scandals they prevent from happening.”
The principle is the same for AI. When something’s wrong, you can find it. When something’s right, you can prove it.
We Build Dashboards Instead
We talk about Evals and Observability like they solve the problem. And they don’t.
Evals tell you the system is 85% accurate. They don’t tell you why any individual decision was made. Observability shows you costs and latency. It doesn’t show you the chain of reasoning that led to a given prediction.
We’ve built systems that know they’re 85% accurate, but can’t explain the other 15%.
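To make that gap concrete, here’s a minimal sketch of what those two kinds of tooling typically capture. Every field name and number below is hypothetical, not pulled from this pipeline:

```python
# Hypothetical records, illustrating the shape of typical tooling output.
# Every name and number below is made up for this sketch.

eval_report = {
    "dataset": "matches_labeled.jsonl",  # an offline eval set
    "accuracy": 0.85,                    # aggregate quality over the whole set
    "n_examples": 1200,
}

observability_event = {
    "request_id": "req-4821",
    "model": "matcher-v2",
    "latency_ms": 840,    # how long the call took
    "cost_usd": 0.004,    # what it cost
}

# Neither record can answer the question that matters for a single lead:
# why did req-4821 match this GitHub profile to that LinkedIn account,
# and what evidence pushed the score to 0.78?
```

Both are worth having. Neither is an explanation.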
I see this in startups all the time. The pressure to ship is constant. Asking “can we explain our predictions?” feels like a luxury nobody can afford. The roadmap is full. Everyone’s plate is full.
Hard to Argue With That
In a certain light, “let’s rebuild” and “we need ML researchers” sound like reasonable responses. But really they’re ways to stay busy without confronting the harder question.
Keep busy. Show progress. Hope the problems shake out before the next board meeting.
And I get it. Admitting your system is fundamentally broken, despite the impressive demo, despite the confidence scores, is as hard as admitting that you and the person you love can’t build a life together. The chemistry was real. But when the rubber met the road, you couldn’t explain why it wasn’t working.
So we tried something different.
What Would Have Helped
After the wrong match, we rebuilt the pipeline to log evidence spans for every decision. Think of them as a record of decision checkpoints: what the model saw, what it considered, and why it chose what it chose.
For that same profile, the new system logged (names anonymized obvi):
Lead Enrichment Pipeline
GitHub → LinkedIn Match · Traced Decision
GitHub bio: “Software Engineer. Building at Guilder Inc. Passionate about revenge and sword fighting.”
Top candidate (of 4 considered):
Company: Guilder Inc (GitHub) → Guilder Inc (LinkedIn) · match
Name: Íñigo Montoya (GitHub) → Íñigo Montoya Fernández (LinkedIn) · partial match
Title: Software Engineer (GitHub) → CTO (LinkedIn) · mismatch
That’s the difference. Not just “0.78 confidence” but “here are the 4 candidates, here’s what matched, here’s what didn’t, here’s the decision.” When the next wrong match happens, you can pull the thread.
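For what it’s worth, here’s a minimal sketch of what logging those evidence spans could look like, assuming a Python pipeline. The class names, fields, and explain() helper are mine, chosen for illustration, not the system’s actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceSpan:
    """One piece of evidence compared for one candidate."""
    field_name: str        # e.g. "company", "name", "title"
    github_value: str
    linkedin_value: str
    verdict: str           # "match", "partial", or "mismatch"

@dataclass
class MatchDecision:
    """A traced decision: every candidate considered, and the evidence behind the pick."""
    github_profile: str
    candidates: dict[str, list[EvidenceSpan]] = field(default_factory=dict)
    chosen: str | None = None
    score: float | None = None

    def explain(self) -> str:
        """Render the trail as text, so a human can pull the thread later."""
        lines = [f"GitHub profile: {self.github_profile}"]
        for name, spans in self.candidates.items():
            lines.append(f"  Candidate: {name}")
            lines.extend(
                f"    {s.field_name}: {s.github_value!r} vs {s.linkedin_value!r} -> {s.verdict}"
                for s in spans
            )
        lines.append(f"Decision: {self.chosen} (score={self.score})")
        return "\n".join(lines)

# Usage, mirroring the anonymized trace above (one of the four candidates shown):
decision = MatchDecision(github_profile="Íñigo Montoya")
decision.candidates["Íñigo Montoya Fernández"] = [
    EvidenceSpan("company", "Guilder Inc", "Guilder Inc", "match"),
    EvidenceSpan("name", "Íñigo Montoya", "Íñigo Montoya Fernández", "partial"),
    EvidenceSpan("title", "Software Engineer", "CTO", "mismatch"),
]
decision.chosen = "Íñigo Montoya Fernández"
decision.score = 0.78
print(decision.explain())
```

The specific structure matters less than the design choice: the explanation gets written down at decision time, not reconstructed after something goes wrong.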
The founder asked me later: “How do we know the good matches are actually good?” The truth is that you can’t trust the hits if you can’t explain the misses.
Evidence spans build the trail that lets you trace every decision. So when things go south, you know why.
What would it take to make auditability feel like a feature, not a tax?