I was recently consulting with a founder building a lead enrichment pipeline. VC-funded, small team, new technology. The kind of company where everyone wears three hats and ships fast. That’s the value of startups: moving fast when bigger companies can’t.
The pipeline pulled GitHub profiles and matched them to LinkedIn accounts, cross-referencing company pages along the way. The model returned 0.78 confidence on a match.
The match was wrong. Same company. Same first name. Different person.
We needed to understand why.
The pushback surprised me. Not because they disputed the wrong match — they couldn’t — but because of what fixing it implied.
“Let’s rebuild the system. We will do better.”
“We need to hire ML researchers.”
“We just need to keep iterating.”
But none of that would help, because we didn’t know what was guiding the model’s predictions. Were we feeding it the wrong context? And meanwhile, the pipeline still had to be reliable and fast.
And here’s what I kept coming back to:
Could anyone explain why this match was supposedly correct?
Not “the model said 0.78 confidence.” Why 0.78? What did it weigh? What did it consider and reject? What evidence supported the decision?
Nobody could answer that, because the system wasn’t built to answer it. It made it easy for us to be confidently wrong.
What was really being resisted, I think, wasn’t the idea of slowing down. It was the idea of confronting a harder question:
Can we measure and explain why our pipeline is predicting these results?
That question is uncomfortable.
It implies the current system might be confidently wrong in ways we can’t see.
It suggests that shipping faster won’t fix a visibility problem.
It means admitting that “0.78 confidence” is a number, not an explanation.
This isn’t a new problem.
Finance Solved This Decades Ago
After Enron collapsed in 2001, the response wasn’t “we need better accountants.” It was Sarbanes-Oxley — a framework that said: every financial decision must leave a trail.
Even when I Venmo my buddy Brett $23.44 for pizza in Brooklyn, there’s a record. Corporate finance does this at scale: every journal entry traceable to source documents.
"The silent, immeasurable value of well-designed safeguards lies in the scandals they prevent from happening."
The principle is the same for AI. You can build the audit trail now — or wait until someone makes you.
We Build Dashboards Instead
We talk about evals and observability like they solve the problem.
Observability tells you what happened — latency, cost, throughput. Evals tell you how often you’re right — 85% accuracy on your test set.
Neither tells you why any individual decision was made.
And here’s the part nobody talks about: evals don’t lead to improvements on their own. They surface the gap — “we’re wrong 15% of the time” — but they don’t give you the context to close it. You still need someone staring at failures, guessing what went wrong, and manually rewriting prompts.
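To make the distinction concrete, here is roughly what each layer captures. This is a hedged sketch; the metric names and numbers are illustrative, not from the pipeline in question.

```python
# Observability: what happened, operationally. No decisions in here.
observability = {"latency_p95_ms": 420, "cost_per_1k_calls_usd": 3.10, "error_rate": 0.002}

# Evals: how often you're right, in aggregate. No individual reasoning in here.
evals = {"test_set_accuracy": 0.85, "false_match_rate": 0.15}

# Missing from both: why one specific match chose one candidate over the other three.
```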
I see this in startups all the time. The pressure to ship is constant. So we build dashboards instead of audit trails. We measure what’s easy instead of what matters.
And I get it. Admitting your system might be confidently wrong — despite the impressive demo, despite the confidence scores — is uncomfortable. It’s easier to stay busy. Ship the next feature. Hope the problems shake out before the next board meeting.
So we tried something different.
What Would Have Helped
After the wrong match, we rebuilt how the pipeline made decisions and how it explained them.
For that same profile, the new system logged:
Lead Enrichment Pipeline · GitHub → LinkedIn Match · Traced Decision

GitHub bio: "Software Engineer. Building at Guilder Inc. Passionate about revenge and sword fighting."

Company: Guilder Inc → Guilder Inc (exact match)
Name: Íñigo Montoya → Íñigo Montoya Fernández (partial match)
Title: Software Engineer → CTO (mismatch)
Not just “0.78 confidence” but “here are the 4 candidates, here’s what matched, here’s what didn’t, here’s the decision.” When the next wrong match happens, you can pull the thread.
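Concretely, a decision trace can be as simple as a data structure that travels with every match. Here is a minimal sketch in Python; the field names, URL, and scoring are illustrative assumptions, not the pipeline's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FieldComparison:
    """One piece of evidence: a GitHub field compared against a LinkedIn field."""
    name: str
    github_value: str
    linkedin_value: str
    matched: bool
    note: str = ""


@dataclass
class CandidateTrace:
    """Everything the matcher considered for one LinkedIn candidate."""
    linkedin_url: str
    comparisons: list[FieldComparison] = field(default_factory=list)
    score: float = 0.0


@dataclass
class MatchDecision:
    """The full audit record for one GitHub-to-LinkedIn match attempt."""
    github_login: str
    candidates: list[CandidateTrace] = field(default_factory=list)
    chosen_url: str = ""
    rationale: str = ""


# Illustrative trace for the wrong match described above (login and URL are made up).
decision = MatchDecision(
    github_login="inigo-montoya",
    candidates=[
        CandidateTrace(
            linkedin_url="linkedin.com/in/inigo-montoya-fernandez",
            comparisons=[
                FieldComparison("company", "Guilder Inc", "Guilder Inc", True),
                FieldComparison("full_name", "Íñigo Montoya", "Íñigo Montoya Fernández",
                                False, "first name matches, surname differs"),
                FieldComparison("title", "Software Engineer", "CTO", False),
            ],
            score=0.78,
        ),
        # ...the three other candidates, each with their own comparisons and scores
    ],
    chosen_url="linkedin.com/in/inigo-montoya-fernandez",
    rationale="Company match dominated the score; name and title mismatches were never surfaced.",
)
```

The point isn't this particular schema. It's that the candidates, the evidence, and the final call get stored next to the score instead of being thrown away.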
The founder asked me later: “How do we know the good matches are actually good?”
You can’t trust the hits if you can’t explain the misses.
Auditability is the difference between knowing your system works and hoping it does.
I know this isn’t the first thing you’ll build.
There’s a difference between proof it works and proof it scales. The gap between them is the trail you can’t show yet.