I was recently consulting with a founder building a lead enrichment pipeline. VC-funded, small team, new technology. The kind of company where everyone wears three hats and ships fast. That’s the value of startups: moving fast when bigger companies can’t.

The pipeline pulled GitHub profiles and matched them to LinkedIn accounts, cross-referencing company pages along the way. The model returned 0.78 confidence on a match.

The model gave us the wrong person. Same company. Same first name. Different person.

We needed to understand why.

The pushback surprised me. Not because they disputed the wrong match — they couldn’t — but because of what fixing it implied.

“Let’s rebuild the system. We will do better.”

“We need to hire ML researchers.”

“We just need to keep iterating.”

But none of that would help, because we didn’t know what guided the model to those predictions. Were we feeding it the wrong context? We couldn’t say. And we still needed the pipeline to be fast and reliable.

And here’s what I kept coming back to:

Could anyone explain why this match was supposedly correct?

Not “the model said 0.78 confidence.” Why 0.78? What did it weigh? What did it consider and reject? What evidence supported the decision?

Nobody could answer that, because the system wasn’t built to answer it. It wasn’t helping us do the right thing; it was enabling us to do the wrong one.

What was really being resisted, I think, wasn’t the idea of slowing down. It was the idea of confronting a harder question:

Can we measure and explain why our pipeline is predicting these results?

That question is really uncomfortable.
It implies the current system might be confidently wrong in ways we can’t see. It suggests that shipping faster won’t fix a visibility problem. It means admitting that “0.78 confidence” is a number, not an explanation.

This isn’t a new problem.

Finance Solved This Decades Ago

After Enron collapsed in 2001, the response wasn’t “we need better accountants.” It was Sarbanes-Oxley, a 2002 law with a simple premise: every financial decision must leave a trail.

Even when I Venmo my buddy Brett $23.44 for pizza in Brooklyn, there’s a record. Corporate finance does this at scale: every journal entry traceable to source documents.

The principle is the same for AI. You can build the audit trail now — or wait until someone makes you.

We Build Dashboards Instead

We talk about Evals and Observability like they solve the problem.

Observability tells you what happened — latency, cost, throughput. Evals tell you how often you’re right — 85% accuracy on your test set.

Neither tells you why any individual decision was made.

And here’s the part nobody talks about: evals don’t lead to improvements. They surface the gap — “we’re wrong 15% of the time” — but they don’t give you the context to close it. You still need someone staring at failures, guessing what went wrong, manually rewriting prompts.
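
To make that concrete, here’s a minimal sketch of what most eval harnesses reduce to. `match_profile` and `test_set` are hypothetical stand-ins, not the founder’s actual pipeline:

```python
# A minimal eval loop: it produces the aggregate number and throws away
# everything you would need to explain any single decision.
# match_profile and test_set are hypothetical stand-ins.

def evaluate(match_profile, test_set):
    correct = 0
    for github_profile, true_linkedin_id in test_set:
        predicted = match_profile(github_profile)
        if predicted == true_linkedin_id:
            correct += 1
    return correct / len(test_set)

# accuracy = evaluate(match_profile, test_set)
# print(f"{accuracy:.0%}")  # prints "85%"; says nothing about why the other 15% failed
```

The number tells you the gap exists. Nothing in that loop records what any individual match weighed or rejected, so closing the gap still means manual archaeology.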

I see this in startups all the time. The pressure to ship is constant. So we build dashboards instead of audit trails. We measure what’s easy instead of what matters.

And I get it. Admitting your system might be confidently wrong — despite the impressive demo, despite the confidence scores — is uncomfortable. It’s easier to stay busy. Ship the next feature. Hope the problems shake out before the next board meeting.

So we tried something different.

What Would Have Helped

After the wrong match, we rebuilt how the pipeline made decisions and how it explained them.

For that same profile, the new system logged:

Evidence Span Trace · Lead Enrichment Pipeline
GitHub → LinkedIn Match · Traced Decision

Source (GitHub): Íñigo Montoya (@inigo)
“Software Engineer. Building at Guilder Inc. Passionate about revenge and sword fighting.”
Florin · 47 repos · 892 followers

Candidates evaluated:
1. Íñigo Montoya, SAP Analytics Consultant · Freelance → Rejected (name match, company mismatch)
2. Íñigo Ruiz, Managing Partner · Ruiz & Associates → Rejected (name partial match only)
3. Íñigo Montoya Fernández, CTO · Guilder Inc → Match (name + company match)

Evidence breakdown:
Company 95% · Guilder Inc → Guilder Inc
Name 85% · Íñigo Montoya → Íñigo Montoya Fernández
Role 45% · Software Engineer → CTO

Decision: Flagged for review
Rationale: Strong name + company match. Role mismatch (Software Engineer → CTO) may reflect recent promotion or outdated GitHub bio. Flagged for human review.

An evidence span: not just “0.78 confidence” but “here are the candidates, here’s what matched, here’s what didn’t, here’s the decision.” When the next wrong match happens, you can pull the thread.
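
The record itself doesn’t need to be complicated. Here’s a minimal sketch in Python of what an evidence-span record could look like; the class names and the review rule are illustrative assumptions, not the pipeline’s actual code:

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceSpan:
    dimension: str        # "company", "name", "role"
    score: float          # 0.0 to 1.0
    source_value: str     # what the GitHub side said
    candidate_value: str  # what the LinkedIn side said

@dataclass
class MatchDecision:
    candidate: str
    verdict: str          # "match", "rejected", or "flagged"
    evidence: list[EvidenceSpan] = field(default_factory=list)
    rationale: str = ""

def decide(candidate: str, evidence: list[EvidenceSpan]) -> MatchDecision:
    # Illustrative rule: strong name + company agreement is a match,
    # but any strongly conflicting dimension routes it to a human.
    strong = all(s.score >= 0.8 for s in evidence
                 if s.dimension in ("name", "company"))
    conflicts = [s for s in evidence if s.score < 0.5]
    if strong and conflicts:
        detail = "; ".join(f"{s.dimension}: {s.source_value} → {s.candidate_value}"
                           for s in conflicts)
        return MatchDecision(candidate, "flagged", evidence,
                             f"Strong name + company match, but conflicts ({detail})")
    verdict = "match" if strong else "rejected"
    return MatchDecision(candidate, verdict, evidence)

spans = [
    EvidenceSpan("company", 0.95, "Guilder Inc", "Guilder Inc"),
    EvidenceSpan("name", 0.85, "Íñigo Montoya", "Íñigo Montoya Fernández"),
    EvidenceSpan("role", 0.45, "Software Engineer", "CTO"),
]
decision = decide("Íñigo Montoya Fernández", spans)
print(decision.verdict)    # "flagged"
print(decision.rationale)  # the conflict, spelled out, attached to the decision
```

Nothing about the rule is clever. The point is that every verdict carries the spans that produced it, so the next wrong match arrives with its own forensics.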

The founder asked me later: “How do we know the good matches are actually good?”

You can’t trust the hits if you can’t explain the misses.

Auditability is the difference between knowing your system works and hoping it does.

I know this isn’t the first thing you’ll build.

There’s a difference between proof it works and proof it scales. The gap between them is the trail you can’t show yet.