I was recently consulting with a founder building a lead enrichment pipeline. VC-funded, small team, new technology. The kind of company where everyone wears three hats and ships fast. That’s the value of startups: moving fast when bigger companies can’t.
The pipeline pulled GitHub profiles and matched them to LinkedIn accounts, cross-referencing company pages along the way. The model returned 0.78 confidence on a match.
The model gave us the wrong person. Same company. Same first name. Different person.
We needed to understand why.
The pushback surprised me. Not because they disputed the wrong match — they couldn’t — but because of what fixing it implied.
“Let’s rebuild the system. We will do better.”
“We need to hire ML researchers.”
“We just need to keep iterating.”
But none of that would help, because we didn’t know what guided the model to those predictions. Were we feeding it the wrong context? We couldn’t tell. And meanwhile, we still needed the pipeline to be reliable and fast.
But here’s what I kept coming back to:
Could anyone explain why this match was supposedly correct?
Not “the model said 0.78.” Why 0.78? What did it weigh? What did it consider and reject? What evidence supported the decision?
Nobody could answer that, because the system wasn’t built to answer it. It was built to hand us a number, and that made it easy to do the wrong thing with confidence.
What was really being resisted, I think, wasn’t the idea of slowing down. It was the idea of confronting a harder question:
Can we measure and explain why our pipeline is predicting these results?
That question is really uncomfortable.
It implies the current system might be confidently wrong in ways we can’t see.
It suggests that shipping faster won’t fix a visibility problem.
It means admitting that “0.78 confidence” is a number, not an explanation.
This isn’t a new problem.
Finance Solved This Decades Ago
Remember Enron? In 2001, a $60 billion company collapsed because accounting firms policed themselves. They earned more by selling advice to clients than by auditing them. The conflicts of interest allowed enormous deceptions to go unnoticed. Over 50,000 people lost their jobs.
The response wasn’t “we need better accountants.” It was Sarbanes-Oxley — a framework that said: every financial decision must leave a trail. It passed 423-3 in the House and 99-0 in the Senate.
Every transaction has a paper trail. Even when I Venmo my buddy Brett $23.44 for pizza in Brooklyn, there’s a record: timestamp, amount, who paid whom. Corporate finance just does this at scale, with every journal entry traceable to source documents. If the numbers don’t match, you pull the thread until you find what broke.
As Sherron Watkins and Cynthia Cooper, the whistleblowers who exposed Enron and WorldCom, wrote in the New York Times earlier this year, “systemic risks metastasize in regulatory gaps.”
“The silent, immeasurable value of well-designed safeguards lies in the scandals they prevent from happening.”
The principle is the same for AI. When something’s wrong, you can find it. When something’s right, you can prove it.
We Build Dashboards Instead
We talk about Evals and Observability like they solve the problem. And they don’t.
Evals tell you the system is 85% accurate. They don’t tell you why any individual decision was made. Observability shows you costs and latency. It doesn’t show you the chain of reasoning that led to a given prediction.
We’ve built systems that know they’re 85% accurate, but can’t explain the other 15%.
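To make that gap concrete, here’s a minimal sketch of what those two kinds of tooling typically capture. Every field name and number below is hypothetical, not pulled from this pipeline:

```python
# Hypothetical records, illustrating the shape of typical tooling output.
# Every name and number below is made up for this sketch.

eval_report = {
    "dataset": "matches_labeled.jsonl",  # an offline eval set
    "accuracy": 0.85,                    # aggregate quality over the whole set
    "n_examples": 1200,
}

observability_event = {
    "request_id": "req-4821",
    "model": "matcher-v2",
    "latency_ms": 840,    # how long the call took
    "cost_usd": 0.004,    # what it cost
}

# Neither record can answer the question that matters for a single lead:
# why did req-4821 match this GitHub profile to that LinkedIn account,
# and what evidence pushed the score to 0.78?
```

Both are worth having. Neither is an explanation.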
I see this in startups all the time. The pressure to ship is constant. Asking “can we explain our predictions?” feels like a luxury nobody can afford. The roadmap is full. Everyone’s plate is full.
Hard to Argue With That
In a certain light, “let’s rebuild” and “we need ML researchers” sound like reasonable responses. But really they’re ways to stay busy without confronting the harder question.
Keep busy. Show progress. Hope the problems shake out before the next board meeting.
And I get it. Admitting your system is fundamentally broken, despite the impressive demo, despite the confidence scores, is as hard as admitting that you and the person you love can’t build a life together. The chemistry was real. But when the rubber met the road, you couldn’t explain why it wasn’t working.
So we tried something different.
What Would Have Helped
After the wrong match, we rebuilt the pipeline to log evidence spans for every decision. Think of them as a record of decision checkpoints: what the model saw, what it considered, and why it chose what it chose.
For that same profile, the new system logged (names anonymized obvi):
Lead Enrichment Pipeline
GitHub → LinkedIn Match · Traced Decision
GitHub bio: “Software Engineer. Building at Guilder Inc. Passionate about revenge and sword fighting.”
Top candidate (of 4 considered):
Company: Guilder Inc (GitHub) → Guilder Inc (LinkedIn) · match
Name: Íñigo Montoya (GitHub) → Íñigo Montoya Fernández (LinkedIn) · partial match
Title: Software Engineer (GitHub) → CTO (LinkedIn) · mismatch
That’s the difference. Not just “0.78 confidence” but “here are the 4 candidates, here’s what matched, here’s what didn’t, here’s the decision.” When the next wrong match happens, you can pull the thread.
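For what it’s worth, here’s a minimal sketch of what logging those evidence spans could look like, assuming a Python pipeline. The class names, fields, and explain() helper are mine, chosen for illustration, not the system’s actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceSpan:
    """One piece of evidence compared for one candidate."""
    field_name: str        # e.g. "company", "name", "title"
    github_value: str
    linkedin_value: str
    verdict: str           # "match", "partial", or "mismatch"

@dataclass
class MatchDecision:
    """A traced decision: every candidate considered, and the evidence behind the pick."""
    github_profile: str
    candidates: dict[str, list[EvidenceSpan]] = field(default_factory=dict)
    chosen: str | None = None
    score: float | None = None

    def explain(self) -> str:
        """Render the trail as text, so a human can pull the thread later."""
        lines = [f"GitHub profile: {self.github_profile}"]
        for name, spans in self.candidates.items():
            lines.append(f"  Candidate: {name}")
            lines.extend(
                f"    {s.field_name}: {s.github_value!r} vs {s.linkedin_value!r} -> {s.verdict}"
                for s in spans
            )
        lines.append(f"Decision: {self.chosen} (score={self.score})")
        return "\n".join(lines)

# Usage, mirroring the anonymized trace above (one of the four candidates shown):
decision = MatchDecision(github_profile="Íñigo Montoya")
decision.candidates["Íñigo Montoya Fernández"] = [
    EvidenceSpan("company", "Guilder Inc", "Guilder Inc", "match"),
    EvidenceSpan("name", "Íñigo Montoya", "Íñigo Montoya Fernández", "partial"),
    EvidenceSpan("title", "Software Engineer", "CTO", "mismatch"),
]
decision.chosen = "Íñigo Montoya Fernández"
decision.score = 0.78
print(decision.explain())
```

The specific structure matters less than the design choice: the explanation gets written down at decision time, not reconstructed after something goes wrong.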
The founder asked me later: “How do we know the good matches are actually good?” The truth is that you can’t trust the hits if you can’t explain the misses.
Evidence spans build the trail that lets you trace every decision. So when things go south, you know why.
What would it take to make auditability feel like a feature, not a tax?