AI Observability's Dirty Secret
The blind spot: Perfect metrics, frustrated customers
Last week I was on a call with a VP Engineering at a B2B SaaS company. She had five observability tools running. LangFuse for tracing. Arize for model monitoring. A data warehouse. Custom dashboards. The works.
I asked her one question: “Which of your customers are frustrated with your AI?”
She couldn’t answer.
Not because she wasn’t technical. Not because her tools were broken. But because every single tool she had was built to answer the wrong question.
Here’s the thing nobody in AI observability wants to admit: the entire market is optimizing for metrics that don’t matter to the people who actually decide whether your AI product succeeds or fails.
Everyone’s Building the Same Thing
Open any AI observability tool right now. LangFuse, Arize, Raindrop, Keywords AI. They’re all focused on the same metrics:
Error rates
Latency
Token costs
Model performance
Prompt versions
These are engineering metrics. They tell you whether your system is running. They don’t tell you whether your customers are happy.
The VP I talked to could tell me her average latency down to the millisecond. She knew exactly how much each API call cost. She had alerts set up for every possible error condition.
But when I asked which of her top 20 enterprise accounts had a bad experience with their AI assistant yesterday, she had no idea. She’d find out when they stopped using it. Or worse, when they didn’t renew.
The Gap Nobody Talks About
You can have 99.9% uptime, sub-100ms latency, and perfect error handling. Your dashboards can be green across the board. And your biggest customer can still be quietly shopping your competitor because your AI keeps giving them useless responses.
Zero errors logged. Perfect uptime. Customer gone.
This is the blind spot.
I’ve had this conversation 50+ times in the last few months. Same pattern every time. Teams shipping AI to production. Spending thousands on observability. Flying completely blind on the one thing that matters: customer experience.
One founder told me: “We only know if customers don’t like our AI when they stop using it.” Another said: “There are inexplicable differences in our agent’s behavior and no way to know which customers were affected.”
These aren’t small companies with no budget. These are well-funded startups and enterprise teams with engineering resources. They’re doing everything “right” according to the current playbook. And they still can’t answer basic questions about customer experience.
Why This Happened
The AI observability market evolved from infrastructure monitoring. That’s the DNA.
Wave 1 was AI/ML monitoring. Model drift, data quality, training pipelines. Important stuff if you’re a data scientist.
Wave 2 was LLM observability. Prompts, tokens, latency, error rates. Critical if you’re an engineer debugging production issues.
Wave 3 should be customer experience. But nobody’s building it because everyone’s following the same playbook that worked for traditional observability.
The problem is: AI isn’t just infrastructure. It’s customer-facing. The thing that breaks isn’t always a system error. Sometimes it’s a perfectly functioning system giving a terrible answer. No error to log. No latency spike. Just a frustrated customer.
Engineering tools can’t see this. They weren’t designed to.
What Teams Actually Need
Here are the questions product teams ask me:
“Which customers are frustrated with our AI?”
“Which of our top 10 accounts had a bad experience yesterday?”
“What are customers actually asking our AI to do?”
“Is this issue affecting our enterprise buyers or just free users?”
Current observability tools can’t answer any of these. Not because they’re bad tools. They’re excellent at what they were designed to do. They just weren’t designed for this.
They don’t map interactions to customers. They don’t roll up to accounts. They don’t understand sentiment. They can’t tell you that 40% of your users at Microsoft (your biggest customer) got irrelevant responses last week.
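To make the gap concrete, here is roughly the shape of the record a trace-level tool hands you. This is purely illustrative; the field names are my assumptions, not any real tool's schema:

```python
# Purely illustrative: the shape of a typical engineering-level trace record.
# Field names are assumptions, not taken from any specific tool.
trace_event = {
    "trace_id": "tr_8f3a",
    "model": "gpt-4o",
    "latency_ms": 80,
    "input_tokens": 512,
    "output_tokens": 203,
    "cost_usd": 0.02,
    "status": "success",   # nothing errored, nothing to alert on
    # Missing entirely: customer_id, account_id, sentiment, what the user
    # actually asked, and whether the answer helped. None of the product
    # questions above can be answered from this record.
}
```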
Picture this scenario:
Your AI chatbot metrics look perfect. 99.5% success rate. 80ms average latency. Two cents per interaction. Your engineering team is happy. Your dashboards are green.
Meanwhile, half the users at your largest enterprise customer spent 15 minutes yesterday trying to get your AI to do something basic. It kept misunderstanding them. Gave generic responses. Made them repeat themselves.
They didn’t file a bug report. No errors were logged. But they walked away thinking your product doesn’t work. And now they’re in a meeting with your competitor.
This happens every day. Teams don’t find out until the customer stops using the feature. Or leaves entirely.
What’s Actually Needed
The solution isn’t better engineering metrics. It’s a different category entirely.
Customer-centric observability. Every AI interaction mapped to a customer. Rolled up by account. Sentiment tagged. Searchable by what customers are actually asking.
Product teams need to self-serve this data. No SQL. No waiting on engineering to pull reports. Visual, action-oriented insights they can use to fix problems before customers leave.
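For contrast with the trace record earlier, here is a minimal sketch of what a customer-mapped interaction and an account-level roll-up could look like. Every field name, sentiment label, and threshold below is an illustrative assumption, not a spec for any existing product:

```python
# Illustrative sketch only: a hypothetical customer-centric interaction
# record and an account-level roll-up. Not any particular tool's schema.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class AIInteraction:
    interaction_id: str
    customer_id: str   # the individual user
    account_id: str    # the company they belong to
    query: str         # what the customer actually asked
    response: str
    sentiment: str     # e.g. "positive" | "neutral" | "frustrated"
    resolved: bool     # did the AI actually help?

def accounts_at_risk(interactions: list[AIInteraction],
                     threshold: float = 0.3) -> dict[str, float]:
    """Return accounts where the share of frustrated or unresolved
    interactions meets or exceeds the threshold."""
    totals: dict[str, int] = defaultdict(int)
    bad: dict[str, int] = defaultdict(int)
    for event in interactions:
        totals[event.account_id] += 1
        if event.sentiment == "frustrated" or not event.resolved:
            bad[event.account_id] += 1
    return {
        account: bad[account] / totals[account]
        for account in totals
        if bad[account] / totals[account] >= threshold
    }
```

The point isn't the specific code. It's that a question like "which accounts are at risk?" becomes a query over data you already collect, instead of something you only learn at renewal time.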
This doesn’t replace LLM observability. It complements it. Engineering still needs error rates and latency. But product teams need different answers. They need to know which customers are struggling. Which accounts are at risk. What’s working and what’s not from a customer perspective.
Both layers matter. Right now we only have one.
Why This Matters Now
Two years ago, most AI features were experiments. Nice to have. Teams could afford to be blind because the stakes were low.
That’s changing fast. AI is moving to production. Real customers. Real revenue. Real churn risk.
Enterprise buyers are starting to ask: “How do you monitor AI quality?” They want proof you can catch problems before they impact users. “We track error rates” doesn’t cut it anymore.
The question is shifting from “does your AI work?” to “do customers like your AI?”
In three years, asking “which customers are frustrated?” will seem as basic as asking “is the service up?” Right now, most teams can’t answer it.
That’s the gap. That’s the opportunity. And that’s what we’re building at Brixo.
More on that next week.
If you’re building AI observability tools and think I’m wrong, let me know. If you’re shipping AI and nodding along, reply. I want to hear your story.


