Building Observability That Actually Works
A practical guide to turning AI conversations into business outcomes.
In my previous post I suggested that when it comes to AI, observability isn’t a tool you install. It’s a process you build. Most teams stop at data collection and wonder why they can’t answer basic questions about their AI products (or more specifically, about the customers who use them).
This post is about what to do instead.
The goal is simple: create a system where customer signals become organizational action fast enough to matter. Before frustrated customers churn. Before upsell opportunities go cold. Before product teams waste cycles building the wrong features.
Here’s how to get there.
Start With the Questions, Not the Tools
Most teams approach observability backwards. They install tools, collect data, and then ask “what can we learn from this?”
Flip it around. Start with the questions your organization needs to answer:
Which customers need intervention right now?
What use cases are succeeding and failing?
Where should product focus next?
Which accounts represent upsell opportunities?
What’s causing support volume?
These questions should drive everything: what data you collect, how you classify it, and where insights need to flow.
If you can’t trace a direct line from a piece of data to a business decision, you probably don’t need it. And if you need to answer a question but don’t have the data, that’s your gap. (Tiny shout-out: I happen to know a product focused on exactly this!)
The Three Stages, Implemented
Stage 1: Collect (The Foundation)
Collection is the easy part, which is why most teams stop here.
The minimum viable collection layer captures:
Full conversation transcripts (not just summaries)
User and account identifiers
Timestamps and session metadata
Any context the AI had access to (retrieved documents, user history, etc.)
The key decision at this stage: store raw data, not just processed outputs. You don’t know yet what patterns you’ll need to find. Teams that only store aggregated metrics or AI-generated summaries lose the ability to go back and ask new questions.
Storage is cheap. Regret is expensive.
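To make that concrete, here’s a minimal sketch of what a raw conversation record might look like, assuming a Python service writing to simple append-only storage. The field names and the store_conversation helper are illustrative, not any specific vendor’s schema:

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json
import uuid


@dataclass
class ConversationRecord:
    """One raw conversation, stored in full so new questions can be asked later."""
    account_id: str
    user_id: str
    messages: list            # full transcript: [{"role": ..., "content": ..., "ts": ...}, ...]
    retrieved_context: list   # documents, user history, or other context the AI saw, verbatim
    session_metadata: dict    # channel, product surface, model version, etc.
    conversation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    captured_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


def store_conversation(record: ConversationRecord, path: str = "conversations.jsonl") -> None:
    """Append the raw record to a JSONL file; swap in object storage or a warehouse later."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```

The exact schema matters far less than the principle: the full transcript and the context the AI saw are captured verbatim, so you can come back later and ask questions you haven’t thought of yet.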
Where teams go wrong: Over-engineering the collection layer before understanding what they need. You don’t need a perfect data architecture on day one. You need raw conversations flowing somewhere accessible. Iterate from there.
Stage 2: Classify (Where Value Gets Created)
Classification turns raw conversations into structured, queryable insight. This is the stage most teams skip entirely, and it’s where the real leverage lives.
Classification answers three core questions about every conversation:
What was the customer trying to do? (Intent) This isn’t about topic modeling. It’s about understanding the job the customer was trying to accomplish. “Asking about billing” is a topic. “Trying to understand why they were charged twice” is an intent. The difference matters.
Did they succeed? (Outcome) Not “did the AI respond” but “did the customer get what they needed.” This requires inference. A customer who asks the same question three different ways and then leaves didn’t succeed, even if every response was technically accurate.
How did they feel about it? (Sentiment) Sentiment over the arc of a conversation, not just individual messages. A conversation that starts neutral and ends frustrated tells a different story than one that starts frustrated and ends resolved.
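In practice, classification means attaching a small structured label to every conversation. Here’s a minimal sketch of that schema in Python; the enum values are illustrative, and yours should map to the business questions you started with:

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    RESOLVED = "resolved"        # the customer got what they needed
    UNRESOLVED = "unresolved"    # they left without it, even if every reply was "correct"
    ESCALATED = "escalated"      # handed off to a human or another channel
    ABANDONED = "abandoned"      # dropped off mid-conversation


class SentimentArc(Enum):
    IMPROVED = "improved"        # e.g. started frustrated, ended resolved
    STABLE = "stable"
    DECLINED = "declined"        # e.g. started neutral, ended frustrated


@dataclass
class ConversationLabel:
    conversation_id: str
    intent: str                  # the job to be done, e.g. "understand a duplicate charge"
    outcome: Outcome
    sentiment_arc: SentimentArc
    evidence: str = ""           # a short quote or note explaining the call, useful for audits
```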
Implementation options:
Manual classification works for low volume. Have someone review conversations daily and tag them. This is slow but builds intuition about what patterns matter.
Rule-based classification works for known patterns. If a user sends more than three messages on the same topic, flag it as a potential failure. If certain keywords appear, tag the intent. This is brittle but fast to implement.
AI-powered classification scales. Run an agent over your conversations to extract intent, infer outcome, and track sentiment. This is where most teams should end up, but starting with manual classification helps you understand what “good” looks like before automating.
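To make the AI-powered option concrete, here’s a rough sketch that reuses the ConversationRecord and ConversationLabel types from the sketches above. The call_llm function is a stand-in for whichever model provider you use, and the prompt and JSON contract are assumptions, not a prescribed format:

```python
import json

CLASSIFY_PROMPT = """You are labeling a conversation between a customer and an AI assistant.
Return JSON with exactly these keys:
  "intent": the job the customer was trying to accomplish, in their own terms
  "outcome": one of "resolved", "unresolved", "escalated", "abandoned"
  "sentiment_arc": one of "improved", "stable", "declined"
  "evidence": a short quote supporting your judgment

Transcript:
{transcript}
"""


def call_llm(prompt: str) -> str:
    """Placeholder for your model provider's API; returns the model's raw text response."""
    raise NotImplementedError("wire this up to your LLM provider")


def classify_conversation(record: ConversationRecord) -> ConversationLabel:
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in record.messages)

    # Cheap rule-based pre-check: a user repeating the exact same message is a strong failure signal.
    user_msgs = [m["content"] for m in record.messages if m["role"] == "user"]
    likely_failure = len(user_msgs) >= 3 and len(set(user_msgs)) < len(user_msgs)

    # In production you would validate the JSON and retry on malformed output.
    data = json.loads(call_llm(CLASSIFY_PROMPT.format(transcript=transcript)))

    return ConversationLabel(
        conversation_id=record.conversation_id,
        intent=data["intent"],
        outcome=Outcome("unresolved") if likely_failure else Outcome(data["outcome"]),
        sentiment_arc=SentimentArc(data["sentiment_arc"]),
        evidence=data["evidence"],
    )
```

Note the rule-based pre-check folded in: a customer who sends several messages and repeats themselves verbatim almost certainly didn’t succeed, so it’s worth catching that even when the model says otherwise.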
Where teams go wrong: Trying to classify everything. Start with the classifications that map directly to business questions. You can always add more dimensions later.
Stage 3: Activate (Where Results Happen)
Activation is the process of routing insights to the people who can act on them. This is where observability becomes organizational, not just technical.
The core principle: insights should flow to the right people automatically, without manual handoffs.
For product teams: Surface patterns in aggregate. What intents are failing most often? What use cases drive the most frustration? What questions does the AI struggle to answer?
This should live in a dashboard the product team can access directly, updated continuously. No waiting for engineering to pull reports.
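As a sketch of what “patterns in aggregate” can look like, here’s a simple rollup over the classified labels from the previous stage. In production this would likely be a warehouse query feeding the dashboard; the minimum-volume threshold here is a placeholder:

```python
from collections import Counter


def failing_intents(labels: list[ConversationLabel], top_n: int = 10) -> list[tuple[str, float, int]]:
    """Rank intents by failure rate (share of non-resolved conversations), with volume for context."""
    totals: Counter = Counter()
    failures: Counter = Counter()
    for label in labels:
        totals[label.intent] += 1
        if label.outcome is not Outcome.RESOLVED:
            failures[label.intent] += 1

    ranked = [
        (intent, failures[intent] / totals[intent], totals[intent])
        for intent in totals
        if totals[intent] >= 5   # skip intents too rare to be meaningful
    ]
    return sorted(ranked, key=lambda row: row[1], reverse=True)[:top_n]
```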
For customer success: Surface account-level signals. Which accounts are showing declining sentiment? Which have users struggling with the same issues repeatedly? Which haven’t engaged with the AI at all (adoption risk)?
This should integrate with existing CS workflows. Alerts in Slack, flags in the CRM, or a dedicated queue of accounts needing attention.
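A daily digest of struggling accounts can be very small. Here’s a sketch using a Slack incoming webhook, continuing with the label schema from the classification stage; the webhook URL and the flagging threshold are assumptions you’d tune to your own volume:

```python
import requests  # third-party HTTP client: pip install requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # incoming webhook for the CS channel


def accounts_needing_attention(labels_by_account: dict[str, list[ConversationLabel]],
                               min_declining: int = 3) -> dict[str, int]:
    """Accounts where several recent conversations ended worse than they started."""
    flagged = {}
    for account_id, labels in labels_by_account.items():
        declining = sum(1 for label in labels if label.sentiment_arc is SentimentArc.DECLINED)
        if declining >= min_declining:
            flagged[account_id] = declining
    return flagged


def post_daily_digest(flagged: dict[str, int]) -> None:
    """Post one message per day to the CS channel; no message means no accounts were flagged."""
    if not flagged:
        return
    lines = [f"- {account}: {count} conversations with declining sentiment"
             for account, count in sorted(flagged.items(), key=lambda kv: -kv[1])]
    requests.post(SLACK_WEBHOOK_URL,
                  json={"text": "Accounts needing attention today:\n" + "\n".join(lines)})
```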
For sales: Surface opportunity signals. Which accounts are asking about features they don’t have? Which are hitting usage limits? Which are exploring use cases adjacent to what they bought?
Same principle: insights should arrive where sales already works, not in a separate tool they have to remember to check.
For support: Surface incoming issues before they become tickets. If ten users at an account are struggling with the same thing, support should know before the first ticket arrives.
Where teams go wrong: Building dashboards nobody checks. Activation only works if insights arrive in the flow of work. A beautiful dashboard that requires a separate login and a reminder to check is a dashboard that won’t get checked.
Organizational Design
The “who owns this” question trips up most teams. Here’s a model that works:
Engineering owns collection. The infrastructure that captures and stores conversation data is a technical problem. Engineering should build it, maintain it, and ensure reliability.
Product owns classification. Defining what intents matter, what success looks like, and what sentiment signals are meaningful requires domain expertise. Product should own the classification schema and iterate on it as the product evolves.
Each function owns their activation. Customer success owns how insights flow into CS workflows. Sales owns how opportunities get surfaced. This isn’t a handoff from a central team. It’s each function defining what they need and ensuring they get it.
Someone owns the process. This is the piece most teams miss. Someone needs to own the end-to-end observability process: ensuring collection feeds classification, classification feeds activation, and the whole system is actually delivering business outcomes. This might be a product manager, a dedicated role, or an operations function. But someone has to own it.
Without clear ownership of the process, you’ll end up with a collection layer that doesn’t capture what classification needs, classifications that don’t map to what functions want, and activation that nobody trusts.
Measuring Success
How do you know if your observability process is working?
Time to insight: How long does it take to answer “which customers are struggling right now”? If the answer is days or weeks, the process is failing. If it’s minutes, you’re in good shape.
Insight to action latency: When a signal surfaces (frustrated account, upsell opportunity, product gap), how long until someone acts on it? Measure this. Shorten it.
Coverage: What percentage of conversations are classified? What percentage of accounts have health signals? Gaps in coverage are blind spots.
Outcome correlation: Do the signals predict outcomes? Do accounts flagged as “at risk” actually churn more often? Do “upsell opportunities” actually convert? If signals don’t correlate with outcomes, your classification is wrong.
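The last two checks are easy to compute once labels and flags exist. Here’s a rough sketch, assuming churn data is available from billing or the CRM:

```python
def coverage(total_conversations: int, classified_conversations: int) -> float:
    """Share of conversations that actually received a classification."""
    return classified_conversations / total_conversations if total_conversations else 0.0


def churn_lift(flagged: set[str], all_accounts: set[str], churned: set[str]) -> float:
    """How much more often do flagged accounts churn than the base rate?
    A lift near 1.0 means the 'at risk' signal is not predicting anything."""
    if not flagged or not all_accounts:
        return 0.0
    base_rate = len(churned & all_accounts) / len(all_accounts)
    flagged_rate = len(churned & flagged) / len(flagged)
    return flagged_rate / base_rate if base_rate else float("inf")
```

A lift close to 1.0 is your cue to revisit the classification, not to build another dashboard.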
The Minimum Viable Version
If this feels like a lot, here’s where to start:
Collect everything. Get full conversations flowing to storage. Don’t worry about perfect structure yet.
Classify manually for two weeks. Have someone review 20 conversations a day and tag intent, outcome, and sentiment. Build intuition about what patterns matter.
Pick one activation use case. Maybe it’s a daily Slack digest of frustrated customers for CS. Maybe it’s a weekly product report on failing use cases. Start with one loop that delivers value.
Automate classification. Once you know what “good” looks like, build or buy automated classification.
Expand activation. Add more functions, more signals, more integrations.
The goal isn’t to build the perfect system on day one. It’s to build a system that delivers value quickly and can evolve as you learn.
The Payoff
Teams that build real observability processes (not just observability tools) operate differently.
Product roadmaps are informed by actual customer struggles, not assumptions. Customer success intervenes before accounts go dark. Sales finds opportunities that would have been invisible. Support gets ahead of issues instead of reacting to them.
And most importantly: customers get better experiences because problems get fixed before frustration compounds.
That’s what observability should deliver. Not dashboards. Outcomes.