Hands On AI Agent Mastery Course

Advanced Architectures for Vertical AI Agents

Lesson 69: Agent Observability & Logging

May 20, 2026

Highlights

What we build

  • An ObservableAgent wrapping Gemini 2.0 Flash where every execution step (prompt construction, LLM call, post-processing) is a named OpenTelemetry (OTEL) span with semantic attributes

  • A SQLiteSpanExporter that persists spans to a WAL-mode SQLite database, replacing the need for an external collector during development

  • A metric_snapshots table recording per-trace latency and cost with configurable thresholds — the direct feed for L70’s alerting system

  • A FastAPI service exposing trace list, span waterfall, and percentile metrics over REST and WebSocket

  • A React dashboard with a Jaeger-style span waterfall, real-time cost tracker, and p50/p95/p99 latency display
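
As a rough illustration of the first two items, the sketch below records one span per execution step and persists it to a WAL-mode SQLite file. To stay dependency-free it uses a hand-rolled span record rather than the OpenTelemetry SDK, so it is not the lesson's ObservableAgent or SQLiteSpanExporter. The table layout, the helper names (open_span_db, span, run_agent) and the stubbed LLM callable are all assumptions made for the example.

```python
import json
import sqlite3
import time
import uuid
from contextlib import contextmanager


def open_span_db(path):
    """Open (or create) a span database and switch it to WAL mode."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    # Hypothetical schema: the lesson's real column names are not shown.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS spans (
               span_id    TEXT PRIMARY KEY,
               trace_id   TEXT NOT NULL,
               name       TEXT NOT NULL,
               start_ns   INTEGER NOT NULL,
               end_ns     INTEGER NOT NULL,
               attributes TEXT NOT NULL
           )"""
    )
    return conn


@contextmanager
def span(conn, trace_id, name, **attributes):
    """Time a block of work and persist it as one named span row."""
    attrs = dict(attributes)
    start_ns = time.time_ns()
    try:
        yield attrs  # the caller may add attributes while the span is open
    finally:
        # Written in `finally` so failed steps are recorded too.
        conn.execute(
            "INSERT INTO spans VALUES (?, ?, ?, ?, ?, ?)",
            (uuid.uuid4().hex, trace_id, name, start_ns,
             time.time_ns(), json.dumps(attrs)),
        )
        conn.commit()


def run_agent(conn, user_input, llm=None):
    """Run the three steps named above, one span per step."""
    llm = llm or (lambda prompt: "stub reply")  # stand-in for the model call
    trace_id = uuid.uuid4().hex
    with span(conn, trace_id, "prompt_construction") as attrs:
        prompt = f"Answer concisely: {user_input}"
        attrs["prompt.chars"] = len(prompt)
    with span(conn, trace_id, "llm_call", model="stub-model") as attrs:
        reply = llm(prompt)
        attrs["reply.chars"] = len(reply)
    with span(conn, trace_id, "post_processing"):
        reply = reply.strip()
    return trace_id, reply
```

Calling run_agent(open_span_db("agent.db"), "hello") leaves three rows in spans that share one trace_id, which is the raw material a waterfall view would need.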

Connection to L68
L68 introduced the drift_snapshots table and feature snapshot API. L69 extends the same SQLite WAL database — adding traces, spans, and metric_snapshots tables — so a single query can correlate agent performance degradation (drift signals from L68) with the specific LLM calls that caused it (spans from L69).
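
The kind of correlation query described here might look like the following sketch. Only the table names drift_snapshots and spans come from the lessons. Every column name, the time-window join condition, and the helper spans_in_drift_windows are assumptions chosen to make the example runnable against an in-memory database.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with hypothetical columns for both tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE drift_snapshots (
            feature TEXT, drift_score REAL,
            window_start INTEGER, window_end INTEGER
        );
        CREATE TABLE spans (
            trace_id TEXT, name TEXT, start_ts INTEGER, duration_ms REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO drift_snapshots VALUES (?, ?, ?, ?)",
        [("prompt_length", 0.42, 1000, 2000),
         ("prompt_length", 0.05, 2000, 3000)],
    )
    conn.executemany(
        "INSERT INTO spans VALUES (?, ?, ?, ?)",
        [("t1", "llm_call", 1500, 900.0),
         ("t1", "post_processing", 1600, 5.0),
         ("t2", "llm_call", 2500, 300.0)],
    )
    return conn


def spans_in_drift_windows(conn, min_score):
    """Return spans that started inside a window whose drift score exceeds min_score."""
    return conn.execute(
        """
        SELECT d.feature, d.drift_score, s.trace_id, s.name, s.duration_ms
        FROM drift_snapshots AS d
        JOIN spans AS s
          ON s.start_ts >= d.window_start AND s.start_ts < d.window_end
        WHERE d.drift_score > ?
        ORDER BY s.duration_ms DESC
        """,
        (min_score,),
    ).fetchall()
```

With the demo data, spans_in_drift_windows(make_demo_db(), 0.2) returns only the two spans from trace t1, since t2 ran during a low-drift window.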

Enables L70
The metric_snapshots table carries threshold, breached, and trace_id columns. L70’s alerting engine will query this table, match breached rows to span attributes, and fire notifications without any new instrumentation work.
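
A minimal sketch of how such a table could be written and read is shown below. The threshold, breached, and trace_id columns are named in the text. The metric and value columns, and the helpers create_metric_table, record_metric, and breached_rows, are assumptions added for illustration.

```python
import sqlite3


def create_metric_table(conn):
    """Create metric_snapshots; `metric` and `value` are assumed columns."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS metric_snapshots (
               trace_id  TEXT NOT NULL,
               metric    TEXT NOT NULL,
               value     REAL NOT NULL,
               threshold REAL NOT NULL,
               breached  INTEGER NOT NULL
           )"""
    )


def record_metric(conn, trace_id, metric, value, threshold):
    """Store one snapshot and flag it when the value exceeds its threshold."""
    breached = int(value > threshold)
    conn.execute(
        "INSERT INTO metric_snapshots VALUES (?, ?, ?, ?, ?)",
        (trace_id, metric, value, threshold, breached),
    )
    conn.commit()
    return bool(breached)


def breached_rows(conn):
    """Rows an alerting job could pick up without further instrumentation."""
    return conn.execute(
        "SELECT trace_id, metric, value, threshold "
        "FROM metric_snapshots WHERE breached = 1 ORDER BY rowid"
    ).fetchall()
```

Computing breached at write time keeps the reader side to a single indexed-friendly filter, at the price of having to rewrite rows if a threshold is later changed.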


Architecture Context

Place in the 90-lesson path
Lessons 61–68 built the MLOps control plane: CI/CD, model versioning, feature stores, continuous training, and drift detection. L69 closes the observability loop by making individual agent invocations inspectable. Without trace-level visibility, the drift signals from L68 are actionable only in aggregate: you know something degraded, but not which prompts or spans were responsible, or what it cost.

Module 5 alignment
The module requires production-grade operability: the ability to diagnose problems quickly enough to meet enterprise SLAs. OTEL-based tracing is the industry-standard mechanism for this in distributed systems, and L69 adapts its primitives for the unique shape of LLM workloads (variable latency, token-denominated cost, prompt sensitivity).
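
Two of these LLM-specific quantities reduce to simple arithmetic, sketched below: a nearest-rank percentile over recorded latencies, and a cost derived from token counts. The function names are hypothetical, and the per-million-token rates in the usage note are placeholders, not any provider's actual pricing.

```python
def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    if not 0 < p <= 100:
        raise ValueError("p must be an integer in (0, 100]")
    ordered = sorted(values)
    rank = -(-p * len(ordered) // 100)  # integer ceiling of p/100 * n
    return ordered[rank - 1]


def request_cost_usd(tokens_in, tokens_out, usd_per_m_in, usd_per_m_out):
    """Token-denominated cost, with rates quoted per one million tokens."""
    return (tokens_in * usd_per_m_in + tokens_out * usd_per_m_out) / 1_000_000


def latency_summary(latencies_ms):
    """The p50/p95/p99 triple a dashboard would display."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

For ten latencies of 100, 200, ..., 1000 ms, latency_summary gives p50 = 500 and p95 = p99 = 1000, which shows why tail percentiles need a reasonably large sample before they say anything useful. With placeholder rates of 0.10 and 0.40 USD per million tokens, a request with 1000 input and 500 output tokens costs about 0.0003 USD.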


© 2026 Systemdr, Inc.