If you’re searching LangSmith vs Langfuse vs OpenTelemetry, you’re probably not asking for “better dashboards.” You’re asking a more painful question:
When my agent does something dumb (or expensive), how do I figure out why—and fix it without rerunning the entire workflow?
This post compares LangSmith, Langfuse, and OpenTelemetry (OTel) from the perspective of real agentic workflows: multi-step, tool-heavy, partially stateful, and prone to “it worked yesterday.”
Along the way, we’ll also cover a debugging model we use at nNode.ai: white-box workflows + artifact contracts + checkpoint replay. Tracing helps you find the problem. Artifacts and replay help you repair it.
The 2am questions observability must answer
Agent apps fail in a few repeatable ways. Your observability stack should answer these questions quickly:
- What did the agent see? (prompt, retrieved context, tool outputs)
- What did it decide? (tool selection, intermediate plans, routing decisions)
- What did it do? (tool calls, side effects, writes)
- Where did it diverge from expectations? (schema mismatch, prompt drift, bad retrieval)
- Can I re-run only the broken step? (or do I pay the full cost again?)
That last point is the difference between “debugging” and “operations.”
The 3 layers of agent observability (the model that actually works)
Most teams jump straight to tracing. In practice, you need three layers:
1) Artifacts (contracts): the what
Artifacts are explicit, named outputs of steps:
- Input artifact: what a step consumed
- Output artifact: what a step produced
- Schema: what “correct” looks like
If you can’t point to the exact JSON/Markdown an agent produced (and validate it), you don’t have a debuggable workflow—you have vibes.
2) Traces (timeline): the when and why
Tracing is the event stream:
- spans for LLM calls
- spans for tool calls
- timing, token/cost, errors
Traces are great at showing where time and money went—and which call caused the failure.
3) Metrics (SLOs): the how often
Metrics tell you whether the system is getting worse:
- error rate by step/tool
- p95 latency per workflow
- cost per successful run
- “human review required” rate
If you’re trying to run agent workflows daily, metrics are what keep you from slowly boiling alive.
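A minimal sketch of how those metrics can be emitted with the OpenTelemetry metrics API; the metric names and attributes are our own conventions, not a standard:

```python
from opentelemetry import metrics

# Assumes a MeterProvider/exporter is configured elsewhere; the API calls are the same.
meter = metrics.get_meter("agent-workflows")

step_errors = meter.create_counter(
    "workflow.step.errors", description="Failed steps, tagged by step and tool"
)
run_duration = meter.create_histogram(
    "workflow.run.duration_ms", unit="ms", description="End-to-end run duration"
)
run_cost = meter.create_histogram(
    "workflow.run.cost_usd", unit="usd", description="Cost attributed to successful runs"
)

# Record with attributes so you can slice by workflow/step/tool later.
step_errors.add(1, {"workflow.name": "seo_publisher", "step.name": "render_mdx"})
run_duration.record(48_200, {"workflow.name": "seo_publisher"})
run_cost.record(0.42, {"workflow.name": "seo_publisher", "status": "success"})
```

Your backend computes p95 latency from the histogram, and the "human review required" rate is just another counter divided by total runs.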
LangSmith vs Langfuse vs OpenTelemetry: what each is really for
Here’s the clean mental model:
- OpenTelemetry: the standard plumbing (portable traces across your whole stack)
- LangSmith: a strong developer UX for LangChain/LangGraph-style runs and evaluation workflows
- Langfuse: an open, self-hostable LLM engineering platform built around tracing, prompts, evals, and labeling/feedback loops
But the devil is in the operational details.
LangSmith vs Langfuse vs OpenTelemetry (OTel): decision table
| Dimension | LangSmith | Langfuse | OpenTelemetry (OTel) |
|---|---|---|---|
| Best at | Debugging + evaluation loops for LangChain/LangGraph apps | Open-source LLM observability platform + prompt/eval workflows | End-to-end distributed tracing standard across services |
| Setup time | Fastest if you’re already in LangChain ecosystem | Fast, but you run/operate the platform if self-hosting | Medium: you’ll wire SDKs + collector + backend |
| Vendor lock-in | Medium (improving with OTel ingestion/export) | Lower (OTel-based, open source, self-hostable) | Lowest (standard) |
| Self-hosting | Available (often enterprise-style ops) | First-class | Not a product by itself |
| “Connect to rest of stack” (DB, queues, HTTP, workers) | Possible (OTel support helps), but LangSmith is LLM-centric | Good (also supports OTel and broader tracing concepts) | Best (this is literally the point) |
| Team workflows | Great for sharing a run, annotating it, and evaluating changes | Great for collaboration + labeling + prompt management | Depends on the backend you choose (Jaeger, Honeycomb, Datadog, Grafana, etc.) |
| Default recommendation | Great if you’re deep in LangChain/LangGraph and want immediate run-level debugging | Great if you want open source + self-hosting + broad integrations | Great as the foundation if you’re building serious multi-service systems |
The practical takeaway: OTel is the foundation, and LangSmith/Langfuse are often the “LLM-native UI layer” you add on top.
OpenTelemetry for agentic workflows: the underrated default
If your agent system touches real business systems (Stripe, HubSpot, Postgres, browser automation, queues), you eventually want traces that correlate:
- the user request
- the LLM call(s)
- the tool calls
- the DB writes
- the downstream worker jobs
That’s OpenTelemetry’s home turf.
Minimal OTel span model for agents
Even if you do nothing fancy, get these span types in place:
- workflow.run (root span)
- workflow.step (one per step)
- llm.inference (one per model call)
- tool.execute (one per tool invocation)
In Python (simplified):
```python
from opentelemetry import trace

tracer = trace.get_tracer("agent-workflows")

with tracer.start_as_current_span("workflow.run", attributes={
    "workflow.name": "seo_publisher",
    "run.id": run_id,
    "user.id": user_id,
}):
    with tracer.start_as_current_span("workflow.step", attributes={
        "step.name": "research",
        "step.id": "step_01",
    }):
        # call LLM / tools here
        pass
```
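The inner spans follow the same pattern. Here is a hedged sketch of `llm.inference` and `tool.execute` spans with token and cost attributes attached; `call_model`, `estimate_cost`, and `run_tool` are hypothetical helpers, and the attribute names are our own convention rather than an official one:

```python
# Continues the snippet above (same `tracer`); the helpers below are hypothetical.
with tracer.start_as_current_span("llm.inference", attributes={
    "llm.model": "gpt-4o",  # whatever model you actually call
    "step.id": "step_01",
}) as llm_span:
    response = call_model(prompt)  # hypothetical model call
    llm_span.set_attribute("llm.tokens.input", response.usage.input_tokens)
    llm_span.set_attribute("llm.tokens.output", response.usage.output_tokens)
    llm_span.set_attribute("llm.cost_usd", estimate_cost(response))  # hypothetical cost helper

with tracer.start_as_current_span("tool.execute", attributes={
    "tool.name": "web_search",
    "step.id": "step_01",
}) as tool_span:
    result = run_tool("web_search", query)  # hypothetical tool dispatcher
    tool_span.set_attribute("tool.result.size_bytes", len(result))
```

This is also where "cost per successful run" gets its raw data: sum the cost attribute over the LLM spans in a run.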
Where OTel falls short (by itself)
OTel won’t magically give you:
- prompt playgrounds
- run comparisons
- eval pipelines
- human annotation queues
That’s why teams often pair OTel with an LLM-native platform (LangSmith or Langfuse) or use OTel as the common format for exporting to their preferred backend.
LangSmith: fastest path to “why did the chain do that?”
LangSmith tends to shine when:
- you’re already using LangChain/LangGraph
- you need run-level debugging (inputs/outputs per node)
- you want evaluation loops and regression testing around prompts/chains
It’s also increasingly OTel-friendly, which matters if you don’t want your agent traces isolated from the rest of your system.
Example: send OpenTelemetry traces to LangSmith
```bash
pip install "langsmith[otel]"
export OTEL_EXPORTER_OTLP_ENDPOINT="https://api.smith.langchain.com/otel"
export OTEL_EXPORTER_OTLP_HEADERS="x-api-key=<YOUR_LANGSMITH_API_KEY>"
```
If you’re running a self-hosted deployment, you typically update the endpoint to match your base URL.
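With those environment variables set, a stock OTLP HTTP exporter picks them up automatically, so the span model from earlier flows to LangSmith without code changes. A minimal sketch, assuming `opentelemetry-sdk` and `opentelemetry-exporter-otlp-proto-http` are installed:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# The exporter reads OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS
# from the environment (and appends /v1/traces to the base endpoint).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("agent-workflows")
```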
Langfuse: open source LLM observability + prompt/eval workflows
Langfuse is compelling when:
- you want open source + self-hosting
- you care about data control (VPC/on-prem)
- you want tracing plus prompt management and eval/labeling workflows
It’s also OTel-based/compatible, which is a big deal for avoiding dead ends.
Example: model the “trace → observations → scores” loop
Even if you don’t adopt Langfuse immediately, the concept is right:
- Capture trace (what happened)
- Capture observation spans (LLM/tool/retrieval)
- Attach scores (quality, safety, correctness, user feedback)
That’s how you turn “debugging” into an improvement loop.
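Here is a sketch of that shape in plain Python, deliberately not the Langfuse SDK; the field names are illustrative:

```python
# Plain-data sketch of the loop; field names are illustrative, not Langfuse's API.
trace_record = {
    "trace_id": "run_9f2c",
    "name": "seo_publisher",
    "observations": [
        {"type": "retrieval", "name": "fetch_sources", "output_count": 12},
        {"type": "llm", "name": "draft_outline", "input_tokens": 1800, "output_tokens": 420},
        {"type": "tool", "name": "render_mdx", "error": None},
    ],
    "scores": [
        {"name": "correctness", "value": 0.8, "source": "llm-as-judge"},
        {"name": "user_feedback", "value": 1.0, "source": "thumbs"},
    ],
}

# The improvement loop in one line: low scores point you at the exact
# observation (and artifact) to fix.
needs_review = [s["name"] for s in trace_record["scores"] if s["value"] < 0.7]
```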
The nNode perspective: white-box first, dashboards second
Here’s the hard truth: observability won’t save a black-box workflow.
If your “agent” is a single mega-prompt that:
- plans,
- searches,
- calls tools,
- writes to DB,
- formats output,
- retries,
- and publishes…
…then your traces will be noisy and your failures will be expensive.
At nNode.ai, we take the opposite approach:
- One agent, one task
- Each step produces one explicit artifact
- Workflows checkpoint, so you can replay from the last good state
That changes the unit of debugging from “the whole run” to “the specific broken step.”
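Here is a simplified sketch of the checkpoint-and-replay idea, using a local file store for illustration (any durable store works; this shows the concept, not nNode's actual runtime):

```python
import json
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")  # illustrative; any durable store works

def run_step(run_id: str, step_id: str, step_fn, inputs: dict, replay: bool = False) -> dict:
    """Run a step, or return its checkpointed artifact if it already succeeded."""
    checkpoint = CHECKPOINT_DIR / f"{run_id}_{step_id}.json"
    if checkpoint.exists() and not replay:
        return json.loads(checkpoint.read_text())   # last good state: skip the cost
    artifact = step_fn(inputs)                      # only this step actually re-runs
    CHECKPOINT_DIR.mkdir(exist_ok=True)
    checkpoint.write_text(json.dumps(artifact))
    return artifact

# Replay only the broken step: upstream artifacts load from checkpoints,
# and render_mdx re-executes with the exact same inputs.
# research = run_step("run_9f2c", "research_01", do_research, {"topic": "..."})
# draft = run_step("run_9f2c", "render_mdx", render_mdx, {"summary": research}, replay=True)
```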
Artifact contracts: a concrete pattern
Define a JSON contract per step (even if the step outputs Markdown, store a structured envelope):
```json
{
  "artifact_type": "RESEARCH_SUMMARY",
  "workflow": "seo_publisher",
  "run_id": "run_9f2c",
  "step_id": "research_01",
  "source_urls": ["..."],
  "claims": [{"text": "...", "evidence": "..."}],
  "pii_redaction": {"applied": true}
}
```
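To make the contract enforceable, validate the envelope before the next step is allowed to consume it. A minimal sketch with the jsonschema library; the schema mirrors the fields above and should be tightened as your contracts harden:

```python
from jsonschema import ValidationError, validate  # pip install jsonschema

RESEARCH_SUMMARY_SCHEMA = {
    "type": "object",
    "required": ["artifact_type", "workflow", "run_id", "step_id", "claims"],
    "properties": {
        "artifact_type": {"const": "RESEARCH_SUMMARY"},
        "source_urls": {"type": "array", "items": {"type": "string"}},
        "claims": {
            "type": "array",
            "items": {"type": "object", "required": ["text", "evidence"]},
        },
    },
}

def check_artifact(artifact: dict) -> None:
    """Fail the step early, pointing at the exact field that broke the contract."""
    try:
        validate(instance=artifact, schema=RESEARCH_SUMMARY_SCHEMA)
    except ValidationError as err:
        raise RuntimeError(
            f"RESEARCH_SUMMARY contract violated at {list(err.path)}: {err.message}"
        ) from err
```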
Now you can answer:
- “Which input produced this bad claim?”
- “Was the retrieval empty?”
- “Did the tool return a partial response?”
Correlate artifacts with traces
The best practice that works with any stack:
- Put `run_id`, `step_id`, and `artifact_type` on every span as attributes
- Put `trace_id`/`span_id` in every artifact envelope
Pseudo-code:
```python
span = trace.get_current_span()
ctx = span.get_span_context()

artifact["otel"] = {
    "trace_id": format(ctx.trace_id, "032x"),
    "span_id": format(ctx.span_id, "016x"),
}
```
Now you can click from a trace into the exact artifact, and from an artifact back into the exact trace.
When each option wins (stage-based recommendations)
Stage 0: solo founder, prototype, shipping daily
Pick the tool that minimizes friction:
- If you’re in LangChain/LangGraph: LangSmith is often the fastest time-to-value.
- If you want open source/self-host from day one: Langfuse.
But do not skip artifact contracts. They’re what keep your workflow fixable.
Stage 1: small team, multiple workflows, real customers
Default recommendation:
- OpenTelemetry as the underlying standard (so traces aren’t trapped)
- Add LangSmith or Langfuse for LLM-native UX (depending on ecosystem + data needs)
At this stage, you’ll also want:
- consistent step naming
- run/step IDs
- PII boundaries
- sampling rules, so costs don’t explode (see the sketch below)
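OTel's built-in samplers get you surprisingly far here; a hedged sketch (the 10% ratio is arbitrary, and "always keep failed runs" usually means tail-based sampling in the Collector instead):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~10% of root traces; child spans follow their parent's decision,
# so a sampled workflow.run keeps all of its step/LLM/tool spans.
sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```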
Stage 2: multi-service systems, compliance, data retention requirements
- OpenTelemetry becomes non-negotiable.
- Choose Langfuse (self-host) or LangSmith (self-host/enterprise) based on governance and team needs.
- Build a real evaluation pipeline with ground truth and regressions.
Minimal implementation checklist (works with any stack)
Use this as your “don’t regret it later” list:
- Name every step (stable identifiers, not human prose)
- Emit a step artifact even if it’s “empty” (so you can see the gap)
- Validate artifacts (schema + required fields)
- Attach cost + token usage at the LLM span level
- Trace tool side effects (writes, sends, publishes) as spans/events
- Add a replay path: “rerun step X with the same inputs”
- Redaction policy: don’t store secrets/PII in traces or artifacts by default (a minimal scrubber is sketched below)
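Even a crude scrubber applied before values land in spans or artifact envelopes beats nothing; a minimal sketch (the patterns are illustrative and deliberately incomplete):

```python
import re

# Illustrative patterns only; a real policy needs an allowlist per artifact field.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),       # API-key-shaped strings
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # email addresses
]

def redact(text: str) -> str:
    """Scrub obvious secrets/PII before a value is attached to a span or artifact."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```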
Debugging playbook: a realistic incident walkthrough
Scenario: your “publish blog post” agent published a draft with broken code blocks.
- Symptom: user reports formatting issue
- Find the run: filter by `workflow.name=seo_publisher` and `run_id`
- Isolate the step: locate `step.name=render_mdx`
- Inspect the artifact: look at `MDX_DRAFT` and validate its schema
- Fix without rerunning everything: replay only `render_mdx` with the same `RESEARCH_SUMMARY` and `OUTLINE`
- Prevent recurrence: add an automated check (Markdown fence validation, sketched below) and fail the step early with a clear artifact
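That automated check can be almost trivial. A sketch of a fence validator that fails the `render_mdx` step early; it only checks for balanced fences, and you would extend it as real failures teach you more:

```python
FENCE = "`" * 3  # built this way so the check itself renders cleanly inside MDX

def check_code_fences(mdx: str) -> None:
    """Fail the render step early if code fences in the draft are unbalanced."""
    fence_lines = [ln for ln in mdx.splitlines() if ln.lstrip().startswith(FENCE)]
    if len(fence_lines) % 2 != 0:
        raise ValueError(f"Unbalanced code fences: {len(fence_lines)} fence markers in draft")
```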
This is the difference between “we observed it” and “we repaired it.”
Summary: the recommended default in 2026
If you want a simple rule that won’t embarrass you later:
- Use OpenTelemetry as your default tracing format.
- Add LangSmith or Langfuse based on your ecosystem and governance needs.
- Design workflows to be white-box with artifact contracts and checkpoint replay—because that’s what turns debugging into operations.
If you’re building agentic workflows that touch real business systems and you want them to be easy to write, debug, and modify, that’s exactly what we’re building at nNode.ai.
Want to see what “artifact-first + replayable” looks like in practice? Take a look at nnode.ai and join the early access list.