If you’re searching LangSmith vs Langfuse vs OpenTelemetry, you’re probably not asking for “better dashboards.” You’re asking a more painful question:
When my agent does something dumb (or expensive), how do I figure out why—and fix it without rerunning the entire workflow?
This post compares LangSmith, Langfuse, and OpenTelemetry (OTel) from the perspective of real agentic workflows: multi-step, tool-heavy, partially stateful, and prone to “it worked yesterday.”
Along the way, we’ll also cover a debugging model we use at nNode.ai: white-box workflows + artifact contracts + checkpoint replay. Tracing helps you find the problem. Artifacts and replay help you repair it.
The 2am questions observability must answer
Agent apps fail in a few repeatable ways. Your observability stack should answer these questions quickly:
- What did the agent see? (prompt, retrieved context, tool outputs)
- What did it decide? (tool selection, intermediate plans, routing decisions)
- What did it do? (tool calls, side effects, writes)
- Where did it diverge from expectations? (schema mismatch, prompt drift, bad retrieval)
- Can I re-run only the broken step? (or do I pay the full cost again?)
That last point is the difference between “debugging” and “operations.”
The 3 layers of agent observability (the model that actually works)
Most teams jump straight to tracing. In practice, you need three layers:
1) Artifacts (contracts): the what
Artifacts are explicit, named outputs of steps:
- Input artifact: what a step consumed
- Output artifact: what a step produced
- Schema: what “correct” looks like
If you can’t point to the exact JSON/Markdown an agent produced (and validate it), you don’t have a debuggable workflow—you have vibes.
2) Traces (timeline): the when and why
Tracing is the event stream:
- spans for LLM calls
- spans for tool calls
- timing, token/cost, errors
Traces are great at showing where time and money went—and which call caused the failure.
3) Metrics (SLOs): the how often
Metrics tell you whether the system is getting worse:
- error rate by step/tool
- p95 latency per workflow
- cost per successful run
- “human review required” rate
If you’re trying to run agent workflows daily, metrics are what keep you from slowly boiling alive.
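A minimal sketch of how those metrics can be emitted with the OpenTelemetry metrics API; the metric names and attributes are our own conventions, not a standard:

```python
from opentelemetry import metrics

# Assumes a MeterProvider/exporter is configured elsewhere; the API calls are the same.
meter = metrics.get_meter("agent-workflows")

step_errors = meter.create_counter(
    "workflow.step.errors", description="Failed steps, tagged by step and tool"
)
run_duration = meter.create_histogram(
    "workflow.run.duration_ms", unit="ms", description="End-to-end run duration"
)
run_cost = meter.create_histogram(
    "workflow.run.cost_usd", unit="usd", description="Cost attributed to successful runs"
)

# Record with attributes so you can slice by workflow/step/tool later.
step_errors.add(1, {"workflow.name": "seo_publisher", "step.name": "render_mdx"})
run_duration.record(48_200, {"workflow.name": "seo_publisher"})
run_cost.record(0.42, {"workflow.name": "seo_publisher", "status": "success"})
```

Your backend computes p95 latency from the histogram, and the "human review required" rate is just another counter divided by total runs.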
LangSmith vs Langfuse vs OpenTelemetry: what each is really for
Here’s the clean mental model:
- OpenTelemetry: the standard plumbing (portable traces across your whole stack)
- LangSmith: a strong developer UX for LangChain/LangGraph-style runs and evaluation workflows
- Langfuse: an open, self-hostable LLM engineering platform built around tracing, prompts, evals, and labeling/feedback loops
But the devil is in the operational details.
LangSmith vs Langfuse vs OpenTelemetry (OTel): decision table
| Dimension | LangSmith | Langfuse | OpenTelemetry (OTel) |
|---|---|---|---|
| Best at | Debugging + evaluation loops for LangChain/LangGraph apps | Open-source LLM observability platform + prompt/eval workflows | End-to-end distributed tracing standard across services |
| Setup time | Fastest if you’re already in LangChain ecosystem | Fast, but you run/operate the platform if self-hosting | Medium: you’ll wire SDKs + collector + backend |
| Vendor lock-in | Medium (improving with OTel ingestion/export) | Lower (OTel-based, open source, self-hostable) | Lowest (standard) |
| Self-hosting | Available (often enterprise-style ops) | First-class | Not a product by itself |
| “Connect to rest of stack” (DB, queues, HTTP, workers) | Possible (OTel support helps), but LangSmith is LLM-centric | Good (also supports OTel and broader tracing concepts) | Best (this is literally the point) |
| Team workflows | Great for sharing a run, annotating it, and evaluating changes | Great for collaboration + labeling + prompt management | Depends on the backend you choose (Jaeger, Honeycomb, Datadog, Grafana, etc.) |
| Default recommendation | Great if you’re deep in LangChain/LangGraph and want immediate run-level debugging | Great if you want open source + self-hosting + broad integrations | Great as the foundation if you’re building serious multi-service systems |
The practical takeaway: OTel is the foundation, and LangSmith/Langfuse are often the “LLM-native UI layer” you add on top.
OpenTelemetry for agentic workflows: the underrated default
If your agent system touches real business systems (Stripe, HubSpot, Postgres, browser automation, queues), you eventually want traces that correlate:
- the user request
- the LLM call(s)
- the tool calls
- the DB writes
- the downstream worker jobs
That’s OpenTelemetry’s home turf.
Minimal OTel span model for agents
Even if you do nothing fancy, get these span types in place:
- workflow.run (root span)
- workflow.step (one per step)
- llm.inference (one per model call)
- tool.execute (one per tool invocation)
In Python (simplified):
```python
from opentelemetry import trace

tracer = trace.get_tracer("agent-workflows")

with tracer.start_as_current_span("workflow.run", attributes={
    "workflow.name": "seo_publisher",
    "run.id": run_id,
    "user.id": user_id,
}):
    with tracer.start_as_current_span("workflow.step", attributes={
        "step.name": "research",
        "step.id": "step_01",
    }):
        # call LLM / tools here
        pass
```
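The inner spans follow the same pattern. Here is a hedged sketch of `llm.inference` and `tool.execute` spans with token and cost attributes attached; `call_model`, `estimate_cost`, and `run_tool` are hypothetical helpers, and the attribute names are our own convention rather than an official one:

```python
# Continues the snippet above (same `tracer`); the helpers below are hypothetical.
with tracer.start_as_current_span("llm.inference", attributes={
    "llm.model": "gpt-4o",  # whatever model you actually call
    "step.id": "step_01",
}) as llm_span:
    response = call_model(prompt)  # hypothetical model call
    llm_span.set_attribute("llm.tokens.input", response.usage.input_tokens)
    llm_span.set_attribute("llm.tokens.output", response.usage.output_tokens)
    llm_span.set_attribute("llm.cost_usd", estimate_cost(response))  # hypothetical cost helper

with tracer.start_as_current_span("tool.execute", attributes={
    "tool.name": "web_search",
    "step.id": "step_01",
}) as tool_span:
    result = run_tool("web_search", query)  # hypothetical tool dispatcher
    tool_span.set_attribute("tool.result.size_bytes", len(result))
```

This is also where "cost per successful run" gets its raw data: sum the cost attribute over the LLM spans in a run.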
Where OTel falls short (by itself)
OTel won’t magically give you:
- prompt playgrounds
- run comparisons
- eval pipelines
- human annotation queues
That’s why teams often pair OTel with an LLM-native platform (LangSmith or Langfuse) or use OTel as the common format for exporting to their preferred backend.
LangSmith: fastest path to “why did the chain do that?”
LangSmith tends to shine when:
- you’re already using LangChain/LangGraph
- you need run-level debugging (inputs/outputs per node)
- you want evaluation loops and regression testing around prompts/chains
It’s also increasingly OTel-friendly, which matters if you don’t want your agent traces isolated from the rest of your system.
Example: send OpenTelemetry traces to LangSmith
```bash
pip install "langsmith[otel]"
export OTEL_EXPORTER_OTLP_ENDPOINT="https://api.smith.langchain.com/otel"
export OTEL_EXPORTER_OTLP_HEADERS="x-api-key=<YOUR_LANGSMITH_API_KEY>"
```
If you’re running a self-hosted deployment, you typically update the endpoint to match your base URL.
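With those environment variables set, a stock OTLP HTTP exporter picks them up automatically, so the span model from earlier flows to LangSmith without code changes. A minimal sketch, assuming `opentelemetry-sdk` and `opentelemetry-exporter-otlp-proto-http` are installed:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# The exporter reads OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS
# from the environment (and appends /v1/traces to the base endpoint).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("agent-workflows")
```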
Langfuse: open source LLM observability + prompt/eval workflows
Langfuse is compelling when:
- you want open source + self-hosting
- you care about data control (VPC/on-prem)
- you want tracing plus prompt management and eval/labeling workflows
It’s also OTel-based/compatible, which is a big deal for avoiding dead ends.
Example: model the “trace → observations → scores” loop
Even if you don’t adopt Langfuse immediately, the concept is right:
- Capture trace (what happened)
- Capture observation spans (LLM/tool/retrieval)
- Attach scores (quality, safety, correctness, user feedback)
That’s how you turn “debugging” into an improvement loop.
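Here is a sketch of that shape in plain Python, deliberately not the Langfuse SDK; the field names are illustrative:

```python
# Plain-data sketch of the loop; field names are illustrative, not Langfuse's API.
trace_record = {
    "trace_id": "run_9f2c",
    "name": "seo_publisher",
    "observations": [
        {"type": "retrieval", "name": "fetch_sources", "output_count": 12},
        {"type": "llm", "name": "draft_outline", "input_tokens": 1800, "output_tokens": 420},
        {"type": "tool", "name": "render_mdx", "error": None},
    ],
    "scores": [
        {"name": "correctness", "value": 0.8, "source": "llm-as-judge"},
        {"name": "user_feedback", "value": 1.0, "source": "thumbs"},
    ],
}

# The improvement loop in one line: low scores point you at the exact
# observation (and artifact) to fix.
needs_review = [s["name"] for s in trace_record["scores"] if s["value"] < 0.7]
```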
The nNode perspective: white-box first, dashboards second
Here’s the hard truth: observability won’t save a black-box workflow.
If your “agent” is a single mega-prompt that:
- plans,
- searches,
- calls tools,
- writes to DB,
- formats output,
- retries,
- and publishes…
…then your traces will be noisy and your failures will be expensive.
At nNode.ai, we take the opposite approach:
- One agent, one task
- Each step produces one explicit artifact
- Workflows checkpoint, so you can replay from the last good state
That changes the unit of debugging from “the whole run” to “the specific broken step.”
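Here is a simplified sketch of the checkpoint-and-replay idea, using a local file store for illustration (any durable store works; this shows the concept, not nNode's actual runtime):

```python
import json
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")  # illustrative; any durable store works

def run_step(run_id: str, step_id: str, step_fn, inputs: dict, replay: bool = False) -> dict:
    """Run a step, or return its checkpointed artifact if it already succeeded."""
    checkpoint = CHECKPOINT_DIR / f"{run_id}_{step_id}.json"
    if checkpoint.exists() and not replay:
        return json.loads(checkpoint.read_text())   # last good state: skip the cost
    artifact = step_fn(inputs)                      # only this step actually re-runs
    CHECKPOINT_DIR.mkdir(exist_ok=True)
    checkpoint.write_text(json.dumps(artifact))
    return artifact

# Replay only the broken step: upstream artifacts load from checkpoints,
# and render_mdx re-executes with the exact same inputs.
# research = run_step("run_9f2c", "research_01", do_research, {"topic": "..."})
# draft = run_step("run_9f2c", "render_mdx", render_mdx, {"summary": research}, replay=True)
```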
Artifact contracts: a concrete pattern
Define a JSON contract per step (even if the step outputs Markdown, store a structured envelope):
```json
{
  "artifact_type": "RESEARCH_SUMMARY",
  "workflow": "seo_publisher",
  "run_id": "run_9f2c",
  "step_id": "research_01",
  "source_urls": ["..."],
  "claims": [{"text": "...", "evidence": "..."}],
  "pii_redaction": {"applied": true}
}
```
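To make the contract enforceable, validate the envelope before the next step is allowed to consume it. A minimal sketch with the jsonschema library; the schema mirrors the fields above and should be tightened as your contracts harden:

```python
from jsonschema import ValidationError, validate  # pip install jsonschema

RESEARCH_SUMMARY_SCHEMA = {
    "type": "object",
    "required": ["artifact_type", "workflow", "run_id", "step_id", "claims"],
    "properties": {
        "artifact_type": {"const": "RESEARCH_SUMMARY"},
        "source_urls": {"type": "array", "items": {"type": "string"}},
        "claims": {
            "type": "array",
            "items": {"type": "object", "required": ["text", "evidence"]},
        },
    },
}

def check_artifact(artifact: dict) -> None:
    """Fail the step early, pointing at the exact field that broke the contract."""
    try:
        validate(instance=artifact, schema=RESEARCH_SUMMARY_SCHEMA)
    except ValidationError as err:
        raise RuntimeError(
            f"RESEARCH_SUMMARY contract violated at {list(err.path)}: {err.message}"
        ) from err
```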
Now you can answer:
- “Which input produced this bad claim?”
- “Was the retrieval empty?”
- “Did the tool return a partial response?”
Correlate artifacts with traces
The best practice that works with any stack:
- Put `run_id`, `step_id`, and `artifact_type` on every span as attributes
- Put `trace_id`/`span_id` in every artifact envelope
Pseudo-code:
```python
span = trace.get_current_span()
ctx = span.get_span_context()

artifact["otel"] = {
    "trace_id": format(ctx.trace_id, "032x"),
    "span_id": format(ctx.span_id, "016x"),
}
```
Now you can click from a trace into the exact artifact, and from an artifact back into the exact trace.
When each option wins (stage-based recommendations)
Stage 0: solo founder, prototype, shipping daily
Pick the tool that minimizes friction:
- If you’re in LangChain/LangGraph: LangSmith is often the fastest time-to-value.
- If you want open source/self-host from day one: Langfuse.
But do not skip artifact contracts. They’re what keep your workflow fixable.
Stage 1: small team, multiple workflows, real customers
Default recommendation:
- OpenTelemetry as the underlying standard (so traces aren’t trapped)
- Add LangSmith or Langfuse for LLM-native UX (depending on ecosystem + data needs)
At this stage, you’ll also want:
- consistent step naming
- run/step IDs
- PII boundaries
- sampling rules, so costs don’t explode (see the sketch below)
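OTel's built-in samplers get you surprisingly far here; a hedged sketch (the 10% ratio is arbitrary, and "always keep failed runs" usually means tail-based sampling in the Collector instead):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~10% of root traces; child spans follow their parent's decision,
# so a sampled workflow.run keeps all of its step/LLM/tool spans.
sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```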
Stage 2: multi-service systems, compliance, data retention requirements
- OpenTelemetry becomes non-negotiable.
- Choose Langfuse (self-host) or LangSmith (self-host/enterprise) based on governance and team needs.
- Build a real evaluation pipeline with ground truth and regressions.
Minimal implementation checklist (works with any stack)
Use this as your “don’t regret it later” list:
- Name every step (stable identifiers, not human prose)
- Emit a step artifact even if it’s “empty” (so you can see the gap)
- Validate artifacts (schema + required fields)
- Attach cost + token usage at the LLM span level
- Trace tool side effects (writes, sends, publishes) as spans/events
- Add a replay path: “rerun step X with the same inputs”
- Redaction policy: don’t store secrets/PII in traces or artifacts by default (a minimal scrubber is sketched below)
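Even a crude scrubber applied before values land in spans or artifact envelopes beats nothing; a minimal sketch (the patterns are illustrative and deliberately incomplete):

```python
import re

# Illustrative patterns only; a real policy needs an allowlist per artifact field.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),       # API-key-shaped strings
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # email addresses
]

def redact(text: str) -> str:
    """Scrub obvious secrets/PII before a value is attached to a span or artifact."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```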
Debugging playbook: a realistic incident walkthrough
Scenario: your “publish blog post” agent published a draft with broken code blocks.
- Symptom: user reports formatting issue
- Find the run: filter by `workflow.name=seo_publisher` and `run_id`
- Isolate the step: locate `step.name=render_mdx`
- Inspect the artifact: look at `MDX_DRAFT` and validate its schema
- Fix without rerunning everything: replay only `render_mdx` with the same `RESEARCH_SUMMARY` and `OUTLINE`
- Prevent recurrence: add an automated check (Markdown fence validation, sketched below) and fail the step early with a clear artifact
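That automated check can be almost trivial. A sketch of a fence validator that fails the `render_mdx` step early; it only checks for balanced fences, and you would extend it as real failures teach you more:

```python
FENCE = "`" * 3  # built this way so the check itself renders cleanly inside MDX

def check_code_fences(mdx: str) -> None:
    """Fail the render step early if code fences in the draft are unbalanced."""
    fence_lines = [ln for ln in mdx.splitlines() if ln.lstrip().startswith(FENCE)]
    if len(fence_lines) % 2 != 0:
        raise ValueError(f"Unbalanced code fences: {len(fence_lines)} fence markers in draft")
```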
This is the difference between “we observed it” and “we repaired it.”
Summary: the recommended default in 2026
If you want a simple rule that won’t embarrass you later:
- Use OpenTelemetry as your default tracing format.
- Add LangSmith or Langfuse based on your ecosystem and governance needs.
- Design workflows to be white-box with artifact contracts and checkpoint replay—because that’s what turns debugging into operations.
If you’re building agentic workflows that touch real business systems and you want them to be easy to write, debug, and modify, that’s exactly what we’re building at nNode.ai.
Want to see what “artifact-first + replayable” looks like in practice? Take a look at nnode.ai and join the early access list.