LLM eval explorations

Most people overcomplicate model evaluation.
If you’re building with LLMs, you don’t need a research lab.
You just need a way to measure whether your system behaves as expected.

This is the simple setup I use across my own products.

What I Measure

I only track a few things that directly affect users:

  - Faithfulness: does the answer stay grounded in the retrieved context?
  - Context relevance: is retrieval pulling in the right documents?
  - Real-traffic behavior: how prompts and outputs actually look in production.
That’s enough to catch drift and guide iteration.

Tools

DeepEval

DeepEval is a Python library that works like Pytest for LLMs.
It’s quick to script and easy to drop into CI.

from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# FaithfulnessMetric judges the answer against the retrieved context,
# so the test case needs retrieval_context rather than context.
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris",
    retrieval_context=["France's capital is Paris"]
)
metric = FaithfulnessMetric(threshold=0.7)
assert_test(test_case, [metric])

Run these tests automatically before deploys and track regressions over time.
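
One way to wire that up is to wrap the check in an ordinary pytest-style test. A minimal sketch, assuming a test file and example data of my own (the file name, examples, and threshold are placeholders):

# test_evals.py
import pytest
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

EXAMPLES = [
    ("What is the capital of France?", "Paris", ["France's capital is Paris"]),
]

@pytest.mark.parametrize("question, answer, retrieved", EXAMPLES)
def test_faithfulness(question, answer, retrieved):
    test_case = LLMTestCase(
        input=question,
        actual_output=answer,
        retrieval_context=retrieved,
    )
    # Fails the pipeline if the faithfulness score drops below the threshold.
    assert_test(test_case, [FaithfulnessMetric(threshold=0.7)])

The metric uses an LLM judge under the hood, so the CI job needs the relevant API key. You can run the file with plain pytest, or with DeepEval's own runner (deepeval test run test_evals.py) for its reporting.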

Ragas

If your product uses retrieval, Ragas adds metrics for context relevance and hallucination.
It’s useful for debugging when your model starts citing the wrong documents.
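
A minimal sketch of a Ragas run, using its Dataset-based API (the interface has shifted between Ragas versions, so treat the column names as assumptions):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, faithfulness

# One evaluation row: the question, the model's answer, the chunks the
# retriever returned, and a reference answer for context_precision.
data = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["Paris"],
    "contexts": [["France's capital is Paris"]],
    "ground_truth": ["Paris"],
})

result = evaluate(data, metrics=[context_precision, faithfulness])
print(result)  # low context_precision points at retrieval, not the prompt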

Langfuse or LangSmith

For production, I log all prompts and outputs to Langfuse (self-hosted) or LangSmith.
This gives visibility into how the model performs on real traffic.
It’s tracing, not research.
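
A minimal sketch of the logging side, using the decorator from Langfuse's v2 Python SDK (generate_answer and call_model are placeholders for your own code; newer SDK versions expose observe differently):

from langfuse.decorators import observe

def call_model(question: str) -> str:
    # Placeholder for your real LLM client call.
    return "Paris"

@observe()
def generate_answer(question: str) -> str:
    # The decorator records the input, output, and latency as a trace.
    return call_model(question)

print(generate_answer("What is the capital of France?"))

Credentials and the self-hosted URL come from the LANGFUSE_* environment variables, so nothing else changes in application code.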

My Workflow

  1. Collect recent user queries and responses.
  2. Run DeepEval and Ragas locally.
  3. Review low scores and fix prompts or retrieval.
  4. Re-run tests until consistent.
  5. Monitor production traces in Langfuse.
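
To make steps 1 and 2 concrete, here is a minimal sketch that turns an export of recent queries and responses into DeepEval test cases (traffic.jsonl and its field names are assumptions about your logging format):

import json
from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# traffic.jsonl is a placeholder: one {"question", "answer", "contexts"}
# object per line, exported from your trace store.
test_cases = []
with open("traffic.jsonl") as f:
    for line in f:
        row = json.loads(line)
        test_cases.append(LLMTestCase(
            input=row["question"],
            actual_output=row["answer"],
            retrieval_context=row["contexts"],
        ))

# Score the whole batch, then review the lowest-scoring cases by hand.
evaluate(test_cases=test_cases, metrics=[FaithfulnessMetric(threshold=0.7)])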

Simple, repeatable, and fast to maintain.

Why This Works

It covers the essentials: automated checks before every deploy, retrieval-specific debugging when answers go wrong, and visibility into real traffic.
If you ever outgrow it, you can move to Arize or Humanloop.
But for most builders, this stack is enough.