Running Evaluations
As you saw in the quickstart section earlier, a test run on Confident AI is created when you run an evaluation with `deepeval`. A test run is made up of a list of test cases, each containing the `input`, `actual_output`, etc. of your LLM application, along with its own set of metric scores that together represent the performance of your LLM application as a whole. In short, a test run is a benchmark for your LLM application.
A test run can be created in one of two ways:

- Evaluating test cases locally with `deepeval` and sending the results to Confident AI after evaluation.
- Running an evaluation directly on Confident AI.
You can also think of LLM evaluation with Confident AI and `deepeval` as unit-testing for LLMs.
Before you begin, you'll definitely want to familiarize yourself with what a test case is, as it is vital for LLM evaluation.
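If you need a quick refresher, here's a minimal sketch of what a test case looks like (the strings below are made up purely for illustration):

```python
from deepeval.test_case import LLMTestCase

# A test case captures a single interaction with your LLM application
test_case = LLMTestCase(
    input="What should I do for a mild fever?",
    actual_output="Rest, stay hydrated, and monitor your temperature.",
    retrieval_context=["Mild fevers usually resolve with rest and fluids."],
)
```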
Running An Evaluation
To run an evaluation, the first step is to choose your LLM evaluation metrics, and you can see what's available here (custom metrics are available if none of `deepeval`'s pre-defined metrics resonate with your use case).
For an agentic evaluation use case, you can also include `LLMTestCase` parameters such as `tools_called` to encapsulate what tools were used during the generation of each `actual_output` (see the sketch below). For this example though, since we're just evaluating a RAG-based medical chatbot with no function/tool calling, we'll only populate the `input`, `actual_output`, and `retrieval_context` for our `LLMTestCase`.
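For reference, an agentic test case might look roughly like this - the tool name is hypothetical, and the exact `ToolCall` fields may differ depending on your `deepeval` version:

```python
from deepeval.test_case import LLMTestCase, ToolCall

# Hypothetical agentic example: record which tools were invoked while generating actual_output
agentic_test_case = LLMTestCase(
    input="Book me a follow-up appointment next Tuesday.",
    actual_output="Your follow-up appointment is booked for next Tuesday at 10am.",
    tools_called=[ToolCall(name="book_appointment")],
)
```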
For the sake of simplicity, we'll reuse the example in the quickstart for demonstration purposes, which uses the `AnswerRelevancyMetric` and `FaithfulnessMetric`; these require the `input`, `actual_output`, and `retrieval_context` for each `LLMTestCase`.
Also, don't forget to log in to Confident AI:
```bash
deepeval login
```
```python
from hypothetical_chatbot import hypothetical_chatbot  # replace with your own LLM application
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval import evaluate

# Use dataset from last section
dataset = EvaluationDataset()
dataset.pull(alias="QA Dataset")

# Create a list of LLMTestCases
test_cases = []
for golden in dataset.goldens:
    input = golden.input
    # Replace with your own LLM application
    actual_output, retrieval_context = hypothetical_chatbot(input)
    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output,
        retrieval_context=retrieval_context
    )
    test_cases.append(test_case)

# Define metrics
answer_relevancy = AnswerRelevancyMetric(threshold=0.5)
faithfulness = FaithfulnessMetric(threshold=0.5)

# Run evaluation
evaluate(test_cases=test_cases, metrics=[answer_relevancy, faithfulness], identifier="First Run")
```
You'll notice that unlike the quickstart guide, we added an `identifier` argument to this `evaluate()` call. The `identifier` is a string you can supply to identify test runs more easily; otherwise, all you'll see on Confident AI are the gibberish test run IDs we've generated for you. To learn about all the parameters `evaluate()` accepts, click here.
As you'll see on Confident AI, your second test case is actually failing the `FaithfulnessMetric` - that's fine, we're going to fix it. If you click on the View test case details button, you'll be able to see the reasons why the `FaithfulnessMetric` failed, as well as the `verbose_logs`, which detail how the metric was calculated.
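If you'd like to see similar logs locally while iterating, `deepeval` metrics also accept a `verbose_mode` flag (assuming your `deepeval` version supports it), which prints the metric's intermediate steps to your console:

```python
from deepeval.metrics import FaithfulnessMetric

# Prints the metric's intermediate reasoning to the console while it runs
faithfulness = FaithfulnessMetric(threshold=0.5, verbose_mode=True)
```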
Running Another Evaluation
For now, let's pretend that we've improved our medical chatbot, and here is the newly generated `actual_output` for the second `LLMTestCase` after one iteration:
"""If you cut your finger deeply, rinse it with soap and water, apply pressure to stop bleeding,
and cover it with a clean bandage. Seek medical care if it keeps bleeding, is very deep, or shows
signs of infection like redness or swelling. Ensure your tetanus shot is up to date."""
Go ahead and replace the second `actual_output` with the latest one above, and run the evaluation again:
```python
from hypothetical_chatbot import hypothetical_chatbot  # replace with your own LLM application
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

# Use dataset from last section
dataset = EvaluationDataset()
dataset.pull(alias="QA Dataset")

# Create a list of LLMTestCases
test_cases = []
for golden in dataset.goldens:
    input = golden.input
    # Don't forget to hardcode the second actual_output!
    actual_output, retrieval_context = hypothetical_chatbot(input)
    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output,
        retrieval_context=retrieval_context
    )
    test_cases.append(test_case)

# Define metrics
answer_relevancy = AnswerRelevancyMetric(threshold=0.5)
faithfulness = FaithfulnessMetric(threshold=0.5)

# Run evaluation
evaluate(test_cases=test_cases, metrics=[answer_relevancy, faithfulness], identifier="Second Run")
```
Don't forget to update your `identifier`, and congratulations 🎉! All your test cases are now passing with flying colors.
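In case it helps, here's one purely illustrative way to hardcode the improved `actual_output` for the second golden inside the loop above - the index-based check is an assumption about how you'd identify it, and the rest of the loop stays exactly the same:

```python
IMPROVED_OUTPUT = """If you cut your finger deeply, rinse it with soap and water, apply pressure to stop bleeding,
and cover it with a clean bandage. Seek medical care if it keeps bleeding, is very deep, or shows
signs of infection like redness or swelling. Ensure your tetanus shot is up to date."""

for i, golden in enumerate(dataset.goldens):
    actual_output, retrieval_context = hypothetical_chatbot(golden.input)
    if i == 1:  # the second golden in "QA Dataset" (assumed index)
        actual_output = IMPROVED_OUTPUT
    # ...build and append the LLMTestCase exactly as before
```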
Comparing Two Test Runs
Now that you've created two test runs (`identifier="First Run"` and `identifier="Second Run"`), each a benchmark of your LLM application, you can compare test results side by side to identify regressions and improvements.
In this example we only have two test cases, but this scales to thousands, allowing you to quickly identify how your most recent changes to your LLM application affect its performance.
In your individual Test Run page, navigate to the Compare Test Results page through the side navigation drawer, and select a test run to compare against in the dropdown menu. The rows you see when you open the dropdown are the test run IDs of historical test runs, and if you supplied an `identifier` during `evaluate()`, you'll be able to easily tell which test run is which.
Confident AI groups your test cases by their `name`, so if you want to compare test runs legitimately you should supply a hard-coded name for the same `LLMTestCase`s you want to compare side by side.
```python
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(name="239874", ...)
```
You can learn more about labelling test cases here.
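For example, in the test case loops from earlier you could derive a stable `name` from each golden's position in the dataset - this is just one possible convention, not something Confident AI mandates:

```python
test_cases = []
for i, golden in enumerate(dataset.goldens):
    actual_output, retrieval_context = hypothetical_chatbot(golden.input)
    test_cases.append(
        LLMTestCase(
            name=f"golden-{i}",  # reuse the same name across test runs so they can be paired up
            input=golden.input,
            actual_output=actual_output,
            retrieval_context=retrieval_context,
        )
    )
```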
By selecting one of the historical test runs, you'll be able to see exactly how the metric scores differed across them, helping you work out whether your latest change is an improvement or regression, and whether it caused any unacceptable breaking changes. Luckily for us, it didn't.
Testing Reports On Confident AI
All test runs, whether created on Confident AI or locally via `deepeval` (i.e. `evaluate()` or `deepeval test run`), are automatically sent to Confident AI for anyone in your organization to view, share, download, and comment on. Here is a list of things you can do with Confident AI's testing reports:
- Analyze metric score distributions, averages, and median scores.
- Inspect test cases, including parameters such as `actual_output`, `retrieval_context`, and `tools_called`, and their associated metric results, including score, reasoning, errors (if any), and verbose logs (for debugging your evaluation model's chain of thought).
- Filter for and select specific test cases for sharing.
- Download test cases as CSV or JSON (if you're running a conversational test run).
- Create a dataset from the test cases in a test run.
- Compare different test runs side by side.
- Create a public link to share with external stakeholders that might be interested in your LLM evaluation process.
Confident AI also keeps track of additional metadata, such as the time taken for each test run to finish executing and the evaluation cost associated with it.
Logging Models, Prompts, And More
In the previous sections, we saw how to create a test run by dynamically generating `actual_output`s for pulled datasets at evaluation time. In addition to that, if you want to experiment with different hyperparameter values - for example, to iterate towards the optimal LLM for your application - you have to associate those hyperparameter values with test runs so you can compare them and pick the best hyperparameters on Confident AI.
Continuing from the previous example, if our hypothetical chatbot uses `gpt-4o` as its LLM, you can associate the `gpt-4o` value as such:
```python
...
evaluate(
    test_cases=test_cases,
    metrics=[answer_relevancy, faithfulness],
    hyperparameters={"model": "gpt-4o", "prompt template": "your-prompt-template-goes-here"}
)
```
You can also run a nested for loop to generate `actual_output`s and get test run evaluation results for all hyperparameter combinations:
```python
for model in models:
    for prompt in prompts:
        evaluate(..., hyperparameters={...})
```
This helps Confident AI associate each test run's results with the hyperparameter combination that produced them, which you can then easily filter for and visualize on Confident AI to see which combination performs best.
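To make that loop concrete, here's a rough sketch under some assumptions - the candidate `models`, `prompt_templates`, and the way `hypothetical_chatbot` accepts a model and prompt template are all made up for illustration:

```python
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from hypothetical_chatbot import hypothetical_chatbot  # hypothetical LLM application

dataset = EvaluationDataset()
dataset.pull(alias="QA Dataset")

models = ["gpt-4o", "gpt-4o-mini"]  # hypothetical candidate models
prompt_templates = ["You are a careful medical assistant...", "Answer concisely..."]  # hypothetical prompts

for model in models:
    for prompt_template in prompt_templates:
        test_cases = []
        for golden in dataset.goldens:
            # Assumes your app can be parameterized by model and prompt template
            actual_output, retrieval_context = hypothetical_chatbot(
                golden.input, model=model, prompt_template=prompt_template
            )
            test_cases.append(
                LLMTestCase(
                    input=golden.input,
                    actual_output=actual_output,
                    retrieval_context=retrieval_context,
                )
            )
        # One test run per hyperparameter combination
        evaluate(
            test_cases=test_cases,
            metrics=[AnswerRelevancyMetric(threshold=0.5), FaithfulnessMetric(threshold=0.5)],
            hyperparameters={"model": model, "prompt template": prompt_template},
        )
```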
In the next section, we'll show how you can set up this workflow in CI/CD pipelines to unit-test your LLM application.