
Using Datasets For Evaluation

You can pull evaluation datasets created on Confident AI, much like you would pull code from a GitHub repo, by specifying the dataset alias, which is the unique name of your dataset within a project.

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="QA Dataset", auto_convert_goldens_to_test_cases=True)
print(dataset)

Once the dataset is pulled locally, you can start running evaluations using deepeval's metrics.

from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate
...

evaluate(dataset, metrics=[AnswerRelevancyMetric()])
caution

Here we're just showing the simplest example. In reality, you will likely have to generate actual_outputs before running evaluations, since the datasets you store on Confident AI will typically not have pre-computed actual_outputs.

When you pull a dataset from Confident AI, you are pulling raw goldens, not test cases. If your dataset does not have pre-computed actual_output, retrieval_context, and/or tools_called, you'll need an additional step to generate these parameters at evaluation time based on the input of each golden.

In this case, auto_convert_goldens_to_test_cases should be set to False (which is the default value).
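
As a quick sketch of what this looks like in practice (assuming the "QA Dataset" alias exists and its goldens have no pre-computed actual_output), pulling populates dataset.goldens rather than dataset.test_cases:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
# auto_convert_goldens_to_test_cases defaults to False, so only goldens are populated
dataset.pull(alias="QA Dataset")

print(len(dataset.goldens))     # number of goldens fetched from Confident AI
print(len(dataset.test_cases))  # 0 until you convert goldens into test cases yourself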

Pull Your Dataset

Pull a dataset from Confident by specifying its alias:

from deepeval.dataset import EvaluationDataset

# Initialize empty dataset object
dataset = EvaluationDataset()

# Pull from Confident
dataset.pull(alias="QA Dataset")

The pull() method accepts one mandatory and one optional argument:

  • alias: the alias of your dataset on Confident. A dataset alias is unique for each project.
  • [Optional] auto_convert_goldens_to_test_cases: When set to True, dataset.pull() will automatically convert all goldens fetched from Confident into test cases and override all test cases currently in your EvaluationDataset instance. Defaults to False.
info

Essentially, auto_convert_goldens_to_test_cases is convenient if you have a complete, pre-computed dataset on Confident ready for evaluation. However, this might not always be the case. To disable the automatic conversion of goldens to test cases within your dataset, set auto_convert_goldens_to_test_cases to False. You might find this useful if you:

  • have goldens in your EvaluationDataset that are missing essential components, such as the actual_output, to be converted to test cases for evaluation. This may be the case if you're looking to generate actual_outputs at evaluation time. Remember, a golden does not require an actual_output, but a test case does.
  • have extra data preprocessing to do before converting goldens to test cases. Remember, goldens have an extra additional_metadata field, a dictionary of additional metadata you can use to generate custom outputs.

Convert Goldens Into Test Cases

Here is an example of how you can use goldens as an intermediary variable to generate an actual_output before converting them to test cases for evaluation:

# Pull from Confident
dataset.pull(alias="QA Dataset")

Then, process the goldens and convert them into test cases:

# A hypothetical LLM application example
from chatbot import query
from typing import List
from deepeval.test_case import LLMTestCase
from deepeval.dataset import Golden
...

def convert_goldens_to_test_cases(goldens: List[Golden]) -> List[LLMTestCase]:
    test_cases = []
    for golden in goldens:
        test_case = LLMTestCase(
            input=golden.input,
            # Generate actual output using the 'input' and 'additional_metadata'
            actual_output=query(golden.input, golden.additional_metadata),
            expected_output=golden.expected_output,
            context=golden.context,
        )
        test_cases.append(test_case)
    return test_cases

# Data preprocessing before setting the dataset test cases
dataset.test_cases = convert_goldens_to_test_cases(dataset.goldens)

Running Evaluations With Datasets

To use the dataset in the previous example, choose your metric(s) as laid out in previous sections of this documentation and simply call the evaluate function:

from deepeval.metrics import HallucinationMetric
...

evaluate(dataset, metrics=[HallucinationMetric()])

You'll definitely want to reuse the same dataset when testing a given functionality of your LLM use case, as it is the only valid way of conducting experiments to determine which implementation of your LLM application performs best. For example, if you're building a text-to-SQL application, you'll probably need a different dataset for each database you're building it for.

tip

Don't forget to log in, as you'll only be able to see evaluation results/test runs on Confident AI if you've supplied your Confident AI API key.

deepeval login
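
If you'd rather log in non-interactively (for example, in a script), you can also pass your API key directly, as done in the CI/CD example later in this section:

deepeval login --confident-api-key "YOUR_CONFIDENT_API_KEY"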

The idea is to keep the same set of inputs while comparing the actual_outputs each implementation generates; based on these metric results, you'll be able to compare benchmarks and see which LLM implementation performs best.
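
For example, here's a minimal sketch of such a comparison, where query_v1 and query_v2 are two hypothetical versions of your LLM application:

from typing import List
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate

# Hypothetical implementations you want to benchmark against each other
from my_llm_app import query_v1, query_v2

dataset = EvaluationDataset()
dataset.pull(alias="QA Dataset")

for implementation in [query_v1, query_v2]:
    test_cases: List[LLMTestCase] = [
        LLMTestCase(
            input=golden.input,
            # Same inputs, different actual_outputs per implementation
            actual_output=implementation(golden.input),
            expected_output=golden.expected_output,
        )
        for golden in dataset.goldens
    ]
    # Each call produces a separate test run you can compare on Confident AI
    evaluate(test_cases, metrics=[AnswerRelevancyMetric()])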

In CI/CD Pipelines

You can also incorporate testing into CI/CD pipelines using deepeval's extensive Pytest integration: execute deepeval test run in your CI/CD pipeline to regression test your LLM application using the dataset you've defined on Confident AI.

note

In later sections, you'll also get the full version of how to unit-test in CI/CD pipelines, so feel free to skip this and save it for later.

Here, we'll use GitHub Actions as an example. First, create a test file, test_llm.py:

test_llm.py
import pytest

from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import assert_test

...
dataset = EvaluationDataset()
dataset.pull(alias="Your Dataset Alias")

# use the convert_goldens_to_test_cases function in the previous example
dataset.test_cases = convert_goldens_to_test_cases(dataset.goldens)

@pytest.mark.parametrize("test_case", dataset)
def test_llm_app(test_case: LLMTestCase):
    assert_test(test_case, [AnswerRelevancyMetric()])
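
Before wiring this into CI/CD, you can verify the test file locally with:

deepeval test run test_llm.py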

Finally, create a YAML file that would be used in your CI/CD pipelines to run this file:

.github/workflows/regression.yml
name: LLM Regression Test

on:
  push:
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Install Poetry
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
          echo "$HOME/.local/bin" >> $GITHUB_PATH

      - name: Install Dependencies
        run: poetry install --no-root

      - name: Login to Confident AI
        env:
          CONFIDENT_API_KEY: ${{ secrets.CONFIDENT_API_KEY }}
        run: poetry run deepeval login --confident-api-key "$CONFIDENT_API_KEY"

      - name: Run DeepEval Test Run
        run: poetry run deepeval test run test_llm.py
tip

Don't forget to also log in to Confident AI, otherwise you won't be able to pull your dataset or see testing reports on Confident AI.

Using Datasets on Confident AI

With a dataset, you can enable anyone to run evaluations directly on Confident AI at the click of a button, without needing to code. It involves a 4-step process:

  1. Create an experiment and specify which metrics you wish to run evaluations with.
  2. Specify which dataset you'll be using for evaluation.
  3. Set up and connect your LLM endpoint for Confident AI to prompt at evaluation time; test cases will be constructed for evaluation from the actual_output returned by your LLM endpoint (see the sketch after this list).
  4. Press the evaluate button on the Evaluations page to start an evaluation.
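
For step 3, the exact request/response contract is defined when you configure the LLM connection on Confident AI; the sketch below is only a generic illustration (the /generate path and the field names are placeholders, not Confident AI's actual schema) of an HTTP service that accepts an input and returns the generated actual_output:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

def my_llm_app(input: str) -> str:
    # Hypothetical stand-in for your actual LLM application
    return "..."

class EvalRequest(BaseModel):
    input: str  # placeholder field name

class EvalResponse(BaseModel):
    actual_output: str  # placeholder field name

@app.post("/generate")  # placeholder path
def generate(request: EvalRequest) -> EvalResponse:
    return EvalResponse(actual_output=my_llm_app(request.input))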
info

We don't recommend setting up this workflow unless you need to be able to run evaluations without touching code. There is nothing you can do by running evaluations on Confident AI that you cannot do by evaluating locally with deepeval.

Running evaluations with deepeval gives you more flexibility for metrics and customizations, but running evaluations on Confident AI unlocks evaluations for non-technical team members.