Using Datasets For Evaluation
You can pull evaluation datasets created on Confident AI, much like you would pull code from a GitHub repo, by specifying the dataset alias, which is the unique name of your dataset within a project.
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset()
dataset.pull(alias="QA Dataset", auto_convert_goldens_to_test_cases=True)
print(dataset)
Once the dataset is pulled locally, you can start running evaluations using deepeval's metrics.
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate
...
evaluate(dataset, metrics=[AnswerRelevancyMetric()])
Here we're just showing the simplest example. In reality, you will likely have to generate actual_outputs before running evaluations, as the datasets you store on Confident AI will typically not have pre-computed actual_outputs.
When you pull a dataset from Confident AI, you are pulling raw goldens, not test cases. If your dataset does not have pre-computed actual_output, retrieval_context, and/or tools_called, you'll need an additional step to generate these parameters at evaluation time based on the input of each golden. In this case, auto_convert_goldens_to_test_cases should be set to False (which is the default value).
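For instance, here's a minimal sketch (reusing the "QA Dataset" alias from the example above) of pulling raw goldens only, without converting them into test cases:
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
# Keep the pulled goldens as raw goldens; no test cases are created yet
dataset.pull(alias="QA Dataset", auto_convert_goldens_to_test_cases=False)

# The raw goldens are now available for you to process into test cases yourself
print(dataset.goldens)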
Pull Your Dataset
Pull your dataset from Confident AI by specifying its alias:
from deepeval.dataset import EvaluationDataset
# Initialize empty dataset object
dataset = EvaluationDataset()
# Pull from Confident
dataset.pull(alias="QA Dataset")
The pull() method accepts one mandatory and one optional argument:
- alias: the alias of your dataset on Confident AI. A dataset alias is unique for each project.
- [Optional] auto_convert_goldens_to_test_cases: when set to True, dataset.pull() will automatically convert all goldens that were fetched from Confident AI into test cases and override all test cases you currently have in your EvaluationDataset instance. Defaulted to False.
Essentially, auto_convert_goldens_to_test_cases is convenient if you have a complete, pre-computed dataset on Confident AI ready for evaluation. However, this might not always be the case. To disable the automatic conversion of goldens to test cases within your dataset, set auto_convert_goldens_to_test_cases to False. You might find this useful if you:
- have goldens in your EvaluationDataset that are missing essential components, such as the actual_output, to be converted to test cases for evaluation. This may be the case if you're looking to generate actual_outputs at evaluation time. Remember, a golden does not require an actual_output, but a test case does.
- have extra data preprocessing to do before converting goldens to test cases. Remember, goldens have an extra additional_metadata field, which is a dictionary that contains additional metadata for you to generate custom outputs.
Convert Goldens Into Test Cases
Here is an example of how you can use goldens as an intermediary variable to generate an actual_output before converting them to test cases for evaluation:
# Pull from Confident
dataset.pull(alias="QA Dataset")
Then, process the goldens and convert them into test cases:
# A hypothetical LLM application example
from chatbot import query
from typing import List
from deepeval.test_case import LLMTestCase
from deepeval.dataset import Golden
...
def convert_goldens_to_test_cases(goldens: List[Golden]) -> List[LLMTestCase]:
    test_cases = []
    for golden in goldens:
        test_case = LLMTestCase(
            input=golden.input,
            # Generate actual output using the 'input' and 'additional_metadata'
            actual_output=query(golden.input, golden.additional_metadata),
            expected_output=golden.expected_output,
            context=golden.context,
        )
        test_cases.append(test_case)
    return test_cases
# Data preprocessing before setting the dataset test cases
dataset.test_cases = convert_goldens_to_test_cases(dataset.goldens)
Running Evaluations With Datasets
To use the dataset in the previous example, choose your metric(s) as laid out in the previous sections of this documentation and simply call the evaluate function:
from deepeval.metrics import HallucinationMetric
...
evaluate(dataset, metrics=[HallucinationMetric()])
You'll definitely want to reuse the same dataset to test a certain functionality of your LLM use case, as it is the only valid way of conducting experiments to determine which implementation of your LLM application performs best. For example, if you're building a text-to-SQL application, you'll probably need a different dataset for each database you're generating SQL for.
Don't forget to log in, as you'll only be able to see evaluation results/test runs on Confident AI if you've supplied your Confident AI API key.
deepeval login
The idea is that, for the same set of inputs, you can see what actual_outputs each implementation generates, and based on these metric results you'll be able to compare benchmarks to see which LLM implementation performs best.
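As a rough sketch of that workflow, the query_v1 and query_v2 functions below are hypothetical stand-ins for two competing implementations of your LLM application; each evaluate call produces a separate test run you can compare on Confident AI:
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset
from deepeval import evaluate
# Hypothetical: two competing implementations of your LLM application
from chatbot import query_v1, query_v2

dataset = EvaluationDataset()
dataset.pull(alias="QA Dataset")

for query in [query_v1, query_v2]:
    # Build test cases for this implementation from the same goldens
    dataset.test_cases = [
        LLMTestCase(
            input=golden.input,
            actual_output=query(golden.input, golden.additional_metadata),
            expected_output=golden.expected_output,
            context=golden.context,
        )
        for golden in dataset.goldens
    ]
    # One test run per implementation, all against the same set of inputs
    evaluate(dataset, metrics=[AnswerRelevancyMetric()])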
In CI/CD Pipelines
You can also incorporate testing into CI/CD pipelines using deepeval's extensive Pytest integration. Execute deepeval test run in your pipeline to regression test your LLM application using the dataset you've defined on Confident AI.
In later sections, you'll get the full walkthrough of how to unit-test in CI/CD pipelines, so feel free to skip this and save it for later.
Here, we'll be using GitHub Actions as an example. First, create a test file, test_llm.py:
import pytest
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import assert_test
...
dataset = EvaluationDataset()
dataset.pull(alias="Your Dataset Alias")
# use the convert_goldens_to_test_cases function in the previous example
dataset.test_cases = convert_goldens_to_test_cases(dataset.goldens)

@pytest.mark.parametrize("test_case", dataset)
def test_llm_app(test_case: LLMTestCase):
    assert_test(test_case, [AnswerRelevancyMetric()])
Finally, create a YAML file that will be used in your CI/CD pipeline to run this test file:
name: LLM Regression Test

on:
  push:
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Install Poetry
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
          echo "$HOME/.local/bin" >> $GITHUB_PATH

      - name: Install Dependencies
        run: poetry install --no-root

      - name: Login to Confident AI
        env:
          CONFIDENT_API_KEY: ${{ secrets.CONFIDENT_API_KEY }}
        run: poetry run deepeval login --confident-api-key "$CONFIDENT_API_KEY"

      - name: Run DeepEval Test Run
        run: poetry run deepeval test run test_llm.py
Don't forget to log in to Confident AI in your pipeline, otherwise you won't be able to pull your dataset or see testing reports on Confident AI.
Using Datasets on Confident AI
With a dataset, you can enable anyone to run evaluations directly on Confident AI at the click of a button, without needing to code. It involves a 4-step process:
- Create an experiment and specify which metrics you wish to run evaluations with.
- Specify which dataset you'll be using for evaluation.
- Set up and connect your LLM endpoint for Confident AI to prompt at evaluation time. Test cases will be evaluated based on the actual_output your LLM connection endpoint returns.
- Press the evaluate button on the Evaluations page to start an evaluation.
We don't recommend setting up this workflow unless you need to be able to run evaluations without touching code. There is nothing you can do by running evaluations on Confident AI that you cannot do by evaluating locally with deepeval.
Running evaluations with deepeval gives you more flexibility for metrics and customization, but running evaluations on Confident AI unlocks evaluations for non-technical team members.