Unit-Testing In CI/CD
In the previous section, we learnt how to evaluate your LLM application using the evaluate() function. Unit-testing LLM applications in CI/CD pipelines isn't so different: it simply involves moving that workflow into Pytest-style test files and a YAML file that runs the evaluations in your CI/CD pipeline.
Hence, it is extremely easy to set up, and you can reuse a lot of code from previous sections.
deepeval originally got traction on GitHub due to its "Pytest for LLMs" positioning, and no other framework does unit-testing as well as deepeval, which means you're in for a treat when using Confident AI in CI/CD pipelines.
We briefly covered unit-testing in the datasets section when showing how to use datasets in CI/CD pipelines, but this page is much more comprehensive.
Create Your Test File
First, let's create a test file, test_llm_app.py; note that we'll be reusing a lot of code from previous sections.
Your test file's name must start with test_; that's simply how Pytest discovers tests.
import pytest
from hypothetical_chatbot import hypothetical_chatbot  # replace with your own LLM application
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval import assert_test

# Use dataset from previous sections
dataset = EvaluationDataset()
dataset.pull(alias="QA Dataset")

# Create a list of LLMTestCases from the pulled goldens
test_cases = []
for golden in dataset.goldens:
    input = golden.input
    # Replace with your own LLM application
    actual_output, retrieval_context = hypothetical_chatbot(input)
    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output,
        retrieval_context=retrieval_context
    )
    test_cases.append(test_case)

# Make dataset ready for evaluation
dataset.test_cases = test_cases

@pytest.mark.parametrize("test_case", dataset)
def test_llm_app(test_case: LLMTestCase):
    assert_test(test_case, [AnswerRelevancyMetric(), FaithfulnessMetric()])
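For reference, hypothetical_chatbot is just a stand-in for your own LLM application. A minimal sketch of the interface the test file assumes (a callable that takes the input and returns the generated answer together with the retrieved context), saved as hypothetical_chatbot.py, could look like the following; the retrieval and generation steps are purely illustrative placeholders:

from typing import List, Tuple

def hypothetical_chatbot(input: str) -> Tuple[str, List[str]]:
    # Illustrative placeholder: fetch chunks from your retriever / vector store
    retrieval_context = [f"<chunk retrieved for: {input}>"]

    # Illustrative placeholder: call your LLM to generate an answer
    actual_output = f"<answer generated for: {input}>"

    # The test file expects (actual_output, retrieval_context)
    return actual_output, retrieval_context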
Never use the evaluate() function inside a test function! Stick to assert_test(): evaluate() wasn't built for unit-testing in CI/CD, so using it there means missing out on the CI/CD-specific features assert_test() offers and risks introducing bugs into your codebase. You can learn more about how to use assert_test() here.
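As a quick, self-contained illustration of how assert_test() interacts with Pytest, here is a minimal sketch with a hard-coded test case; it assumes the metrics accept a threshold argument as deepeval's metric constructors do, and the test fails whenever any metric scores below its threshold:

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

def test_single_hardcoded_case():
    # A hard-coded test case, purely for illustration
    test_case = LLMTestCase(
        input="What is your refund policy?",
        actual_output="You can request a refund within 30 days of purchase.",
        retrieval_context=["Refunds are accepted within 30 days of purchase."],
    )
    # assert_test() fails the Pytest test if any metric scores below its threshold
    assert_test(
        test_case,
        [AnswerRelevancyMetric(threshold=0.7), FaithfulnessMetric(threshold=0.7)],
    )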
You should sanity-check yourself by running deepeval test run once before moving on to the next step. To execute test_llm_app.py, run this command in the CLI:
deepeval test run test_llm_app.py -n 2
The -n flag behind deepeval test run is optional and spins up multiple processes to run assert_test() on multiple test cases at once, which is especially useful for speeding up the unit-testing process. To see the full list of available flags, click here.
You should see the same test run being created on Confident AI.
Set Up Your YAML File
To execute your test file in CI/CD pipelines, simply create a YAML file that runs deepeval test run on push and pull requests.
name: LLM Unit/Regression Testing

on:
  push:
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout Code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Install Poetry
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
          echo "$HOME/.local/bin" >> $GITHUB_PATH

      - name: Install Dependencies
        run: poetry install --no-root

      - name: Login to Confident AI
        env:
          CONFIDENT_API_KEY: ${{ secrets.CONFIDENT_API_KEY }}
        run: poetry run deepeval login --confident-api-key "$CONFIDENT_API_KEY"

      - name: Run DeepEval Test Run
        run: poetry run deepeval test run test_llm_app.py
Now each time you push or open a pull request on GitHub, your LLM application will be unit-tested.
Also note that you don't necessarily have to use poetry for installation or follow each step exactly as presented. This is merely an example of what a yaml file that executes a deepeval test run could look like.
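For instance, a pip-based variant of the install, login, and run steps (assuming your dependencies are pinned in a requirements.txt file) might look like this:

      - name: Install Dependencies
        run: pip install -r requirements.txt

      - name: Login to Confident AI
        env:
          CONFIDENT_API_KEY: ${{ secrets.CONFIDENT_API_KEY }}
        run: deepeval login --confident-api-key "$CONFIDENT_API_KEY"

      - name: Run DeepEval Test Run
        run: deepeval test run test_llm_app.py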
Don't forget to supply your Confident AI API key, as otherwise you won't be able to access your datasets or generate new testing reports on Confident AI.
Integrating With GitHub Actions
If you don't already have one, create a .github/workflows directory in your repository and place your unit-testing.yml YAML file there. Now, whenever you make a commit and push the change, GitHub Actions will automatically execute the workflow based on the on: triggers you specified.
What About Logging Models, Prompts, And Others?
In the previous section, we also saw how to log hyperparameters such as models and prompts in the evaluate() function. To log them when using deepeval test run for unit-testing, simply add this to your test file:
...
import deepeval

# You should aim to make these values dynamic
@deepeval.log_hyperparameters(model="gpt-4o", prompt_template="...")
def hyperparameters():
    # Return a dict to log additional hyperparameters.
    # You can also return an empty dict {} if there are no additional parameters to log
    return {
        "temperature": 1,
        "chunk size": 500
    }
This allows deepeval to associate evaluation results with these particular hyperparameters. Lastly, run deepeval test run again to check that they show up on Confident AI:
deepeval test run test_llm_app.py
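If you'd rather not hard-code the model and prompt template, one option is to read them from wherever your application already defines them. The sketch below pulls them from environment variables; the LLM_MODEL and PROMPT_TEMPLATE variable names are assumptions for illustration:

import os
import deepeval

# Assumed environment variables, purely for illustration
MODEL_NAME = os.getenv("LLM_MODEL", "gpt-4o")
PROMPT_TEMPLATE = os.getenv("PROMPT_TEMPLATE", "...")

@deepeval.log_hyperparameters(model=MODEL_NAME, prompt_template=PROMPT_TEMPLATE)
def hyperparameters():
    # Log any additional hyperparameters alongside the model and prompt template
    return {
        "temperature": 1,
        "chunk size": 500
    }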
In the next section, we'll show how you can enable no-code workflows to run evaluations on Confident AI directly.