Creating Scenario Tests

A Validate ScenarioTest is a way to monitor model performance in critical scenarios. Each test is defined on a subset (Slice) of data and can consist of multiple evaluation metrics. Within a test, you can compare model performance either against a baseline model (e.g. on which metrics is model X better than model Y?) or against hard thresholds (e.g. is the IOU of model X > 0.8 on the slice of interest?).

In this guide we'll create a scenario test on a fictional Slice with several evaluation functions and walk through how baseline models and pass/fail thresholds can be set.

Interacting with the Validate SDK

You can use the Validate Python SDK via the NucleusClient.validate module. You set up the SDK exactly as you would when interacting with Nucleus (see the Getting Started section). From there you can list the existing scenario tests (the list will be empty at first) and the available evaluation functions.

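import nucleus

client = nucleus.NucleusClient(YOUR_SCALE_API_KEY)

# Scenario tests that already exist (empty until you create one)
existing_tests = client.validate.scenario_tests

# Evaluation functions that can be attached to a scenario test
eval_functions = client.validate.eval_functions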

Creating a new ScenarioTest

Selecting the slice

We select the Slice we want as the basis of our ScenarioTest data as we would normally select slices from the SDK. You can also find the Slice ID by viewing the slice in the UI and copying the slc_... ID from the URL.

# NOTE: This slice does not exist, please update with a valid
# slice ID from your dataset
pedestrians_slice = client.get_slice("slc_c2dfzaxyr4kh0na1ms")
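
If you copy the whole URL rather than just the ID, a small helper can pull the slc_... token out for you. This is only a sketch: extract_slice_id is a hypothetical helper and not part of the Nucleus SDK, the URL is made up, and the pattern assumes that IDs consist of the slc_ prefix followed by lowercase letters and digits, as in the placeholder above.

```python
import re
from typing import Optional

# Assumption: slice IDs look like "slc_" followed by lowercase letters/digits,
# matching the placeholder ID used in this guide.
SLICE_ID_PATTERN = re.compile(r"slc_[a-z0-9]+")


def extract_slice_id(url: str) -> Optional[str]:
    """Return the first slc_... token found in the URL, or None if absent."""
    match = SLICE_ID_PATTERN.search(url)
    return match.group(0) if match else None


# Hypothetical URL, for illustration only
example_url = "https://example.com/nucleus/ds_c6k9faxtz45009103xz0/slices/slc_c2dfzaxyr4kh0na1ms"
slice_id = extract_slice_id(example_url)
```

The resulting string can then be passed to client.get_slice as shown above.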

Alternatively, you can list the slices associated with a given dataset.

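# NOTE: This dataset does not exist, please update with a valid dataset ID
dataset = client.get_dataset("ds_c6k9faxtz45009103xz0")
# The slices property lists the IDs of the slices created from the dataset
slice_ids = dataset.slices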

Selecting the EvaluationFunction

Validate comes with a growing set of standard evaluation functions, listed on the AvailableEvaluationFunctions object returned by validate.eval_functions.public_functions. Each of them is also available as a member of the client.validate.eval_functions object. If the public evaluation functions don't cover your needs and you need private evaluation functions, please contact us and we'll help you set them up.

We currently support the following evaluation functions:

  • 2D object detection: bounding box precision, recall, IOU, mAP
  • 3D object detection: cuboid precision, recall, IOU (both 3D and bird's-eye-view 2D)
  • Image categorization: categorical F1 score
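
To make the bounding box IOU metric concrete, here is a minimal pure-Python sketch of intersection over union for two axis-aligned 2D boxes. It illustrates the standard definition only and is not Validate's implementation; the (x_min, y_min, x_max, y_max) box layout and the box_iou name are assumptions made for this example.

```python
from typing import Tuple

# Assumed convention: a box is (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 2D boxes."""
    # Overlap rectangle; width and height clamp to zero for disjoint boxes
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    # Two degenerate (zero-area) boxes have no meaningful overlap
    return intersection / union if union > 0 else 0.0


prediction = (0.0, 0.0, 10.0, 10.0)
ground_truth = (5.0, 0.0, 15.0, 10.0)
iou = box_iou(prediction, ground_truth)  # overlap 50 / union 150
```

A prediction that covers half of the ground-truth box, as here, scores an IOU of one third, which is why thresholds such as 0.8 demand a fairly tight fit.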

Defining a ScenarioTest

Finally, we create the ScenarioTest from the pedestrians slice and the IOU and mean average precision evaluation functions. We add two additional metrics in a subsequent step, although all of them could also be passed at once in the list. Each test must contain at least one evaluation function when it is created.

scenario_test = client.validate.create_scenario_test(
    name="Pedestrians on a crosswalk",
    # ... pass the pedestrians slice and the list of evaluation functions here
)

Once set up, the evaluations on the test can be run as described in the Evaluating Scenario Tests page.

Editing an existing ScenarioTest

If you want to further refine the scenario test and get pass/fail insights into model performance, you can either (1) define a baseline model to compare against (requires 2+ models to be uploaded) or (2) define a manual threshold for the evaluation functions. Both approaches are introduced below.
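
Conceptually, both options reduce to a per-metric comparison. The sketch below illustrates that idea in plain Python; it does not use the Validate SDK. The evaluate function, the metric names, and all scores are made up for illustration, and the sketch assumes that higher scores are better and that matching the reference value counts as a pass.

```python
from typing import Dict, Optional


def evaluate(
    scores: Dict[str, float],
    baseline: Optional[Dict[str, float]] = None,
    thresholds: Optional[Dict[str, float]] = None,
) -> Dict[str, bool]:
    """Return pass/fail per metric; assumes higher scores are better."""
    results = {}
    for metric, score in scores.items():
        passed = True
        if baseline is not None and metric in baseline:
            # Option (1): at least as good as the baseline model
            passed = passed and score >= baseline[metric]
        if thresholds is not None and metric in thresholds:
            # Option (2): clears the absolute threshold
            passed = passed and score >= thresholds[metric]
        results[metric] = passed
    return results


# Illustrative scores only, not real evaluation results
candidate = {"iou": 0.82, "map": 0.61}
baseline_model = {"iou": 0.79, "map": 0.65}

vs_baseline = evaluate(candidate, baseline=baseline_model)
vs_threshold = evaluate(candidate, thresholds={"iou": 0.8, "map": 0.6})
```

With these numbers the candidate clears both absolute thresholds but falls behind the baseline on mAP, which shows that the two options can disagree on the same model.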

Defining a baseline model

Use this approach if you want to compare new models relative to an existing baseline model. Once the baseline model is set, all other models are compared against it on every evaluation function attached to the scenario test.

# Start by listing all of your available models in order to pick the baseline
models = client.models
# From the returned models, pick the model_id of the model of choice and run
scenario_test.set_baseline_model("prj_c6rjnmyejnvg078j12r0")  # this model_id won't work, just a placeholder

Setting a pass/fail threshold

If you want to evaluate your model against an absolute threshold for pass/fail decisions, in addition to or instead of comparing it against a baseline model, you can define that threshold using the Validate SDK.

# get the attached evaluation metrics
metrics = scenario_test.get_eval_functions()

# set a threshold for all of them
for m in metrics: