A ScenarioTest is a way to monitor model performance in critical scenarios. Each test is defined on a subset (Slice) of data and can consist of multiple evaluation metrics. Within a test, you can compare model performance either against baseline models (e.g. on which metrics is model X better than model Y?) or against hard thresholds (e.g. is the IoU of model X > 0.8 on the slice of interest?).
In this guide we'll create a scenario test on a fictional Slice with several evaluation functions and walk through how baseline models and pass/fail thresholds can be set.
You can use the Validate Python SDK via the NucleusClient.validate module. You set up the SDK exactly as you would when interacting with Nucleus (see the Getting Started section). You can list existing scenario tests (the list will be empty at first) and the available evaluation functions.
import nucleus

client = nucleus.NucleusClient(YOUR_SCALE_API_KEY)

existing_tests = client.validate.scenario_tests
print(existing_tests)

eval_functions = client.validate.eval_functions
print(eval_functions)
We select the Slice that will serve as the basis of our ScenarioTest just as we would normally select slices with the SDK. You can also find the slice ID by viewing the slice in the UI and copying the slc_... ID from the URL.
# NOTE: This slice does not exist, please update with a valid
# slice ID from your dataset
pedestrians_slice = client.get_slice("slc_c2dfzaxyr4kh0na1ms")
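If you copy the whole URL rather than just the ID, a small helper can pull the ID out for you. This is only a sketch: the URL below is hypothetical, and the pattern assumes slice IDs are `slc_` followed by lowercase letters and digits, as in the placeholder ID used in this guide.

```python
import re
from typing import Optional

# Assumption: slice IDs look like "slc_" plus lowercase letters and digits,
# matching the placeholder ID shown in this guide.
SLICE_ID_PATTERN = re.compile(r"slc_[a-z0-9]+")


def extract_slice_id(url: str) -> Optional[str]:
    """Return the first slice ID found in the URL, or None if there is none."""
    match = SLICE_ID_PATTERN.search(url)
    return match.group(0) if match else None


# Hypothetical URL, for illustration only; the real UI URL layout may differ.
example_url = (
    "https://example.com/nucleus/ds_c6k9faxtz45009103xz0"
    "/slices/slc_c2dfzaxyr4kh0na1ms?tab=items"
)
print(extract_slice_id(example_url))
```

The returned string can then be passed to client.get_slice(...) as in the snippet above.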
Alternatively, you can list the slices associated with a given dataset.
# NOTE: This dataset does not exist, please update with a valid dataset ID
dataset = client.get_dataset("ds_c6k9faxtz45009103xz0")
dataset.info
Validate comes with a growing set of standard evaluation functions, which are listed on the AvailableEvaluationFunctions object returned by validate.eval_functions.public_functions. These can all be found as members of the client.validate.eval_functions object. If the public evaluation functions don't meet your needs and you need private evaluation functions, please contact us and we'll help you set them up.
We currently support the following list of evaluation functions:
- 2D object detection: bounding box precision, recall, IoU, mAP
- 3D object detection: cuboid precision, recall, IoU (both 3D and bird's-eye-view 2D)
- Image categorization: categorical F1 score
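To make the IoU metric concrete, here is a minimal, self-contained sketch of intersection over union for two axis-aligned 2D boxes. It illustrates the standard definition only; it is not the implementation Validate uses, and the function name `iou_2d` and the `(x_min, y_min, x_max, y_max)` box layout are assumptions made for this example.

```python
from typing import Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou_2d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap region, clamped at zero for disjoint boxes
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection

    # Guard against degenerate (zero-area) inputs
    return intersection / union if union > 0 else 0.0


prediction = (0.0, 0.0, 2.0, 2.0)
ground_truth = (1.0, 1.0, 3.0, 3.0)
print(iou_2d(prediction, ground_truth))  # overlap 1, union 7, so 1/7
```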
Finally, we create the ScenarioTest with the pedestrians slice and the IoU and mean average precision evaluation functions. We add two additional metrics (precision and recall) in a subsequent step; note that all of them could have been passed at once in the list. Each test must contain at least one evaluation function upon instantiation.
scenario_test = client.validate.create_scenario_test(
    name="Pedestrians on a crosswalk",
    slice_id=pedestrians_slice.id,
    evaluation_functions=[
        client.validate.eval_functions.bbox_iou(),
        client.validate.eval_functions.bbox_map(),
    ],
)

scenario_test.add_eval_function(client.validate.eval_functions.bbox_precision())
scenario_test.add_eval_function(client.validate.eval_functions.bbox_recall())

scenario_test.get_eval_functions()
Once set up, the evaluations on the test can be run as described in the Evaluating Scenario Tests page.
If you want to further refine the scenario test and get pass/fail insights into model performance, you can either (1) define a baseline model to compare against (requires two or more uploaded models) or (2) define a manual threshold for the evaluation functions. Both approaches are introduced below.
Use this approach if you want to compare your new models against an existing baseline model. After the baseline model is set, all other models are compared against it on every evaluation function attached to the scenario test of interest.
# Start by listing all of your available models in order to pick the baseline
client.list_models()

# From the returned models, pick the model_id of the model of choice and run
scenario_test.set_baseline_model("prj_c6rjnmyejnvg078j12r0")  # this model_id won't work, just a placeholder
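Conceptually, a baseline comparison asks, metric by metric, whether a candidate model does at least as well as the baseline. The sketch below illustrates that idea in plain Python. It is not how Validate computes its results: the helper `compare_to_baseline` is hypothetical, the scores are made up, and it assumes that a higher score is better for every metric.

```python
from typing import Dict


def compare_to_baseline(
    baseline: Dict[str, float], candidate: Dict[str, float]
) -> Dict[str, bool]:
    """For each metric present in both models, report whether the candidate
    matches or beats the baseline (assumes higher is better)."""
    return {
        name: candidate[name] >= baseline[name]
        for name in baseline
        if name in candidate
    }


# Made-up scores, for illustration only
baseline_scores = {"bbox_iou": 0.71, "bbox_map": 0.55}
candidate_scores = {"bbox_iou": 0.75, "bbox_map": 0.50}

comparison = compare_to_baseline(baseline_scores, candidate_scores)
print(comparison)
```

Here the candidate improves on IoU but regresses on mean average precision, which is the kind of per-metric trade-off a baseline comparison is meant to surface.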
If you want to evaluate your model against an absolute threshold for pass/fail decisions, in addition to or instead of comparing against a baseline model, you can define this threshold using the Validate SDK as well.
# get the attached evaluation metrics
metrics = scenario_test.get_eval_functions()

# set a threshold for all of them
for m in metrics:
    m.set_threshold(0.6)
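As a rough mental model of what a threshold does, the following sketch checks made-up metric scores against the 0.6 threshold used above. The helper `check_thresholds` is hypothetical, and the sketch assumes that a metric passes when its score meets or exceeds the threshold; how Validate treats the boundary case may differ.

```python
from typing import Dict


def check_thresholds(
    scores: Dict[str, float], thresholds: Dict[str, float]
) -> Dict[str, bool]:
    """Mark each metric as passed if its score is at or above its threshold."""
    return {name: scores[name] >= thresholds[name] for name in thresholds}


# Made-up scores, for illustration only
scores = {"bbox_iou": 0.82, "bbox_map": 0.58, "bbox_precision": 0.60}
# Same threshold for every metric, mirroring the loop above
thresholds = {name: 0.6 for name in scores}

results = check_thresholds(scores, thresholds)
overall_pass = all(results.values())
print(results, overall_pass)
```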