Description

Smart Sample lets you filter and sample a dataset down to a highly relevant subset. Fine-tune the selection by applying filters on your Metadata and Autotags, then sample down the result by removing near-duplicates, selecting at random, or ranking by a metadata field.

Prerequisites: a dataset with embeddings

Steps

1. Open an indexed dataset.
2. Enter the Smart Sample menu by clicking "Smart Sample" in the top right.
3. Add Metadata and/or Autotag filters (see Filters). This step is optional.
4. Apply sampling if the Item Count (after filters, if any) is still too high.
   a. Set your desired target number. This must be less than the Item Count after filters.
   b. Choose a sampling strategy (see Sampling Strategies).
5. Click "Create Slice" in the bottom right and name the finished Slice.

Filters

Choose a filter option in "Add Filter".
For an Autotag filter, choose the range of Autotag scores to keep. The lower the score, the less relevant the item is to the Autotag.
For a Metadata filter, choose the image metadata key you want to compare against and set a value.
When your filter is valid, the item count on the right-hand side of the filter is updated. The new value is the number of items that remain after applying this filter and all the filters above it.
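The cumulative item count described above can be illustrated with a small sketch. This is not the Nucleus implementation or SDK; the item layout (`metadata` and `autotags` dictionaries), the helper names, and the use of an equality comparison for Metadata filters are all assumptions made for illustration.

```python
from typing import Any, Callable

Item = dict[str, Any]
Filter = Callable[[Item], bool]


def autotag_filter(tag: str, low: float, high: float) -> Filter:
    """Hypothetical helper: keep items whose score for `tag` is within [low, high]."""
    def keep(item: Item) -> bool:
        score = item.get("autotags", {}).get(tag)
        return score is not None and low <= score <= high
    return keep


def metadata_filter(key: str, value: Any) -> Filter:
    """Hypothetical helper: keep items whose metadata `key` equals `value`."""
    def keep(item: Item) -> bool:
        return item.get("metadata", {}).get(key) == value
    return keep


def apply_filters(items: list[Item], filters: list[Filter]) -> tuple[list[Item], list[int]]:
    """Apply filters top to bottom; record the item count after each one."""
    counts = []
    for f in filters:
        items = [it for it in items if f(it)]
        counts.append(len(items))
    return items, counts


# Made-up example data.
items = [
    {"id": "a", "metadata": {"weather": "rain"}, "autotags": {"night": 0.9}},
    {"id": "b", "metadata": {"weather": "rain"}, "autotags": {"night": 0.2}},
    {"id": "c", "metadata": {"weather": "sun"}, "autotags": {"night": 0.8}},
]
kept, counts = apply_filters(
    items,
    [metadata_filter("weather", "rain"), autotag_filter("night", 0.5, 1.0)],
)
print(counts)                     # [2, 1]
print([it["id"] for it in kept])  # ['a']
```

Each entry in `counts` corresponds to the number shown next to one filter: it reflects that filter combined with every filter above it.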

Sampling Strategies

Note: Sampling strategies are applied after all filters (if any) have been applied; that is, Nucleus samples from the filter results.
Sampling is only applied when the target number is less than the Item Count after filters.
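The ordering rule above (filter first, then sample only if the target is smaller than the filtered count) can be sketched as a small guard function. The names `maybe_sample` and `first_n` are hypothetical and stand in for whichever strategy is chosen; this is an illustration, not Nucleus code.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
Strategy = Callable[[Sequence[T], int], list[T]]


def maybe_sample(filtered: Sequence[T], target: int, strategy: Strategy) -> list[T]:
    """Run `strategy` only when `target` is below the filtered item count."""
    if target >= len(filtered):
        # Nothing to sample down: the filter results are returned unchanged.
        return list(filtered)
    return strategy(filtered, target)


def first_n(items: Sequence[T], n: int) -> list[T]:
    """Placeholder strategy used only for this example."""
    return list(items[:n])


print(maybe_sample([1, 2, 3], 5, first_n))  # [1, 2, 3]
print(maybe_sample([1, 2, 3], 2, first_n))  # [1, 2]
```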

Uniqueness Sampling

Removes any near-duplicates and samples down to the most representative subset of the filter results.

Similar to active learning, Nucleus uses embeddings to curate the subset of data that would best improve model performance. The default embeddings are based on CLIP, but we highly recommend supplying your own custom embeddings.
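The exact algorithm Nucleus uses is not described here. As a rough intuition for how embeddings can drive near-duplicate removal, the sketch below uses greedy farthest-point sampling with cosine distance, a common stand-in technique chosen for illustration only. The function names and the toy two-dimensional embeddings are assumptions.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def farthest_point_sample(embeddings: Sequence[Vector], target: int) -> list[int]:
    """Greedily pick indices of `target` embeddings that are far from each other."""
    if target <= 0:
        return []
    if target >= len(embeddings):
        return list(range(len(embeddings)))
    selected = [0]  # arbitrary starting point (an assumption of this sketch)
    # Distance from every item to its nearest already-selected item.
    min_dist = [cosine_distance(e, embeddings[0]) for e in embeddings]
    while len(selected) < target:
        nxt = max(range(len(embeddings)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], cosine_distance(e, embeddings[nxt]))
    return selected


# Items 0 and 1 are near-duplicates, as are items 2 and 3.
embeddings = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0], [0.01, 0.999]]
print(farthest_point_sample(embeddings, 2))  # [0, 2]
```

In this toy example one item from each near-duplicate pair is kept. Better embeddings place visually or semantically similar items closer together, which is why the choice of embeddings matters for this strategy.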

Random Sampling

Randomly selects from the filter results until the target number is reached.
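A minimal sketch of random sampling without replacement, using Python's standard `random` module. The function name and the optional `seed` parameter are assumptions added for reproducibility in this example; the Smart Sample menu is not stated to expose a seed.

```python
import random
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def random_sample(filtered: Sequence[T], target: int, seed: Optional[int] = None) -> list[T]:
    """Pick `target` distinct items uniformly at random from the filter results."""
    if target >= len(filtered):
        return list(filtered)
    rng = random.Random(seed)  # local generator so the global state is untouched
    return rng.sample(list(filtered), target)


filter_results = [f"item_{i}" for i in range(100)]
subset = random_sample(filter_results, 10, seed=7)
print(len(subset))  # 10
```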

Highest and Lowest Sampling

Sorts the filter results based on a metadata field of your choosing, then selects starting from the top/highest or bottom/lowest until the target number is reached.

Null metadata values will be selected last in both cases.
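The null-last behavior can be sketched as follows. The item layout and the function name are hypothetical, and treating a missing key the same as a null value is an assumption of this example.

```python
from typing import Any

Item = dict[str, Any]


def sort_and_take(items: list[Item], key: str, target: int, highest: bool = True) -> list[Item]:
    """Sort by a metadata field and take `target` items, placing nulls last."""
    def value(it: Item) -> Any:
        return it.get("metadata", {}).get(key)

    present = [it for it in items if value(it) is not None]
    missing = [it for it in items if value(it) is None]
    present.sort(key=value, reverse=highest)
    # Nulls are appended after the sorted values in both directions.
    return (present + missing)[:target]


items = [
    {"id": "a", "metadata": {"score": 3}},
    {"id": "b", "metadata": {"score": None}},
    {"id": "c", "metadata": {"score": 1}},
    {"id": "d", "metadata": {"score": 2}},
]
top = sort_and_take(items, "score", 2, highest=True)
bottom = sort_and_take(items, "score", 4, highest=False)
print([it["id"] for it in top])     # ['a', 'd']
print([it["id"] for it in bottom])  # ['c', 'd', 'a', 'b']
```

Item `b` is only selected once every item with a non-null value has been taken, regardless of sort direction.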