Updating Metadata for Scale-Imported Data

In this tutorial, we'll discuss how to update metadata for DatasetItems that were imported from a Scale labeling project.

At a high level, the only requirement is a mapping from filepath to metadata fields to update. This filepath should be the same one initially used to upload your data to the Scale labeling project, e.g. s3://path/to/file.jpg. The metadata fields to update should take the form of a Python dict, e.g. {"color": "red", "new_field": "new_value"}.

📘 You can either add new metadata fields or update existing metadata field values.

If the field (key in the metadata dict) does not yet exist, the Nucleus API will append the new key-value pair as a queryable metadata field to the item.

If the field already exists on the item, the Nucleus API will replace the old value with the newly supplied value.

Currently, the Nucleus API does not support deletion of metadata fields.
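Conceptually, this behaves like Python's dict.update applied to each item's existing metadata: existing keys are overwritten, new keys are appended, and untouched keys are preserved. A minimal illustration (the field names and values here are hypothetical):

existing_metadata = {"color": "red", "camera": "front"}
new_fields = {"color": "blue", "new_field": "foo"}

# Existing keys are replaced, new keys are appended, and keys
# absent from new_fields are left untouched (no deletion).
existing_metadata.update(new_fields)
print(existing_metadata)
# {'color': 'blue', 'camera': 'front', 'new_field': 'foo'}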

Suppose we want to attach a metadata field color, which can take on the values red, blue, yellow, pink, or green, to each image in our Dataset (which was imported from Scale).

  1. In your own code, construct a mapping from each image filepath to the metadata fields you wish to add or update, e.g.:
    filepath_to_metadata = {
        "s3://some/path/image_0.jpg": {"color": "red"},
        "s3://some/path/image_1.jpg": {"color": "blue"},
        ...
    }
    
  2. Iterate through your DatasetItems via the Nucleus API (see the API Reference).
  3. Retrieve the image_location for each DatasetItem (or pointcloud_location for pointclouds).
  4. Use the mapping from (1) to construct a new Python dict mapping each DatasetItem.reference_id to its dict of metadata fields to update, e.g.:
    refid_to_metadata = {}
    for item in dataset.items_generator():
        refid_to_metadata[item.reference_id] = filepath_to_metadata[item.image_location]
    
    The reference ID -> new metadata dict mapping should look like this:
    >>> print(refid_to_metadata)
    {
        "61e878916666940043f06d20": {"color": "red"},
        "61e878916666940043f06d21": {"color": "blue"},
        ...
    }
    
  5. Use Dataset.update_item_metadata and pass in the dict from (4).

Below is an example of the full pipeline code to update metadata for images imported from a Scale labeling project:

import nucleus

# === Step 1 ===
# Construct mapping: filepath -> dict of metadata values to add/update (e.g. color)
filepath_to_metadata = {
  "s3://some/path/image_0.jpg": {"color": "red"},
  "s3://some/path/image_1.jpg": {"color": "blue"},
  "s3://some/path/image_2.jpg": {"color": "yellow"},
  "s3://some/path/image_3.jpg": {"color": "red", "new_field": "foo"},
  "s3://some/path/image_4.jpg": {"color": "pink", "new_field": "bar"},
}
# alternatively, you can define a function
def get_new_metadata_for_filepath(filepath: str) -> dict:
  """Fetches the corresponding metadata fields for a given filepath."""
  pass


# === Steps 2-4 ===
client = nucleus.NucleusClient("YOUR_SCALE_API_KEY")
imported_dataset = client.get_dataset("YOUR_DATASET_ID")

refid_to_metadata = {
  item.reference_id: filepath_to_metadata[item.image_location]
  for item in imported_dataset.items_generator()
}


# === Step 5 ===
imported_dataset.update_item_metadata(refid_to_metadata)
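To spot-check that the update took effect, you can re-export a handful of items and print their metadata (a minimal sketch reusing the imported_dataset handle from above):

# === Optional verification ===
# Re-fetch items and confirm the new metadata fields are present.
for i, item in enumerate(imported_dataset.items_generator()):
  print(item.reference_id, item.metadata)
  if i >= 4:  # inspect only the first few items
    break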

Note that DatasetItems imported from Scale tasks will have reference_id equal to the Scale task_id (an arbitrary hash, e.g. 61e878916666940043f06d20). This means you can also retrieve the same reference ID (task ID) -> filepath mapping via the Scale labeling APIs, rather than exporting DatasetItems through Nucleus.

You can then compose this reference ID -> filepath mapping with your filepath -> new metadata mapping to form the desired mapping: reference ID -> new metadata for use with Dataset.update_item_metadata.
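For example, the following sketch composes the two mappings using the scaleapi Python client. Treat it as an assumption-laden outline: YOUR_PROJECT_NAME is a placeholder, and the exact get_tasks signature and task field access may vary across scaleapi versions.

import scaleapi

scale_client = scaleapi.ScaleClient("YOUR_SCALE_API_KEY")

# Reference ID (task ID) -> filepath, taken straight from the Scale tasks.
taskid_to_filepath = {
  task.id: task.params["attachment"]
  for task in scale_client.get_tasks(project_name="YOUR_PROJECT_NAME")
}

# Compose with filepath -> new metadata to get the desired mapping.
refid_to_metadata = {
  task_id: filepath_to_metadata[filepath]
  for task_id, filepath in taskid_to_filepath.items()
}
imported_dataset.update_item_metadata(refid_to_metadata)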